FOSDEM 2024 Transcribed / Subtitled by Whisper

Where have the women of tech history gone?
Good morning everyone. Hope everyone is settling down. We can get started with our first talk. Our first talk is "Where have the women of tech history gone?". Our speaker is Laura Durieux. She has been a developer for six years and was awarded at WorldSkills Belgium in the Web Technologies category. She has been doing monthly YouTube live discussions on the latest developments in the tech industry. Additionally, she has also started a new career in France. The talk is mostly about where the women of tech history have gone: Ada Lovelace, Hedy Lamarr, the ENIAC girls, Grace Hopper, Joan Clarke. Stemming from the role of calculator, the profession of developer was initially considered a woman's job, while hardware design was seen as a man's job. However, who are these women who have shaped the world in tech? Why don't we hear more about them? With Laura, we'll attempt to set the record straight bit by bit and provide the role models in tech you've always needed. Thank you. Hi, can you hear me right? Hi everyone, thank you so much for coming today. I just wanted to say that at first I tried to do my talk on Sunday, because this is way too much for me to handle. Please be kind to me, thank you so much. We're going to talk today about the women in tech history. First of all, I wanted to tell you a little anecdote that happened to me when I was in college. During my first year I had an Art History class, and I was kind of sad to see there were at most two women represented. I decided to send an email to my teacher and ask him why he presented so few women. He answered kindly, and honestly, that he didn't have enough time to add more artists to his syllabus. Because of that, some students may not have the required basis for their future career. We can think about students in illustration, in painting, art, etc. At first I didn't really pay attention to it, I didn't really see the huge problem behind this.
Then, when I started to realize that this is kind of weird, that this is not normal, this is not fair, I had two questions in my mind. The first one is: why are women not considered part of the required basis? Why do they get less of a place than men? The second one is: who is the person, or the group of people, who decides that someone deserves more than another to be in a syllabus? Spoiler alert, I don't have the answer to this question; I have ideas, I have theories. This is not the aim of this talk, but I hope you can think about this question yourself. What I can do is pay tribute and give a place to women who did fantastic work to revolutionize the computer science field. This is something I have had in my mind for many years, so in fact it's only natural that I'm here today in front of you to speak about that. The problem is present in the majority of fields, but today we're going to concentrate and talk only about computer science, the reason we are all here today. Personally, if you go home and you remember two names of women you learned about today, it's a huge win for me. What about you? Do you know some names of women in tech history? Ada Lovelace. Kathleen Booth. Margaret Hamilton. Belinda Pearson. Oh sorry, I can't hear. Belinda Pearson. Belinda Pearson, yeah, that's true. I didn't think you would know that one. Okay, I got a lot of names, that's really nice. Okay, thank you so much for that. So let's go discover together the stories through computer science history. And for that we need to go back in time, and we're going to begin at the Age of Enlightenment. So the ancestors of computing machines were human computers, especially in the astronomy field. Basically, computer was a job. It was about mathematical calculations, and very often the job was divided: the computers were split into groups to compute long and difficult calculations.
And the job was done in a way that the calculations were executed at the same time, in parallel. And I wanted to talk to you about that because this is really funny: still today, this is something we look at in our computers, how many operations my computer can execute at the same time. And it was already a way of working that people created a long time ago. So, like every profession, it was dominated by men. However, the first woman to be quoted in articles about computer science history is Nicole-Reine Lepaute, for the French speakers. So she is one of the most famous astronomers of the Age of Enlightenment. And she is famous because, with two other men, she calculated the return date of Halley's Comet for April 13, 1759, almost exactly right, as it returned on March 13 of the same year. So I don't know if you realize: we are in the 18th century, and they calculated by hand the return date of the comet with only one month of error. It's really amazing. Maria Mitchell also made a splash for discovering the first telescopic comet, which means it's invisible to the naked eye. It would be named after her, and she would receive a gold medal for this achievement. So during the 19th century there were a few barriers and contradictions regarding women in the scientific fields. Despite the fact that they had access to degrees, they were forced to resign as soon as they got married. A kind reminder that a woman who was not married at that time didn't exist in the eyes of society. So yeah. The history of computer science starts in 1840 with a woman that you obviously know, and if you don't know her you should ask yourself some serious questions. Who's that Pokémon? Well, of course it's Ada Lovelace. So I think that everyone in this room knows who Ada Lovelace is. But for me she is not only the first programmer; this is my opinion and I wanted to talk with you about that.
So for that I need to explain something to you. Charles Babbage is the person who built the Difference Engine and the Analytical Engine. However, he was messy, and he couldn't stand back from the machine he was building. He had ideas, but he didn't have a concept that embraced his machine. Hence the arrival of our sweet and dear Ada. She invented the concepts behind the Analytical Engine by providing the first algorithms. And ladies and gentlemen, computer science was born. So this is why I think that Ada Lovelace isn't just the first programmer: she is the mother of computer science, by giving us these first algorithms. And by the way, you can find the first notions of loops and functions in these algorithms. Despite this extraordinary invention, it was way too innovative for that time. I remind you, we are in 1840. So it was way too innovative, and the Analytical Engine was forgotten for lack of funding, before being rediscovered in 1937 to inspire the Mark I, the first general-purpose electromechanical computer. But let's take it easy. Alright, we are at the end of the 19th century, and Edward Charles Pickering is the founder of a group of women called the Harvard Computers. These women catalogued over 10,000 stars and developed a system to describe them. But one particular woman stood out: Annie Jump Cannon. She pioneered, this is hard to remember this one, she pioneered a new spectral type classification system and developed the Harvard classification scheme, which is the basis of the system still used today. Between 1911 and 1915 she classified over 5,000 stars a month, at a rate of one star per 3 seconds. I don't know what you can do in 3 seconds. I mean, I can chug a beer in 3 seconds, but that's all I can do, right? Okay girl, you have my respect.
And in the 19th century, the growth of industries opened up opportunities for women to join the field of technology. One notable woman, Grete Hermann, made significant contributions with her advanced work in mathematics and physics. She played a key role with her early philosophical work on the foundations of quantum mechanics. And in the 1920s, her doctoral thesis laid the groundwork for computer algebra: it first established the existence of algorithms for many of the basic problems of abstract algebra. So we are going to see a little more computing here, I promise it's coming. About computer algebra: I don't know if you know it; there is a definition over there if you want to look at it afterwards. So between the 1940s and the 1970s, women were widely hired as coders, and there are a number of reasons. The first one is that coding, programming, was an emerging field, so you didn't need a diploma to be hired; new hires only had to pass a straightforward logic test to work in a computer science job. Another factor was that, despite the fact that women had diplomas and degrees in scientific fields, they faced a lot of challenges, like finding a job or even advancing in their career. So they turned to opportunities in the IT field. The last one is the shortage of manpower during this time, and the fact that women cost very little. Grace Hopper. So during World War II, Grace Hopper, a 36-year-old mathematician, decided to serve her country. This is very American; I'm sorry for the Americans over there. She decided to leave her job, her teaching position at Vassar College, to enroll in the US Navy, expecting to decode enemy messages and serve her country. Surprisingly, the US Navy sent her to Harvard, where she became the third programmer of the Mark I. If you remember, earlier I mentioned the Analytical Engine and how it was rediscovered to inspire Howard Aiken to create the Mark I in 1937.
Well, the Mark I is a versatile, punch-card, programmable calculator, and it was Grace who had the honor, or rather the heavy burden, of taming this machine. She wrote its 521-page user manual from scratch, without any help from anybody. Like they said: okay, this is the machine, go figure it out yourself, see you next time. Okay. So this is really impressive to know, and with her work she was engaged in top-secret calculations crucial to the war effort, involving tasks like determining rocket trajectories, generating range tables for new anti-aircraft guns, and calibrating minesweepers. Now look at your computer. Look how easy it is to code. Now imagine doing this with a big, big computer, doing this all day long, and all night long too. This is not the right page. Yeah, this is. We continue in the history and we are in the 1940s, and these years mark a milestone in the history of computing: the first fully electronic computer, the ENIAC. It was developed to automate and speed up the work of calculators and computers, who were at first humans. Right. But even if it was faster, it still needed human intervention, from a person called the operator. And this job was largely performed by women. So the operator is the person who would enter questions manually into the machine through switches and cables. So you have a little overview there. Can you see it? Well, it's kind of dark, I'm sorry about that. Yeah, you have a lot of cables over there. And six astounding women, Kathleen, Marlyn, Betty, Frances, Betty and Ruth, were the first six ENIAC programmers, and by extension the first programmers. They had to install and assemble this machine. You have to know that the operator was the programmer of today. And even though it was the programmer of today, at that time the job didn't receive a lot of credit. And it was very often belittled, because it was performed by women, and hardware was seen as the main job.
Yet the line between these two jobs wasn't really clear-cut, because women, the operators, needed to have some or even in-depth hardware knowledge to do it, to control and program these machines. Because this is still hardware: we didn't have graphical interfaces or things like that. You needed to touch the hardware, to use the cables, the switches. So this is where we see there is a big difference between a job description and what these women really had to do. Hello. I have a little anecdote. So first of all, ENIAC, for those who don't know, means Electronic Numerical Integrator and Computer. All these six women had a mathematics degree in common. They were responsible for installing and assembling the ENIAC. And the most important thing: they were the ancestors of the debugger. So look again at this machine and imagine you have a bug, but you don't know where it is. So they were six, they were a group, so they had to work together to try to understand where a bug came from, and why it was a bug. So they created a system for working together as a debugger when there is a bug. And this is quite impressive. I don't know if there are people in this room who have already seen a machine like that or not. Yeah, okay. That's so nice. I'm jealous. So now we are in 1942, and a significant innovation emerged, unintentionally, driven by Hedy Lamarr, a renowned movie star. To understand what happened, we need to rewind a little bit and delve into her background. So Hedy Lamarr is really famous for her role in the first non-pornographic film featuring a nude orgasm scene, which really made people go, oh my God, oh my God. And she is also recognized as the inspiration for the face in Disney's animated film Snow White. I don't know if that made sense. So, yeah. But she was facing a troubled marriage, and Lamarr decided to flee from Austria. But she had a really interesting alter ego: she was super duper into war technologies and advancements.
Well, she was influenced by her former husband, who was a prominent Austrian arms manufacturer. And during that time, she crossed paths with a pianist named George Antheil. And together they invented a top-secret communication system for radio-controlled torpedoes called, if I remember, Frequency Hopping Spread Spectrum. Is it right? Yes, it is right. Okay, aka FHSS. Thank you, thank you. Okay, let me correct this. So they patented this idea in 1942, and what is surprising, what is really awesome, is to see that this technology is still in use today. And for all those who are on social networks right now, on the web, you can thank Hedy Lamarr: it's because of her that we have Wi-Fi and Bluetooth today. And a little thing that I have to say is that when it comes to unusual career changes, I think we are reaching new heights. At the same time, a new way of thinking emerged in the 50s. Programming was evolving way faster than hardware, which is still the case today. And so they began to think, because they had to start optimizing their algorithms. And this led to an image of the singular creative genius wielding a form of black magic. And with that, the first stereotype of the programmer emerged: the white, hairy, antisocial man. And even if this belongs more to the realm of fantasy, studies in the 60s showed that it was a profile sought after, and more easily hired, by companies. So, you thought you were done with Grace Hopper? Now she's back. And you have to know that after the war, she worked on the UNIVAC, the most powerful computer at that time. And when she was put in charge of the automatic programming department, she had the idea of the compiler. So this person there, she saved our lives, because now our computers can understand languages that we can read. We don't have to write zeros and ones, or very low-level languages.
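Coming back for a second to the frequency-hopping patent mentioned above, the core trick can be sketched in a few lines. This is only a minimal illustration, not Lamarr and Antheil's actual piano-roll design with 88 channels; the channel values, the seed string, and the function name are all invented for the example.

```python
import random

# Example channel list; the frequency values here are arbitrary.
CHANNELS = [2.402, 2.426, 2.450, 2.474, 2.480]

def hop_sequence(shared_seed, n_slots, channels=CHANNELS):
    """Derive a pseudo-random channel schedule from a shared secret seed.

    Transmitter and receiver each run this with the same seed, so they
    retune to the same channel in every time slot, while a jammer who
    doesn't know the seed only ever sees short bursts on each channel.
    """
    rng = random.Random(shared_seed)
    return [rng.choice(channels) for _ in range(n_slots)]

transmitter = hop_sequence("shared-secret", 8)
receiver = hop_sequence("shared-secret", 8)
assert transmitter == receiver  # both ends stay in sync
```

The point of the technique is that the hopping schedule itself is never sent over the air; only the seed is agreed in advance, which is why the patent framed it as a secret communication system resistant to jamming.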
So thank you, thank you Grace Hopper for that. And as the idea was revolutionary, she started to observe that every manufacturer, every brand of computer, started to develop its own compiler. So in 1959, she saw the potential chaos that could result. She decided to call on her old Navy connections to organize a meeting with every manufacturer in the country. And when they came out of the meeting, they had all agreed on a simple universal language. The Common Business-Oriented Language, or COBOL, was invented, and it is still in use today in banks. Who knows COBOL? Who can code in COBOL? Here, some people, not a lot, okay. Are you happy with that? Okay, that's nice, thank you. So I have two little anecdotes about Grace Hopper. I mean, who didn't know Grace Hopper before coming today? You're going to love her, okay? I mean, we already love her, but the first anecdote I have about her is that she was also the person who thought about software portability. Before, we had to rewrite every program for every computer. And she had the idea: why couldn't we compile the code so we could move software between computers without having to rewrite it? Thank you Grace, thank you so much, oh my god. And the second thing, which is a little bit funny, is that she is the one who decided to call the process of writing instructions "coding". And it's funny to know that this term was replaced by "programming", because, you know, it came from a woman, so no coding, we're going to say programming. Today it is coming back into our vocabulary, and today it's way cooler to say coding than programming. Okay, now look at this graph. This is the percentage of women majors by field: medical school, law school, physical sciences and computer science.
And what we can see, I was going to speak in French, what we can see is that there is a kind of rupture between women and computer science between 1980 and 1995. So this is a big question, and I think that if you are interested in women in computer science, you have already heard about this, about what happened and why. This is not the aim of this talk, but I think it's still important to mention it. There are a lot of reasons, a lot of theories about it. And I really invite you to discuss it with people, older people, younger people, and to see what can be done to try to make this curve go up again, really higher. But one of the reasons I found when I did my research is the arrival of the personal computer in 1981. Woohoo, PC. Before the PC, the thing is that university students had little to no exposure to computers, because they were rare, expensive and, oh my god, the size of a house. So students were on a relatively equal footing. However, with the introduction of the PC, a new stereotype emerged, and I love this one. This is a joke. The perception arose that to be a proficient programmer, you have to spend countless hours obsessively on a computer, which is still the case today. This led to the notion of the "real programmer", who sported a computer-screen tan from constant screen time. This is my case. I don't know if I'm good, but this is my case, sadly. The funny thing is that many men in the business didn't even fit the stereotype, and this could be a little bit controversial. However, for the women it was different. You couldn't apply this kind of stereotype to women, because either they were not tough enough, or they were too tough and then annoying. So many women began to doubt their ability to code and dropped out of school.
And the last thing I have to say about that is the fact that when households acquired a PC, a personal computer, it was mostly put in the boys' room, with the father taking a coaching role and trying to push his son to explore programming. Did people here live that? Or not? Yeah? Okay. Okay. And this is one of the multiple reasons why a gender gap began. It's not the only one. I'm not saying that, because people after my conference were like, no, this is not the only reason. No, I know, I didn't say that. I'm sorry. And so before, I said they were on a relatively equal footing, and with this they weren't, because the girls weren't pushed, not all of them, there are exceptions, all right? But a majority of girls weren't pushed to try the computer or programming. And so in the end, before university, the boys were more experienced than the women. So today we hear every day about ChatGPT and AI. That's so cool. I'm sick of it. Thank you. Thank you. That's cute. During my research, I discovered several women who have advanced the field of artificial intelligence, including Alice Recoque and Karen Spärck Jones. And today we're going to speak about Karen Spärck Jones, because I had to make a choice. A scientist and researcher in computer science, Karen Spärck Jones' work focused on natural language processing, or NLP, and information retrieval. So this is a good anecdote to drop when you are at a party with your programming friends, you know, to seem intelligent, smart. She developed TF-IDF. I don't know if people know that. Perfect. Yes, some of you. Okay, nice. So this is term frequency-inverse document frequency. And if you may let me read this, because it's impossible to recite by heart, because this is not my field: this is a weighted relevance measure that is still used today by most search engines. And it's an important tool for SEO.
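Since the measure is simple enough to fit on a slide, here is a minimal sketch of the textbook form of TF-IDF. Note there are several common weighting variants in practice, and the tiny corpus and function name below are invented purely for illustration.

```python
import math

def tf_idf(term, doc, corpus):
    """Textbook TF-IDF for one term in one document.

    tf  = how often the term appears in this document (its "physical presence")
    idf = how rare the term is across the whole corpus (its general "weight")
    Real implementations smooth idf so terms absent from the corpus
    don't divide by zero; this sketch assumes the term occurs somewhere.
    """
    tf = doc.count(term) / len(doc)
    n_containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / n_containing)
    return tf * idf

corpus = [
    "the comet returns in march".split(),
    "the engine computes the tables".split(),
    "grace wrote the compiler".split(),
]
# "the" appears in every document, so idf = log(3/3) = 0 and its score vanishes;
# "comet" appears in only one, so it scores as a relevant keyword there.
assert tf_idf("the", corpus[0], corpus) == 0.0
assert tf_idf("comet", corpus[0], corpus) > 0.0
```

Words that appear everywhere drop out, while rare words that are frequent in one document score high; that is exactly the "weighted relevance" idea that search engines build on.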
So if you are web developers, it's kind of important to know it. And this is the woman who developed it. This method combines the physical presence of a word in a text with the weight of its importance in general. It makes it possible to define the relevance of a specific keyword in a text. So finally, this is kind of what ChatGPT does to understand what you're saying when you are writing a prompt. I don't know. Oh my God. What did I say? And then after, she decided to work with Margaret Masterman, and they wanted to set themselves a little challenge. So she decided to program a computer to understand words with multiple meanings. And the result of that was a dictionary of synonyms. Karen published an article in 1964 that is considered a fundamental document, the foundation of the field of natural language processing. I think that if you are interested in that, if you are working, if you are coding in this field, or just interested, it could be really nice to read more about her and to let people know about her work. Her ideas were little appreciated at that time, but they are implemented today and continue to inspire. Okay, I'm going to say something now. Please don't leave, okay? People, don't go out, because what I'm going to say is going to be a little bit provocative. She also mentored a generation of researchers, both men and women, and she coined the slogan: computing is too important to be left to men. Thank you. Thank you to her. Nobody is leaving? Perfect. Okay. I also discovered something really interesting: there is supposedly no sexism in hacking. Why? Because the philosophy of the hacker is that only the work of the hacker is judged, and not the hacker themselves. So it means that we don't care about where you come from, your age, your gender, what you look like, or your orientation. It's hard to say this one. You are only judged by your work.
However, I had the luck to do a search on Google in French, trying to find the top 10 female hackers in the world. So, yeah. The funny thing is that, for the French speakers here, it's written "Les dix plus belles hackeuses du monde qui te font chaud", which is a literal translation from another language. So it half makes sense, but the other half doesn't really. So this is "the 10 hottest female hackers in the world". So I read the article, and they were quite impressive for their work. Well, it's true they were impressive for their work, but it's sad to end up in an article like that. And what I wanted to say is that, yeah, we see a will to progress, a will to do better about all these ethics things. However, we see that in society the female hacker is still a fantasy, like this, and we have a lot of stereotypes about female hackers. So the woman I would like to highlight here is Joanna Rutkowska, sorry for my pronunciation, a Polish computer scientist and security expert. She's best known for her research on low-level security and stealth malware. This is the conclusion. So I could go on for hours and hours about women. To be honest with you, in my first version of this conference I think I had like 20 women. And they said to me, calm down, okay, okay, okay. So today, many actions and associations are being set up to give a place and a voice to women in IT. And this conference is one of them. The reason I'm glad, no, this is not, I'm glad, but no, okay. I have some questions, like: have you ever had a role model in your life? And this, sorry, I don't remember. And did this role model help you to dream and give you the motivation to project yourself and believe in your dreams? Yes, no, okay. Yes, okay. Did it allow you to say to yourself, I can do it? Well, role models. I would like now to speak a little bit about my own experience. Sorry, this is my conference, okay, so you're here to hear me now.
I would like to talk a little bit about my experience of discovering my own role model. This is really weird to say like that. A role model has a lot of consequences, and all of them are positive. Not only can they make us think that we can have that kind of dream, a dream of reaching great heights just as they do, but above all, we allow ourselves to think that we have the right to do so. It may sound weird and simplistic, you know. Often I suggest it to my female friends, you know, because I'm passionate about what I'm doing and I don't have a lot of coder friends. I don't know, I have Twitch, okay, it's good. So I'm like, oh, do you want to learn a little bit? You know, HTML, CSS, it's really fun. Trigger warning: there are going to be a lot of flashy colors, okay, trigger warning. You know, the little rotations and colors, CSS animations, this is so fun. I love to do that. Okay, it's going away, trigger warning is done. And they always say to me, oh, no, no, I don't want to, because I'm not good at math. But even if computer science has a basis in mathematics, depending on the field, it doesn't require a lot of mathematics. And I love this sentence: you'd be surprised by how many of my buddies who were not brilliant at math at all have gone on to study computer science or engineering without ever asking themselves whether they're good at math or not. I love that, I love that. And this is kind of a sad situation, all right. So now, and I think we all agree in this room today, women and men are equally capable at mathematics. I don't know. What did I write? Okay: stereotypes linking women and mathematics no longer meet with agreement. I think we all agree today to say that. But the fact is that they persist unconsciously in society.
A woman will often feel inferior to her male peers in math because of conditioning and stereotypes that persist. I know that this is not the case for everyone. I felt that until maybe I was 15, and then I met people who helped me learn math and say, okay, no, I'm good at math and I love it. So personally, I discovered my role model maybe two years ago, and her name is Aurélie Jean. I don't know if you know her, here in the room. Okay, so yeah, she's from France, and I never know how to describe what she's doing, all right. She's a computational scientist. I don't know how to explain. She's doing AI, she's a physicist, she's doing a lot of things, and she's really impressive. She wrote a lot of books. She's trying to help people understand AI. And I just fell in love with what she's done, her background, her career. When I read her book, I don't have the translation. If you want to read the book, you should really read her first book. Where is the mic over there? Okay, and if you want to know a little bit more about the book, don't hesitate to come after and ask me. I can show you the book, so you can see if you want to buy it or not. And discovering this woman made me think that, okay, even if I was already a programmer, you know, I was already working, I had already done my studies and everything, it made me think that I can do more, because I wanted to do more, but I was afraid. I was like, what do I have to say? What can I say? I mean, I'm a woman. I'm afraid. It's sad, but I think that this is what I thought unconsciously before. And discovering this woman, being in the spotlight, being in front of people, writing books and being known, it gave me the courage, it opened the door for me to say, okay, I can do it too, and I have the right to do it.
So the aim of this conference is to highlight women who have changed the course of IT history and who can inspire young girls today, or women, or everyone really. But I ask those of you who have patiently listened to these stories: when you get home, write down at least two names you discovered today and spread the word. Share the stories of these women with your daughters, with your students, with your friends, with your cousins, your nieces, with people in the street, your bar mate, I don't know. These girls don't have to become programmers, but you can open their horizons and show them that being a girl doesn't have to limit their choices and their dreams. So please: narrate, create and propagate. Thank you. It's a literal translation from French, so if you have a better translation, don't hesitate to tell me. So to finish my talk... why... oh no, this is the internet. Oh no, no internet. Go buddy. Okay, try again. So I know you have talks to see, I hope I'm going to do this faster. Oh. Okay, we're going to do it like that. So, nice to meet you, my name is Laura Durieux, a.k.a. DevGirl. I'm a full-stack web developer, WorldSkills Belgium gold medalist in 2020 and 2021. I am a streamer on Twitch, and we code on Twitch, so don't hesitate to come and say hi. I'm also the presenter of the show On n'est pas des Yankees on RTBF, which is the national media of Belgium. So here you can take a picture and come see me on my social media. The slides are going to be available afterwards. Thank you, if you have questions, don't hesitate. Thank you so much. Thank you, thank you.
Outreachy: 1000 interns
Hello, folks. Good morning, evening, afternoon, wherever you are. Welcome to the Outreachy talk and celebration of 1000 interns. So before we start, I just want to see a show of hands: has anyone participated as an Outreachy mentor, a coordinator, or an intern before? Woohoo! Thank you for coming. And for folks who haven't heard about Outreachy before, Outreachy is an internship program that provides internships in free and open source software and open science. And our internships are open to people who are subject to systemic bias and discrimination, and impacted by underrepresentation in the technology industry of their country. Outreachy is truly remote, all around the world. Our mentors are remote, our interns are remote; we have interns on all the different livable continents, not Antarctica yet, but maybe soon. And the interns are paid $7,000 total for the internship stipend. And that's a three-month internship. We run internships twice a year, May to August, and December to March. And as of our most current cohort, December 2023, we have had 1097 internships. And to celebrate those 1000 interns, we had a bunch of celebrations. Awesome. Okay, so we celebrated the milestone in six countries. We had celebrations in Cameroon, in India, Nigeria, Kenya, and of course in the USA. And these celebrations were awesome because we had past interns, I mean, folks who have gone through this program; they were able to organize, they led the celebrations, and they made everybody feel included across the celebrations. Besides the six countries where we celebrated in person, we also celebrated virtually. We had three sessions, and it was really awesome. I also want to talk a little bit about our accomplishments. Not only do we have 1000 interns, we have a 96% internship completion rate. And that's partly because we consider our internships more of a fellowship. We want to make sure that the interns complete the internship.
If they get sick, if they have family issues, we extend the internship. We want this to be more about them learning about free software and open science than about getting a particular project done. And we not only have this great completion rate, we also retain people in free software: 80% of past interns continue to contribute to free software, and 44% of those interns are employed to contribute to free software as part of their job. So we want to talk a little bit about how we got here. How did we get to 1,000 Outreachy interns? And as we talk about that, you're probably wondering who we are, so let us introduce ourselves. My name is Karen Sandler. I am a co-founder of Outreachy, and I'm the executive director of Software Freedom Conservancy, which is the home of Outreachy. I'm Anna, from Brazil; I came here on a trip of 11,000 kilometers, so it took me a while to get here. I'm a past intern, and I'm the current information and process architect of Outreachy. Awesome. And I'm Omotola Eunice Omotayo. I'm from the giant of Africa, Nigeria, and I'm the community manager at Outreachy. Hi, I'm Sage Sharp. I use they/them pronouns, and I'm one of the Outreachy organizers, from the USA. So we're going to go back through Outreachy's history. Right, before we do, I'm going to quickly explain why I wanted to help co-found Outreachy. I have a heart condition. I literally have a big heart. I used to think it was very rare, but it's actually quite common. I'm at high risk of suddenly dying, and so I have a pacemaker-defibrillator implanted in my body. I can't see the source code running in my own body, and I was shocked unnecessarily, more than once, while pregnant, because my heart was doing what a normal pregnant woman's heart does, but my defibrillator thought I was in distress. The only way to stop it was to take drugs to slow my heart rate down.
And this made me realize that our technology may not be made for us, despite the best intentions, and what are we going to do when that happens? So I became really passionate about software freedom. As I've lived with this heart condition and participated in free and open source software communities, it has become very clear that our software can never be made for everyone unless it's made by everyone, unless everybody has a chance to contribute. This is where I entered the story: as I found out about my heart condition and started speaking about it, I became the executive director of the GNOME Foundation, where I met a woman named Marina Zhurakhinskaya. So this is a picture of Marina, and this is me, ages ago, presenting an award to her. Marina was a GNOME Shell developer, and she was very involved in the GNOME community. When the GNOME board evaluated their applications to Google Summer of Code, they noticed that out of 181 applicants, none appeared to be women, and they realized that there was a problem. So the GNOME board eventually brought Marina in and said, what should we do about this? Marina wanted to start a program to help address the issue. She looked back: in 2006, the GNOME board had run a summer outreach program with a few internships, and it was a one-off thing. It was successful, the interns finished their internships, but none of them continued with the GNOME project, and it was just kind of left behind. So Marina decided to reinvigorate that program. You're probably wondering why she is not on stage now. She's not on stage because she died of breast cancer last year, which is really tragic, but she leaves behind this amazing legacy that she created in Outreachy, and I'm so excited to be able to tell her story to you.
And so at GUADEC in 2009 there were so few women attendees that the GNOME board and Marina decided this was the moment to pick this up and create the internship program. Raise your hand if you were at that Desktop Summit in 2009. Nobody! That's great, I'm so excited to tell you about it. It was a really interesting experience. The GNOME board went back with Marina and decided to launch a new internship program, and Marina very thoughtfully tried to ask: what are all of the ways that women are not participating in free and open source software? Why don't they get started? And she systematically tried to address those issues, connecting interns with mentors and helping them make their first contribution. So in 2010 came the first round of what we consider to be Outreachy. For a while we counted the first round, the second round, and so on, and then we started using months and years, because saying that you were part of the 13th round or the 15th round didn't make a lot of sense. If we could just go back to that previous slide: if you notice, this program at the time was for women, and so you see this logo of a karate lady kicking forward. I love this picture, but it's very much how the program started: very, very gendered. It was open to anyone who identified as a woman, and it was a really amazing cohort. In 2010 we had eight interns, and then you can see all these pictures of the interns at the different GUADECs in the following years. A community was starting to form, and one of the things Marina did was create meetups so that people could meet each other before a conference, so that you could walk in with the confidence of knowing you had met someone before you entered.
So as the program progressed, the internships continued to all be with GNOME, and I was executive director of the GNOME Foundation, and the internships were so successful. The interns that came through the program were core contributors to GNOME. We had Planet GNOME, and the interns would be blogging on the planet, and we would see their avatars, and people would come to GUADEC and become so connected, and we realized that this was a program that really needed to expand beyond the GNOME project. So I started talking with my friend Bradley Kuhn, who was the executive director of Software Freedom Conservancy; he still works with me at Software Freedom Conservancy. Marina connected with Jessica McKellar of the Twisted project, and Twisted was a Software Freedom Conservancy member project, so we decided to experiment and see if we could expand the internships beyond GNOME. We did, and it was hugely successful, so we went from there and offered it to a lot of other member projects. Today we tend to have 35 to 40 different free software and open science communities participating in each cohort. Yeah, we used to have a slide where we put all of the communities on it, but it just became too difficult to read. So, as Karen mentioned, originally in 2010 our criterion for who could participate in the internships was anyone who identified as a woman, and then in 2013 we decided to expand that to make it more trans- and queer-inclusive, and we said the internships are open to women, both cis and trans, trans men, and genderqueer people as well.
I think in 2014, or around that time, we also started expanding. Tech companies had published a lot of data about their employees, and so we realized that in the United States we could expand to people of color who were underrepresented in the US tech industry, and I launched an effort to try to expand Outreachy country by country. I was talking to lawyers in France and lawyers in Australia, and we were starting to figure out a way to expand place by place, and it was a lot of work and very difficult. Free software is global, and Outreachy participants, the mentors and the interns, were always global, so it really didn't make a lot of sense to do it that way. Yeah, so instead of going country by country, the internship criterion we have now is: anyone who faces underrepresentation, systemic bias, or discrimination in the tech industry of their country. Now, how do we determine that? We've come up with a series of essay questions that we ask applicants: tell us which country you are going to live in during the internship; how are you underrepresented in that country; how has your learning environment been? The last talk mentioned role models: did you see few role models who looked like you, who represented your identity and background? And then: what systemic bias or discrimination have you faced, both while building your skills and if you were to apply for a job in the tech industry of your country? Over time we found ways to evaluate these essays on a global scale while still allowing people to talk about their experiences at a local level. I love this, because we don't decide what counts as discrimination. We don't have a list of everyone who is subject to systemic bias. We don't have classes of people.
We let people tell us about their own experiences, because we don't presume to understand every single experience of systemic bias, discrimination, and underrepresentation. So then we get into sort of middle history. Well, can I do one more piece of ancient history? Because it's so exciting here at FOSDEM: I was on this very stage in 2014 when I announced that the Outreach Program for Women was rebranding to Outreachy, because it was no longer just for women, and we also announced that it was coming to Software Freedom Conservancy. The project had outgrown the GNOME Foundation. There were still only a handful of GNOME interns, and the rest of the internships were with the Linux kernel and Wikimedia and Mozilla and a ton of other communities. So the GNOME board and Software Freedom Conservancy and the Outreachy team all got together, and we moved the program over to Software Freedom Conservancy, where it remains today. I got involved with Outreachy in, I think, 2014 or 2015, I think 2014, as the Linux kernel coordinator. I originally helped find mentors in the Linux kernel, connected them to Outreachy, and got them prepared to help applicants during the contribution period. Then in 2016, I stepped up to become part of the actual Outreachy organizer team and passed the Linux kernel coordinator position off to someone else. So by 2015 we had opened up our program and said, hey, let's write these essays about the discrimination and bias that we face, and we started having trouble reviewing them, because we started to get thousands and thousands of initial applications, and a lot more communities were involved too.
So in 2017 I sat down with my spouse, Jamie, and he helped me understand a little bit of Django, and we built the Outreachy website, a Django-based site where mentors could sign up and applicants could sign up, and it really fit the custom needs of our program. Big shout-out to Django and Python and that wonderful community. And I want to say, this is a reflection of what I talked about earlier with Marina and how she founded the program. One of the most impressive parts of her legacy is that she built up this program, but then Sage came on board, and she worked with them and was able to transfer that knowledge and create a program that was robust and could exist without her. So we're here on stage with this project that Marina started out of personal passion, but she thought about how it would continue without her, and Sage coming on was absolutely essential to maturing the program. Yes, and I would say my role has been: how do we scale? And the next part of scaling was that it really needed to be more than just me and Karen at this point. And so we brought on Anna. My story with Outreachy starts in early 2017. I heard about Outreachy from an Outreachy intern working with Mozilla; she gave a lightning talk at a women-in-technology conference in my city. At the time, as a mechanical engineering major, I had the crushing realization that as a partially sighted person I wouldn't be able to find a job in my state or in my country; I had too many obstacles to face and overcome. So I applied to the December 2017 cohort and was accepted on my first try. And I had a really good experience in my internship. I had mentors who believed in me, and if you're seeing this, Beno and Johan, thank you. The community was happy to have me as a member. It was a really transformative experience for someone who has faced ableism all her life.
I had people who believed in me and in my potential, and who didn't question whether I was capable of doing my job. When you are switching careers through a program like this, you experience something called a liminal moment. You are not the person you were before it started, and you are not yet the person you are about to become. You are in between states. It's disorienting and scary, and you have to find yourself again at the end of the program, which can be a really difficult task. Interestingly, when I joined Outreachy, Outreachy itself was facing a liminal moment. Things were changing, and we asked ourselves: what is Outreachy, exactly? I remember when we created a Zulip server and started connecting with interns by running biweekly intern chats about careers in free and open source software, conferences, et cetera. Interns were no longer experiencing their internship in isolation, and they were connecting to each other without depending on proprietary software or proprietary social media. That was when something clicked. What was once more of a liminal online space, where people would just pass through with an adjacent community, became a communal space. And with a communal space come coexistence, the need for permanence, and a sense of belonging. With a thriving community come management challenges that were beyond our capacity; at that time, we were just too few. So we published a call for a community manager. And I will say that before we posted that call, we tried to scale by improving our documentation. We said, okay, if we can't answer all the applicants' questions, especially with so many, could we scale our documentation? That worked for a while, but eventually we said, no, we really need an actual person who can help us. Present day, yeah. So we can continue. All right.
So, present day: one of the things, as we expanded, was that we really needed to make sure we could find additional funding. Right. I do want to start by saying that Outreachy was originally funded by corporate sponsorship, which was great. I definitely want to give a shout-out to Google, which sponsored the first rounds and every round since then; it's the only company that has sponsored every single round of Outreachy. Plus, they gave us a lot of help: the program is modeled in part after Google Summer of Code, and the Google staff has always been very supportive and helpful and has given us information and assistance throughout. I also really want to give a huge shout-out to Red Hat, because Marina worked at Red Hat, and Red Hat contributed her time. It's safe to say that there would be no Outreachy without Red Hat's contributions, early on and continuing in the years after. Nonetheless, while we deeply appreciate our corporate sponsorships, it is very tough on the program to have to keep securing them and then be responsive to the interests that a lot of companies want to attach to the internships they're funding. So in this period we shifted a lot more to grant funding to supplement the corporate sponsorship, and that was really transformative for the program, because we were able to plan a little more long-term. Ford Foundation, ARDC, and the Chan Zuckerberg Initiative were the foundations that came in. I would like to say: if any of you work at a company that wants to sponsor Outreachy, definitely get in touch; we really can use the support. We also have some individual funding support, and having that mix of funding is really important to be able to run the internships the way we want to.
And honestly, we can now say no, without having to think twice, to a company that wants an internship too tightly tied to that one company: we're not going to do it. An internship that is not going to be a good experience for an intern: we're not going to do it. With this independent funding, where we would have said no before, it's now very possible and easy to do. One of the interesting things that comes with grant funding is that we can decide, hey, there might be some initiatives that really need our support. So one of the things we did in 2020 was start funding humanitarian free software. This is things like Public Lab, which does citizen science, and MboaLab as well, which is an open science hackerspace doing biomedical research, and all kinds of interesting things. These are projects that don't necessarily have enough funding on their own to support an intern, but because we were able to get grant funding, we could offer funding for humanitarian open source, and eventually we moved to funding open science as well. So again, citizen science and scientific research: we had Outreachy projects that were actually looking at COVID, trying to estimate hospital capacity during COVID, and it was a really proud moment to be able to fund that kind of research. And then in 2022, we had our lovely community manager come in. Okay, so a little bit about where I was coming from. I have past experience working with marginalized populations, supporting them, especially when it comes to their rights and to receiving the right support that they need. I also have past experience empowering people into tech through She Code Africa: coming up with programs, supporting people, and standing in the gap as an intermediary between them and the organization.
Then, coming into Outreachy as community manager, I now stand as an intermediary between the Outreachy applicants, the Outreachy community, and the program itself. I oversee the Outreachy social media platforms, supporting and responding to Outreachy applicants, and putting out content that helps applicants, and people interested in what Outreachy is doing, understand what Outreachy stands for. I was also able to come up with coffee chats: via the Outreachy platforms, we were able to have real, live conversations, helping Outreachy applicants understand the program better, and bringing in past interns, mentors, and community coordinators to answer the applicants' questions and share their experience of the Outreachy program. I've also been able to create more awareness about Outreachy by attending and speaking at various conferences. This has really been awesome: at different conferences I was able to empower people and tell them about the Outreachy program, and that has created very good awareness and, I would say, a resounding growth in applications. We have seen big growth in Outreachy applicants, especially from Africa, people coming not just to participate but also to give back to Outreachy. As you can see, we had zero interns from Africa in 2010, and as of the December 2023 cohort, we have over 44 African interns. This means folks from Africa now understand better that there's a space for them in the open source ecosystem. They are coming into this program to contribute to and improve open source and open science projects, and also to give back to Outreachy and the open source ecosystem in general. I want to say that before, we had a solid, amazing program, but you gave it a voice, you gave it faces, the recognition it deserves. Thank you. And I'm grateful for that. Thank you so much.
I would also like to add that since I joined the Outreachy program, folks, especially the applicants, now understand the different parts of open source they can contribute to, especially the fact that it's not just about the code. They don't have to come into open source to be a programmer; they can come in to contribute and give back in various areas: documentation, even event planning, community management, and so on. And now on to the next Outreachy organizer. Yeah, we talked about the sense of belonging that comes with finding a community. Another thing that comes up often is the desire to give back. You were offered a great opportunity, and you want others to have access to similar opportunities, like what happened to me. This is why I joined the Outreachy team back in 2018. We found that many interns come back as mentors, some as new mentors, some as experienced mentors. Either way, challenging situations require extensive support, and we decided that we needed someone dedicated to supporting and advocating for our mentors. Yes. After Omotola's outstanding year of supporting applicants and interns, we hired Tilda Udufu. She is someone who has extensive experience with the program: she was an intern, she was a mentor, she was a coordinator, all of it for Public Lab. And I'm proud to say that in turn I've become her mentor since she joined the team. She's been facilitating conversations with mentors in office hours, interviewing them so we can highlight their work, and working hard to facilitate relationships between mentors and interns. I think all of it is an indication of a phase of maturity within the program. We are not looking only to always grow; we are looking to grow sustainably and keep our community flourishing. I would also love to add that Sage and Karen have talked about how Outreachy has grown: the background of Outreachy and the growth so far.
And with this we can also point out how Outreachy has grown, not just in terms of why we should have Outreachy, but in how we better support the applicants: I came in as community manager, and we also have Tilda. So Outreachy is not just supporting applicants; we are also supporting mentors. We understand that the program is not just about interns coming to contribute to open source; it is also about people staying in open source and working together to give back to the open source ecosystem. Yes, this is about open source sustainability as a whole: our ability to continue to exist as a community, supporting contributions and making sure the software still exists and is still maintained. And I would say that you can look at the numbers, the people who find jobs contributing to free and open source software and the people who continue to contribute, but no matter where our interns go after the program, they are exposed to software freedom and they take those values with them; there's a follow-on effect from these internships. Our interns have won awards, joined boards of directors, and been mentors, grand-mentors, and great-grand-mentors, and we see graduates of Outreachy everywhere. All right, so then the question becomes: what's next for Outreachy? What is the future of Outreachy? And the future of Outreachy, maybe it's you. Maybe you would like to mentor; maybe you would like to coordinate. If you'd like to know more about Outreachy, you can come and ask questions, and there's also a BoF in AW1.121 at 13:00, or 1pm. If you'd like to come talk with us and figure out how to get involved, we would love to hear from you, we would love to hear what you're doing in free software, so come connect with us.
If you're interested in signing up as a free software community, the deadline to sign up is February 15th, so please do check out our call for mentors and communities. This is a celebration: we're celebrating the fact that we got to this point, and we can only do it with you, really. We are generally gated by the number of mentors that we can find. We still look for funding, of course, but realistically, most of the time the limit is finding enough mentors to provide those internships, and so really that's all of you who are here. Actually, raise your hand if this is your first FOSDEM. Wow, it's like a third of the room, that's great. So yeah, I think one of the things I'm most proud of about Outreachy is that it's a real grassroots program. It's something we started by offering something really pragmatic: just offer internships, pair interns with mentors, have them learn, and then we've just been growing it slowly. I remember when we started back in the day, and I was a new executive director: there were a lot of diversity initiatives coming up at the time, it was very fashionable to start diversity initiatives, and there were programs getting millions of dollars based on glossy brochures they had made, having not done anything in the past. But we founded Outreachy with a different mentality, the bottom-up free and open source software mentality: we're going to do the work, and then if people find it valuable, the resources, the time, and the money will come after. Outreachy is our thousands of volunteers, and I'm proud that it is itself a free software project. We also want to tell folks listening to us that you can support Outreachy in several ways.
You can go back to your local communities to tell the story of Outreachy and become an advocate for Outreachy. Tell folks who could take part as interns to apply to the Outreachy program. You can also contribute to Outreachy through your various communities and projects: bring your projects, I mean your community, be a mentor, come in as a community coordinator, and create more awareness about the Outreachy program. So tell folks about Outreachy, tell your communities about Outreachy, bring your projects to Outreachy, and you can also partner with Outreachy in various ways. You can reach out to us and connect with us later today to ask questions and discuss the several ways you can contribute to Outreachy. Additionally, you may not have the capacity to be a mentor, but you may have the capacity to review pull requests and contributions made by applicants. Communities need that so much; they get so overwhelmed with our applicants, and it would be a great help if you could help them. Yeah, even if you just have experience with a particular community that's involved with Outreachy, going in to help out and answer questions in the community chat is a great way to help those communities. Questions? No, we're going to the thank-yous first, because there are a lot here. I don't think we want to read all the thank-yous, no, we're not going to, but I want to highlight a few people. We've already talked about the organizers and the reviewers and all of our volunteers. We always joke that running Outreachy is like a python swallowing a goat: there is so much logistical work to be done to manage Outreachy. It is huge, and so we want to thank the Conservancy accounting, logistical, and financial staff, including Bradley and Rosanne.
And also... they're amazing. I also really want to thank Rosanna, who is listed on the slide under the GNOME board. Right, Rosanna did that logistical work at GNOME and helped launch the program. We want to thank the GNOME board, because there were times when running a program like this was difficult. It's a lot. Yeah, we've had our times when there was misinformation and people attacking the work that we do, even calling us names and threatening us, and it was really stressful. The GNOME board spent a lot of time making sure that they defended the program and supported it. And I also want to applaud them for realizing that it had outgrown the GNOME community and that it made sense to move it to another organization. The past Outreachy leadership: Cindy Pallares, and also Tony Sebro, who is now on SFC's board, was our general counsel, and is still involved with Outreachy. Justin Colannino has given us pro bono legal help from Outreachy's inception and has been supporting us with legal work. Ropes & Gray, who give us pro bono legal work, Otter Tech, and also Owen Taylor and Jamie Sharp. I did read most of them, I'm sorry, but they deserve it. All right, so we can take some questions if anyone has any. We only have the one microphone, so we have to share it here, and it is so hard to hear in this room, so you have to speak really loudly. Okay, first of all, a huge big thank you; it's hard to overstate the value of what you do. And because it is so valuable, my question is, since in the end you kind of dodged the topic of the future a bit: how can you transcend from an organization that depends, like so many others, on the efforts of a few individuals for survival into something that is actually hard to stop, that has a life of its own, that you couldn't shut down even if you wanted to? Is that for me? Yeah, that's you.
I mean, they're pointing at me to answer that question because, you know, I'm the executive director, so I have to be the visionary of this program and give it that voice. But I do have an answer after you. But Sage will have the answer. No, I mean, I think the whole point of it being a grassroots free and open source software project is that we grow sustainably, we grow slowly, and we grow carefully. We bring stability. We've been working for the last five years on redundancy, making sure that we have a team that isn't going to completely burn out. It's so much work to run this program. I don't know how Marina did all of those logistics; she basically did them herself for a really long time, and she maintained all of these wiki pages where she wrote down names. She just stayed in touch with every Outreachy intern and wrote down where they went to work because she ran into them in, you know, a hallway. So what we've done, with Anna's help and Sage's help, and now Omotola's, is make that a lot more systematic. We've built in robustness so that if any one of us is no longer part of the program, it has a life of its own. I think, too, to bring in some of the values of free software: what we have done in Outreachy is talk to different communities and learn the best practices for being inclusive, for onboarding new members, for designing projects for interns, and we've documented that. So if you look at the Outreachy documentation for mentors and communities, there's a lot of knowledge there that we have learned and that used to be siloed across different communities. So even if Outreachy goes away, I think we still have an impact on those communities: our documentation, our knowledge sharing, and the lessons we've learned will live on.
And so I think in the future, we'd like to be a little more vocal about why we design the program in specific ways, how to be more inclusive, and coach our communities on that. And I think too that grant funding, companies that want to fund the Outreachy General Fund rather than specific internships, is going to be the way forward. So we keep pushing our sponsors towards that, hoping that they'll allow us to make sure that our team continues and also that we can decide which communities have the strongest interns and allocate funding that way. We have a dream of having an open mentorship alliance with other mentorship programs. We know we are not the only ones, and there are many, many more that do things differently, but they are as important and as fundamental to the open source ecosystem. I would also say that, like, historically, we have improved something about Outreachy every single round. Like, the idea is that the whole point of free and open source software is that no one and nothing is perfect, right? And so we've been changing something every round. If you have feedback for us about things that could be made better, we would love to hear it, because we're looking at it ourselves, and so we expect to change and improve. I was also going to add to that, it's really nice to see all four of you on stage, and also the diversity of the organizer group is really, I think, a special part of this. But my question is kind of actually building on what you, Anna, said about the mentoring side. So I'm definitely seeing a challenge in a lot of open source communities and projects around that mentoring side. In general, how do you do mentoring, and how do you scale mentoring in a community? So my question is, like, from your perspective, doing all of this and working so closely with projects and working with mentors, what are the greatest challenges that you see around mentoring and mentorship in open source? And do you have all the answers?
No? Do you have any ideas or tips about what you think the open source movement needs today to grow and scale mentoring? I can think of some, like cultural differences: the way you talk to someone in Brazil may be different from the way someone talks to another in the United States or in Nigeria. So reconciling those differences when you are doing asynchronous communication, for example; you can create a lot of conflict with someone. Another one is safeguarding. This is something that some mentors have told me, especially when we work with more marginalized communities. It can be challenging to ensure that everyone is in a safe environment. We had some folks that had some really challenging lives at home and needed safeguarding. And that can be a difficult situation, both for the mentor and the intern. So having the psychological support for both of them is important and also challenging. And I think as well, for mentors, having a pathway to mentorship is important. I think a lot of people assume, hey, I have to be a maintainer for five years to be a mentor. And so finding ways to define a path towards mentorship that doesn't feel like you have to be an absolute expert. So one of the things we've been trying to do in our Outreachy chats is talk about what does mentoring mean? How do you get to be a mentor? And emphasize you don't have to know everything. And so with Outreachy, what we've done is we've encouraged people to co-mentor. So you've got someone who is more experienced in the project and someone who has just been an intern, either through Outreachy or Google Summer of Code, and they shadow the mentor. They start out helping. We're training mentors. And so figuring out how to create those pathways is going to help. It's difficult to aspire to be something if you don't even know that it is a possibility to be that something.
And also, to add to what Sage just mentioned, the Outreachy organizers, especially the mentor support team, have been able to come up with different initiatives to support mentors, from the contribution stage, which we understand is not easy, to getting feedback at every point during the internship, to our mentor office hours sessions. Because we want to understand the various challenges that the mentors face, we want to be able to support them. Sometimes we also want mentors to come together through our coffee chats with mentors. And even in these sessions, we want mentors to be able to discuss with one another, from different communities, to share the challenges they are facing, to learn from another mentor from another community how they've been able to address these challenges. That way, mentors learn from one another. I would say also, you commented on our diversity as organizers, but one of the strengths of the program is recognizing that the burden of bringing diversity to free and open source software shouldn't rely on the people who are underrepresented. And so mentorship is a great way to be an ally, right? It's a great way to shift that burden. And so, like, you know, I think that's one of the strengths of the program over time: it's a great way to get folks who are not subject to systemic bias, who are not underrepresented, to help bring people up. I think, do we have time for one more question? Or are we? We're done. We're out of time. Thank you so much for joining us. Thanks for supporting Outreachy. Have a great FOSDEM. Thank you.
How to Chart your own Career Path in Open Source - Panel Discussion
Okay. Can people hear me okay? We're good. All right. Thanks. So I guess we can officially say good afternoon. Thanks for coming. My name is Ray Paik. I'll let the panelists introduce themselves in a few minutes. It's a little weird because we have to be here for the camera, but we'll make it work. So I'm a community manager at PingCAP. If you're not familiar with PingCAP, we're the company behind the open source database TiDB. And if you're part of CNCF, you may have heard about a couple of other projects that we donated to the foundation. The first one is Chaos Mesh, and the other one is TiKV, which is our key-value database. I've been at PingCAP since April of last year, and I started my career in open source community management about 10 years ago when I joined the Linux Foundation. I was there for about four years. Then I had community manager roles at GitLab and Cube Dev before I ended up at PingCAP. I don't know how you all felt about 2023. 2023 felt somewhat difficult, especially on the job front for people in open source. I myself was laid off, but I was fortunate. I was wrapping up my interview process at PingCAP, so I think I accepted my offer a week or two after I was given notice. So I was fine, but then you had this constant drumbeat of negative news. It seemed like companies that we thought were at the forefront of open source were making significant cuts to their community teams. Open source program offices were just completely being obliterated. I couldn't tell for a long time last year whether this was just another boom and bust cycle in the high tech industry or whether there was something more fundamental going on. I did a lot of thinking about open source careers, and so I decided to propose this panel. Glad to have a wonderful panel this year. I'll let the panelists introduce themselves. Ildiko, do you want to start? Yes, I hope you can hear me too. So thanks. My name is Ildiko Vancsa.
I work for the Open Infrastructure Foundation as Director of Community. The OpenInfra Foundation is an open source foundation that hosts and helps support open source software development communities in the software infrastructure space; OpenStack, Kata Containers, and StarlingX are all examples of the projects that we have. I joined the foundation seven and a half years ago, so I'm already on the record in terms of longest employment of my life. So you can tell that I like working here. Before the foundation, I used to work for Ericsson, which is a large telecom vendor company, so a very different environment. However, that's where I got in touch with open source. I started to contribute to the OpenStack project, and my first experience was so wonderful that I just couldn't stop afterwards. I became a really big open source advocate. And open source became a fundamental part of my life, to the level that now my full-time job is all about open source and working with communities and the ecosystem and anyone who would like to get involved, or maybe they don't know yet that they would like to get involved, but that's where I come in and convince them that that's the best idea they will have in their lives. So yeah, that's me in a nutshell. Okay, I had technical difficulties. So I'm Dawn Foster. I am the Director of Data Science and a Governing Board member for the CHAOSS project. I'm also on the board of an organization called OpenUK. I live just outside of London. And I'm also a co-chair for the CNCF contributor strategy technical advisory group. So I tend to wear a few different hats. I got my start... well, I came out of university with a computer science degree in the mid-90s, and I somehow managed to luck my way into a Unix system administration job. So, my very first job out of university. And back then I worked for a manufacturing company, and manufacturing companies do not like to spend money on software.
So I used a lot of open source software just in the nature of being a system administrator. And then fast forward a couple of years, I was at Intel in around 2000, 2001. And they needed someone to look at which open source projects were going to be strategic for them over the next, you know, number of years. So which ones should we be engaged in, which ones should we be working with. And I was working mostly at the time in kind of the Linux developer tools space, things like compilers and IDEs. So that was sort of my first role that was more focused on open source. And then over the years, I managed to somehow turn that into a full-time thing where I was a community manager at a few different companies. I've done lots of different things in my open source career over the years. Most recently, before CHAOSS, I was at VMware, and I was their director of open source community strategy. So I've done little bits of things in open source over the years. Allison Randal. So I also started my career in open source in the 90s, or free software, since we didn't have the name yet then. I was working at a startup, an online bookseller that just happened to use Perl as their development language. I'd used Perl a little bit before for linguistic research, but that was when I really got into it. And within a year, I was teaching Perl at the local Linux user group. And then I got sucked into Perl design work by the development team. And then I got sucked into being project manager and the president of the foundation. And it kind of went from there. So I've been involved in a lot of different projects, but you'll know some of them: Debian, Ubuntu, OpenStack. I'm currently chair of the board of Software Freedom Conservancy, on the board of the Open Infrastructure Foundation, and also on the board of Open Usage Commons. Cool.
So yeah, I mean, I was really excited about the panel because we bring a lot of different backgrounds and different ways you got introduced to open source. So I guess I'll ask this question to you, Allison: so, 20-plus years, what motivates you to keep staying in open source? Like, what are the things that you enjoy the most about the open source communities? I mean, for me, hindsight is clear. It really has always come down to the people and the things that we built together. And that's partly the software that we build together. We've built some really amazing tech. But it's also the communities we've built and the styles of collaboration we've invented. And, you know, the legal structures that supported those very different ways of working together. And that's what really stands the test of time. You know, you can get distracted by the politics and all of that. That's not really what matters. What matters is the people and what you're building. Anything else you want to add? Yeah, plus one to the people. It's been an amazing career, right? Like, I've met people and I know people all over the world, and I can go almost anywhere and find someone that I've worked with on a project somewhere to sit down and have a coffee with, no matter where I am. And so it's, yeah, I've just met so many amazing, wonderful people. I can also plus-one that notion. And the other thing is that when it comes to open source, I mean, the majority of the people are there because they are interested in the project and the technology. They share the goals. They work on something that's in their common interest. So you find people who are enthusiastic about what they do. And it is a great environment to be in and to be part of. And, like, knowing people all around the globe, you learn a lot about cultures. And you just have access to so much knowledge that we share with each other on a daily basis.
And you get so many different points of view that it's just very hard to match in any corporate environment, in my experience. So the flip side of that question is, because we talked about all the positives and what we enjoy the most: are there examples of a time when you wondered to yourself, like, what am I doing with my life? And maybe this isn't for me? Like, I mean, maybe it doesn't have to be that dramatic. But anything you want to share? I mean, I do like what I do today. And that's why I keep doing it. There are ups and downs, no matter what you do. When it comes to open source, like, back in my, let's say, corporate days, I think that it would have been better if I had spent a little bit more time understanding corporate politics and navigating how open source can fit into a product development environment, and figuring out how to work with our managers to also help them understand. Because there are a lot of examples where you're a developer, you're working on the code, you know what you're doing, you know why you're doing it, you're enthusiastic about it. But there are so many other people in the company who are trying to make sure that there's a product schedule, that the customer is happy, that the company makes revenue, because otherwise we are all in big trouble. So there are a lot of moving pieces. And to you, who are actively participating in an open source community, it's crystal clear what's happening. But someone like a program manager, who is trying to, again, make sure that the product is on track, doesn't have that experience. They just see something from the outside. So helping them understand how these communities work, what you need to do to be effective in that community and also be effective in the company where you're working, that can be an interesting balance and an interesting challenge.
And when I was very new to it, I think I stumbled into a few mistakes that I would do differently today. Cool. Anything else you want to add, or we can move on. But I mean, you mentioned balance. And I think one of the challenges that you hear at a lot of different events, including here at FOSDEM, is that people talk a lot about work-life balance, or trying to maintain balance in general. And then maybe, Dawn, I'll ask this question to you, because before you came on board at CHAOSS full time, you were at VMware. So, I mean, you were actively involved in Kubernetes and other communities, but that wasn't 100% of your job. You had responsibilities as a VMware employee. How difficult is that balancing act, like, trying to be a good open source citizen, but also trying to be a good employee? Yeah, okay, so that can be a real challenge. I, you know, I think on the one hand, I was just super lucky, right? Because my managers at VMware were really supportive of the work that I did. At the time I was contributing to Kubernetes and to the CHAOSS project and a few other things. And so they were very supportive of me spending that time. But I also take the approach where I... and I didn't always do this, I've burnt out a couple of times in tech, like many people have, where I tried to do all the things. And now I'm super protective of my personal time. And I, you know, I kind of work a set number of hours. And then when I'm done, I'm done. And the only way I can do that is by being really brutal about prioritization and just saying no to the things that aren't that important, so that I can focus on the things that are, whether they're the things I'm working on in open source communities, or the parts of my, you know, at the time, real day job. And I'm sort of lucky now; I will admit right now I have my dream job.
So the data piece, the open source metrics with CHAOSS, has always been my passion project. So being able to do that full time has been pretty great. I would like to applaud Dawn for being able to do the prioritization, that you made the decision and you're sticking to it, because I suck at it. I am also in my dream job. But to me, that did not help with not spending too much time on it. And I think when it comes to open source, and also what we are doing right now, especially after COVID, so many of us are working from home. And to me, just the working-from-home setup, whether it's open source or not open source work, and I like to be enthusiastic about whatever I do, that setup, to me, makes it already really hard to find a balance. Because, like, the left corner of the table is the work corner and the right is where I have my personal time when I eat lunch. Like, that just doesn't really work well for me. And I think Dawn also mentioned burnout; that is something that probably most of us who are enthusiastic to an extreme level will experience at least once. So I share all the challenges, and I can only recommend that once you experience burnout once, you do have a choice from that point, because you do have the full end-to-end experience. You know what the signs are that lead to burnout. So you do have the choice, when you are seeing the signs next time, to stop, to know that, okay, I'm not going forward like this anymore, because I know where it leads. So you do have the tools, with the experience that you're gaining, even if you don't find the right balance right at the beginning. Cool. There are just so many interesting things to do in open source. It's hard to choose one or two. No, like Dawn said, I'm all for setting boundaries. I mean, I work with a lot of colleagues in China, and I'm in the Pacific time zone. And between five and seven, it's really difficult to say no to a quick call.
But I think most of my colleagues now know, like, between six to 8pm, that's family time, I need to have dinner with them. And they understand. Like, if you have the right corporate culture, that works, but it's really hard to do sometimes. So, go ahead. Just a quick note: like, I just started to work with a new community. And they are very active in Europe and Asia-Pacific. I also am in the US on West Coast time. And I work with two communities in total, and the other community is very North America centric, with a few people in Asia-Pacific. So I have all three major time zone regions to cover. So I'm currently in the process of trying to find a new balance, because I can work 24/7 so easily, because there's always someone awake who's very active in the community that I'm working with, who I could talk to, whose challenge I could solve. And it can be very hard when it comes to the time zone challenges, especially if you're really working with global communities. So, when I first opened our talk, I mean, we talked about the job market last year. But for people that are looking for jobs in open source, I mean, is there any advice any of you would like to share, in terms of, you know, first of all, finding the interesting openings that you might want to pursue, interviewing tips, et cetera, finding the right culture? Okay, I can start. I would say that my biggest piece of advice when you're looking for work is to use your network. So I think in my entire career, I have only ever had one job that I got from applying through the traditional channels. Every other job I've ever gotten has been because of someone I knew. And in a lot of cases, these were people that I knew through my work in open source, through these open source communities.
So when you're looking for work, just, you know, spend some time talking to some of the people that work in the communities that you're interested in, and who work at companies that you might want to work at, or organizations that you might want to work at, and talk to them. You know, ask them what it's like at that company and see if it might be a good fit for you, ask them what kind of job openings they have, and just talk to people and get other people's suggestions. Because once you talk to enough people, they will generally know of other people that you can talk to that maybe you weren't already connected to. So don't be shy about talking to the people that you know and asking them their advice and what it's like where they're working. Yeah, I think that when it comes to open source, like, you're operating in a public environment; whatever you do is public. So you can also point to things that you've done. It's much easier to build a resume as well if you're active in open source. So it's the connections and also the work that you've already done. And another thing that kind of connects back to the sort of early mistakes question; I think that was the first question, or along those lines. Like, building connections really is truly important. Like, when you're attending an event, you can prioritize listening to talks. But I would challenge you and say that if you're not interested in talking to the speaker after the talk, or talking to people in the room who are interested in the same topic, then is that really the best session you could choose in that particular time slot? Because you can always have access to the content later. Many conferences are recording presentations, and even if they don't, the information is out there floating on the internet one way or the other. But the person isn't. And the in-person connection is invaluable.
Like, I have a lot of experience, you know, jumping into new communities, and you do that on the online channels first. But whenever you get the opportunity to actually talk to a few people in person, the online interaction just becomes completely different, way more efficient and usually a much more pleasant experience. And then those connections could also be the ones that land you a new job, because those people know you, they trust you, and they can give a recommendation at the company where they work: hey, there's this person, we've been working together in this community and they are so amazing, and they're looking for a job, or maybe they are not looking for a job, but we should get them anyway. So that's a great way to go. I would add: keep in mind that there's not just one way to have a job in open source. You know, pretty much any job these days that's related to software is going to be related to open source. So in my career, I've often switched between doing all my open source development as a volunteer and doing paid work that's like running an open source conference or managing an open source foundation. And then also I've done it the other way around, where my paid work was open source development and then I was, as a volunteer, serving as a board member or, you know, a community manager or something like that in an open source project. So, like, don't be afraid to mix things up, and yeah, find a way to get paid, but also find a way to live your passions. Cool. So, somewhat related to that, I guess, I mean, you've done lots of hiring over the years for open source roles. When you interview candidates, what do you typically look for? I can start. So obviously the skills that you need for a particular job depend on the job. But when it comes to open source, interacting with people and being a team player is kind of a requirement. It doesn't mean that you have to be an extrovert. I'm an introvert.
I know so many people in open source who are totally introverts. But since we are all so passionate about what we are doing, that is not a barrier for us to participate. So the willingness to interact with people, and, even if you're not fully comfortable with the public environment yet, the willingness to be and to do so, that is very important, because you will need to interact with people from all over the place. And if you're quiet and shy and you don't want to be out there, then it is very hard to be successful in open source, in my experience. So that's definitely up on the list. Do you want to take the mic? I would say that I generally look for someone who has enough of the skills that we're looking for that they can probably do the job, knowing that there will be pieces of the work that they'll need help with later. So one of the things I will caution you about is that job descriptions on the website are wish lists. They are not requirements. I have never, in all of my years, had every single thing listed on a job description as a skill. And they still gave me the job. And I still, I guess, seem to be successful. So don't look at those as a list of requirements; look at those as a list of things that they would like that person to have, because they're not going to get that unicorn. They're not going to get that person with every single one of those skills. They're going to get somebody who has enough of those skills to do the job, and then they're going to train them on the rest of it. So make sure that you go ahead and apply for stuff, even if you don't think that you have everything, because in a lot of cases they'll be willing to take a chance on you and train you up on some of the other bits. And also, like, if they see that you're passionate about that particular job and you have an idea of why you would be the best person to do that job, that usually gets you through the interviews as well.
And, like, if you don't have a skill that's listed, then they will more likely overlook that, because you're someone who's already in the mindset that you're ready for that job. So I totally agree with that observation. I don't think I ever checked all the boxes either. I think that's impossible. Most of the time the job description is also written in a way to just be a little bit scary. I assume they are trying to limit how many people submit applications, because the job description looks like you need 200 years of work experience before you apply. But really, most of the things you will be able to learn, and don't be afraid to learn. And if you're also open about that... at least to me, that was always appealing, when a person is honest about: okay, this I don't know yet, but I can learn it. And for most tech jobs, you will never stop learning. So if you have that ambition that will take you from A to B and then from B to C, and you're able to grow, that is always very appealing. Like, again, the ability to grow, that is another thing that at least I personally look for: to see that the person will be able to grow into the job that they are applying for, but then also be able to grow further out of their job and do something else in the company. Really, a job where you already know everything before you start is super boring. So look for the jobs where you will learn something really interesting and that will lead you on to other jobs where you learn even more interesting things. Yeah, I mean, we don't mean to harp on job descriptions too much, but the other thing I want to add is, by the time you accept the position and start, it could have been three, four, five months since that job description was originally written. So think about that: after a quarter, things change, the market's changed, there was a reorganization at the company.
So, I mean, that's why I try not to take the job description as gospel, although it's very tempting, because you want to check as many of the boxes as you can. But it's just a guideline. Like, it's an educated guess as to what the new person might be doing, but it's still a guess. So, sorry, I'm going through the list here. So, I think what I've seen some people do is, like, you start in open source, but then you step back, you take a different role, you do a non-open-source role. I think some of you have done that in your careers. Like, can you talk about that experience, and why you, and I don't know if you were forced to do that, or why did you just step away, and what was it like coming back into open source? I've done it multiple times; I mean, three decades is a long time. I mean, some of it was layoffs, you know, it happens, but more of it was often... I mean, there are reasons like family health, there are reasons like kids, there are reasons like I took a break to do a PhD. You know, there are all kinds of good reasons to take a break, but another one is to avoid burnout. So, if you think you have to stay in the one project forever, you will work yourself and work yourself and work yourself. But if you recognize it's totally okay to just go away for a couple of years and, you know, either come back to that project or to a totally different project that excites you two years from now... Like, it feels less devastating to step aside from a project, and you can do a well-planned, orderly handoff instead of a flame-out burnout. If you push yourself all the way to flame-out burnout, chances are you will never work in open source again, because you burned yourself too far down. You can come back, but it's much, much harder than if you just recognize the signs and say, oh, you know, I should really take a couple of years off and do something else. And then you come back revitalized. So, yeah, I actually highly recommend taking a break from time to time. It's a really good idea. Cool.
Yeah, I've also had the occasional detour. I had one where, at the company that I worked for, the politics around open source internally just got to be too much for me. And so I spent six months working in, like, a market research department or something, something kind of random that I thought was a little bit interesting. But, you know, the other way I think you can help with, you know, burnout, and just kind of, you know, doing something new, is I've worked in loads of different open source communities. So CHAOSS is probably the one that I've worked in the longest, because I've been working with these tools since before the organization existed. But in a lot of cases, I've worked in, you know, kind of a series of open source projects based, frankly, on what the company that I was working for was particularly interested in. But the thing that I found was that every time I switched from one open source community to another, there was at least one person that I knew from a previous open source community. So even when you kind of bounce from community to community, there are usually other people that you know from previous lives in other communities. I do not have that experience yet; I'm still burning myself to learn the lesson. But to Dawn's point, I did see people, like, moving from one employer to the other but still working in the same open source project. And also popping up at another community and, like, oh, hi, you're here too. Cool. And it just kind of shows you that the world gets a little bit smaller if you keep being involved in open source, and you just know people. And the connections that you make are more likely to stay with you longer than in a corporate environment where you're just jumping companies. And that's a really nice experience.
Even, I assume, when you're taking a break and you come back and you see some familiar faces, but the project is new, that's kind of a nice mixture of: I'm doing something new, but I don't have to make all new buddies to get from A to B. So yeah. Cool. So, I mean, I think earlier we were talking about regrets or mistakes. The one I made personally was, I mean, I was working at Intel. We got reorged and then I had to stop working on open source, which was devastating. And I think the mistake I made was that I just spent time sulking and being depressed. And I mean, that's fine, but what I should have done is be more productive and try to get engaged in the open source community somehow, find a different project, show up to meetups, et cetera, et cetera, rather than feeling sorry for myself. So I think getting reorged and maybe getting laid off are two good examples. If you're forced away from open source, what advice do you have? You know, maybe just stay engaged in the community somehow. I think nowadays it's easier. But what kind of approaches do you take to find a new community to join, or how do you keep up to date on what's happening out there? I have an example that's kind of slightly connected to the question. I will share it. I used to do trainings, like how-to-contribute-to-OpenStack trainings, one or two days prior to big events that a lot of people traveled to. And I met a lot of different people with very different motivations for why they were at that training. Some of them were just like, it was free and they were already there and it seemed interesting. And a lot of people kept asking, after they do the training and learn about the tools and the processes in the community, so what should they do next, what should they work on? And I always kept asking back: what are you interested in? Because I can point you to the low-hanging-fruit bugs. That's easy.
But once you fix the bug, and then you fix another one, then by the third one it's like, why am I doing this in the first place? If you're not interested in the particular technology, or you don't have a motivation to be at that particular place. So I would say, don't ever let anyone else tell you what you should do. Go where you feel passionate, where you learn something new, where you're interested in the technology. And if you get involved, then you will have the connections, and then the job will come around as well, if you want a paid job that also works with that particular technology. So I would say, make sure that you prioritize your interests and invest in yourself through that. I think it also partly goes back to the people. Multiple times we talked about open source contributors moving from project to project. I was working at Canonical, and when I left Canonical, a lot of other people that had also been at Canonical were working on the OpenStack project. And I thought, that's interesting. What's that all about? And that is how I got involved in OpenStack. It was just by talking to other people and seeing what they were interested in now, and kind of keeping those connections. So your network in open source can be really, really valuable in staying connected and finding out where the new things are and where you might want to keep working. Cool. Okay. I think we have about 13 minutes left, so I'll ask one more question and then leave the last 10 minutes for the audience. And I'm not going to hold any of you to this. As I said earlier when I started, I felt pretty depressed for large parts of last year, because I wasn't sure if this shift is unique or if we're just dealing with another pendulum swing. So, open source careers in general: what's your outlook? I mean, to be honest, I think I'm a little bit more optimistic now than I was in the middle of last year. The middle of last year just seemed daunting.
And it was just devastating to see a lot of my friends get laid off. But what are your thoughts on where things are headed? Or are we dealing with more of the same? Okay, I can go first. So, as I work for an open source foundation, which is a nonprofit organization, and I work with a lot of communities, I do still see the effect of where the economy is right now. However, at the same time, even just in the past two days in the co-located events, people were throwing out numbers: if we didn't have open source, it would cost 4-point-something billion dollars to rebuild what we would lose, and there are the trillions of dollars of demand that are driven by open source software. Those kinds of numbers show that open source will not go away. So even if the economy is restructuring itself, companies will restructure themselves too, and I don't think that anyone really has a choice of not using open source software anymore. The software also needs to be maintained, because otherwise you're not able to use it. Security is a high-priority item in every single conversation that I've been participating in in the past few months, and maybe it's getting up to years now. So there's a lot to do in open source. It is also a model that is very sustainable, if it's done right, in terms of investment. So I think we will bounce back overall, and I think that the job market will have a lot of opportunities that are more directly focused on open source. And, I think Allison mentioned that there isn't really a job that has nothing to do with open source anymore; it's just maybe not called out directly. But I'm optimistic. Yeah, I'm also optimistic. I do think that the pendulum has swung too far in the cutting of jobs, and in particular, I think some of the open source groups have been particularly badly hit. But I think it's not going to take companies long to realize that somebody has to do the work on the projects that they depend on. And so, you know, I work a lot with CNCF projects.
Most of them are understaffed, and they don't have enough resources to maintain the software over the long term, because companies have pulled people off of some of those projects. And so many companies' whole product lines rely on a lot of these projects. So I think they're going to quickly realize that for the new features, the bug fixes, the things they're going to need in the software, they're going to have to resource some of that work. But the other trend that I find particularly promising is some of the alternative funding sources. You look at groups like the Sovereign Tech Fund out of Germany, who are funding core infrastructure projects. You look at things like GitHub Sponsors. You look at a lot of these other groups that have started funding individual projects and individual developers. And so I think that's also an interesting trend from a career and a job standpoint for open source. I don't know. From personal experience, I was laid off last year, and I haven't looked too hard, because I was having fun working full time on my volunteer open source projects. But towards the end of the year it was a lot of, oh, this year is totally blocked off, and in the beginning of the new year it was, we're hiring, we're hiring, we're hiring, we have a lot of positions to fill. So if you think last year was a difficult year, look again, because things are changing now. Cool. So I think cautiously optimistic is the phrase I'd like to borrow from the economists. So I think we can open up to audience questions. I don't know if we have a microphone for the audience, or I can just bring one. Thanks. So I have a question for Ildiko. You mentioned earlier that when you are at events like this, you have to take advantage of getting to know people and interacting. I'm also an introvert.
How do you get past the barrier of talking to strangers at an event like this? Not an easy question, I know. Excellent question. I can only share my personal experience. To me, if I'm passionate about something, that will push me through the first few seconds of awful experience. The other thing I found is that I have days when I just wake up and I'm feeling more social, and there are days when, whatever I do, I could write a script for myself before I walk up to a person I don't know yet, and I would still be totally awkward. And I learned to say that it's okay. I have days like this, and it is okay. I also started to be a bit more open about this, sometimes just telling the person, you know, I'm socially awkward sometimes. I'm not ashamed of it, so I'm not afraid of putting it out on the table. And many times the other person can relate: yeah, well, it's not easy for me either. And I have so many examples where I said something like this and all of a sudden that was the icebreaker, and the other person is also like, oh yeah, it is hard for me too. And then we have something to talk about. So it is hard. I know that a conference like this will drain me; I need a few days to recover. My mother also knows that she should not call me for two, three days, because I will not necessarily be a pleasant experience on the phone. But yeah, you learn how social interaction affects you, and then you will also learn how to navigate yourself. So I can only encourage people to get through the first few awkward experiences and then build on what you learned about yourself. Yeah, I mean, just to build on that, talking to strangers is hard, right?
And I share some of Ildiko's tendencies; I tend to be a little socially awkward. But what helps me is to talk to people in more social situations. So, you know, you're in line for a coffee, or you're at one of the after-parties or something where it's a little bit more social. And I just have ways of coping with it. My question for people is always, you know, are you enjoying the conference? And building on that: oh, what was the favorite thing you saw today, or what are you looking forward to tomorrow? So you're talking about the conference, and even if this person isn't working on the same kinds of things you find interesting, maybe you learn something about what they found at the conference and what looks interesting to them. And it can be a good icebreaker. And then sometimes, if you're standing in a group of people, somebody else will chime in, and then pretty soon you've got a conversation. But that's how I start. That's my coping strategy for awkward conversations with strangers. I'm also an extreme introvert. For me, it's about understanding that it's difficult for them too. So if I'm focused on trying to make them comfortable, I'm not thinking about how uncomfortable I am. And also just being super curious. Like, ooh, what do you do? What are you interested in? And I get so focused on whatever project they're involved in that, again, I just completely forget about my own awkwardness. But also, planning time off, even in the middle of the conference, planning like half a day off: oh, I don't have a whole lot of talks I want to see right now, I'm just going to go back to the hotel. And it really helps, because you recharge that introvert battery, and then you're ready to deal with people again.
I just want to say plus one thousand to that one, because I think it took me years to be comfortable with saying, I don't really need to talk to anyone in the next two hours. In these sessions, I don't have a target topic where I need to network with people. So I just leave, I find a nice coffee shop somewhere outside of the convention center, get some fresh air, and tell myself that that's okay. Also, once you get into the environment and you start to know more people, you don't have to go to every single social event after the conference; there are usually happy hours every evening, and you don't have to go to all of them. Once you have a base network, you can pick which ones you want to go to, and don't sweat the rest. Because at the very beginning, I was like, my company is sending me overseas, it's a very expensive trip, I'm missing a week of work, like the day-job kind of work. And I felt obligated to go to every session, talk to people, go to every social event. And if I stepped outside of the convention center during the conference day, then I felt guilty. So letting that go: yeah, let it go. It's very important for you to take care of yourself first. I mean, in addition to the social awkwardness that I also deal with, at crowded conferences like this it's challenging to find time to talk to people, because they're all busy, especially the speakers. And I've been on both sides of this. It's completely okay to say, could I message you on LinkedIn, and then have a call with them like a week later. And I actually did that with one of the speakers last year. It was in the K building, one of the larger sessions, and he was just inundated with a lot of people. And I just said, hey, can I connect with you on LinkedIn? And then you're in a more relaxed environment on Zoom; you just have a conversation about his talk or his background.
So, you know, don't force yourself to have all the conversations in two days. It's just very difficult logistically. So, any other questions? Oh, go ahead. Yeah. So my question is, have you experienced any significant difference in terms of income working on a heavily open source type of project or job versus, let's suppose, a normal one, if there is such a thing? Thank you. The only thing I can say about salaries and things is that, in my experience, that's more tied to geographic location than to the job role itself. If you're at a hyperscaler in, I don't know, a VP position, then I assume you will not have money problems for the rest of your life. But at the same time, my experience is that I moved between geographic locations, and that affected my salary more than anything else. I have not noticed a difference. To the degree that, when I was putting my son through college and I was very much focused on salary, which I'm not anymore, I was in the 1%, working fully open source. Like, fully open source, nothing proprietary. So it does have a lot to do with the company. Different companies have different salary bands, so you're more likely to get more if you work at a big company than at a small startup; startups tend to be a bit more weighted towards stock options. So it just depends. But yeah, there isn't really a difference whether it's open source or not. Yeah. And then, also comparing, because I just asked this question, having worked at a foundation, a nonprofit, versus for-profit organizations: when they need to hire people, they need to be competitive. I mean, if you're a nonprofit, you can't offer stock options, that's not viable, but they have to find other ways to make the job appealing and attract good people. Right.
So you can't be at a complete disadvantage, salary-wise, as an example. That's, you know, that's my experience. Other questions? Anyone? No? All right. Cool. Well, just a final thing I want to say: if you want to connect with us, I mentioned LinkedIn. All of us are on LinkedIn, and also on Twitter. If you want to continue the conversation, feel free, and enjoy the rest of the weekend. Thank you. Thank you.
The Regulators Are Coming: One Year On
Okay. Testing, testing. Yeah, there we go. If I can call your attention: in the next session, we have one hour on The Regulators Are Coming. Your chair for this session is going to be Simon Phipps, and he will tell you all about it. Welcome. Thanks for coming. There we go. Yay! Hi. So I'm Simon Phipps from OSI, and I'm part of a group of people from open source foundations that have been engaging with the European legislators this year to fix the issues that you all told Benjamin about after his talk at FOSDEM last year. And the TL;DR, for when you leave early, is that thankfully Benjamin and Omar down here listened very carefully and have, I believe, addressed all of our concerns about the impact of the CRA on open source developers and open source charities. There are some remaining issues that are a little more complex to deal with, and they will be dealt with in some guidance that comes from the European Commission. So, to speak to you today, first of all I've got Benjamin Bögel, now Head of Sector at DG CONNECT; he was one of the authors of the Cyber Resilience Act and has been intimately involved in fixing it with us all year. And he is going to tell us all about the CRA. After that we're going to hear from Gaël Blondelle from the Eclipse Foundation, who was also part of our group that was interacting with the Commission, and he's going to tell you whether Benjamin is telling you the truth or not. And then Omar is going to tell us the same things about the Product Liability Directive, and then Dirk-Willem van Gulik from Apache is going to tell you whether Omar told you the truth, and then Enzo here is going to run an audience Q&A so you can ask these people all the questions that you want to. We've only got 50 minutes, so if your question doesn't get answered, come to our dev room, which is all day tomorrow, in AW1120.
It's on Open Source in the European Legislative Landscape, and we're running four two-hour workshops to give written feedback to the Commission on their digital agenda legislative program. So with all that said: Benjamin, thank you so much for coming back, and they've promised not to throw anything. So go for it. Thank you. Thank you so much, Simon. Thanks for having me again. It's been an exciting year. I was here exactly one year ago. Last year when I was here, I was presenting the Commission proposal, which is the first step of the legislative process. We as the Commission make the proposal, and then the co-legislators, the European Parliament as well as the Council, which represents the Member States, negotiate on the basis of our proposal. And now I'm here to report back after one year of negotiations. The text is almost done. It's quite stable. We still need the final vote by the European Parliament, so it's not entirely finished, but we are quite confident that what I'm going to present to you today is a rather stable version of the Cyber Resilience Act, the newest kid on the block when it comes to cybersecurity legislation. Last year I presented the proposal. I will repeat some of that this year, but I will focus much more on the open source elements, because there are many more open source elements in the final version compared to the original version. For those that weren't there: what is the CRA about? It essentially requires developers, hardware and software manufacturers, to introduce security by design in their development processes. The cheese on the left represents a product with digital elements, as we call them, full of holes and security vulnerabilities. On the right-hand side, once you've complied with the CRA, there will be way fewer holes, although we do acknowledge, of course, that it will be impossible to get rid of all the holes. That's just the nature of cybersecurity.
Here is a brief introduction to the main elements of the law. As I said, it's about cybersecurity rules for the placing on the Union market, the entire European Union, of hardware and software products. We have three main actors in this legislation. The manufacturers will bear the brunt of the rules: they have to make sure that their products are secure. But there are also obligations on other types of actors, mostly the distributors, which are essentially either brick-and-mortar stores or online shops; they have to make sure that the products they sell are secure. And the importers, who import from outside the Union onto our market. The rules come in the shape of essential requirements. Essential requirements are high-level, objective-oriented, technologically neutral requirements for the placing on the market of the products. They are things like: ensure access control; ensure the confidentiality and integrity of stored and transmitted data; and so forth. So, you know, all of these are high-level. This is the cybersecurity 101 that we're essentially putting into law. To make it more useful and easier for manufacturers to comply with those requirements, the European Standardization Organizations will develop harmonized standards, and then you can use those standards to comply with those requirements. The European Standardization Organizations essentially gather the manufacturers, so it will be the manufacturers themselves who develop those standards. Depending on the level of risk that is associated with a product, there will also be different types of conformity assessment. I will explain that in a moment. I also want to mention separately that there are going to be reporting obligations: if you discover vulnerabilities in your products that are being actively exploited, or you have an incident on your network that affects the security of your product, then you will need to report that.
And finally, another important element, of course, is market surveillance and enforcement. All 27 Member States will be required to set up their own national market surveillance authorities to check products and ensure that the products on the market are actually secure, or at least compliant with the CRA. So these are the main elements. We are tapping into an existing framework. You've probably all seen it: the CE mark. On your smartphone chargers, for instance, you have the CE mark. The CE mark tells you that the product you're holding in your hands is essentially compliant with all European product regulation. And in the future, when you see the CE mark, it will not only mean that the product is compliant with safety regulation at the Union level, but also with cybersecurity legislation, the Cyber Resilience Act. So which products are we talking about? The scope is quite wide and deep. When I say wide, I mean that it applies to all sorts of hardware and software products, such as laptops or operating systems. But it applies not only to the final products but also to the components, because the nature of cybersecurity is, as you all well know, that vulnerabilities in components can often have an impact on the security of the final product. And in many cases it is very difficult for the integrator who builds the final product to find all the vulnerabilities in those components; often components are black boxes, in particular when they don't come in the shape of open source. So they also need to be secured. And so all components that are placed on the market as separate products are also in the scope of this regulation. What is not in the scope? I already explained that last time, but it was not sufficient for you. I explained that non-commercial products would not be in the scope, and I think this has been quite an issue that has been discussed at great length.
A lot of people have asked: what does non-commercial mean, in particular in the context of open source? And this is one of the reasons why, over the last year, we've tried to flesh out in more detail what non-commercial means for open source. And I can tell you that during the last year, barely a single day passed when I didn't wake up to a message from Simon, Dirk-Willem or Enzo trying to help along with this process. So, non-commercial products are not in the scope. I will explain in a moment what that means for open source. Stand-alone services, in particular software as a service, that don't come with a product, that are stand-alone, that you just access through a website, are also not covered. And we also have a few outright exclusions of products that are already regulated when it comes to cybersecurity, so they don't need to be covered by the CRA. That includes, for instance, motor vehicles and medical devices. Okay, so, just to understand: I said the scope is wide and deep. I want to talk a bit about what it means that it's deep. When you are the manufacturer of a final product, in this case a smartphone, you will be integrating two types of components: on the one hand, like in blue here, components that you've developed yourself, as well as components, here in yellow, that you are buying or sourcing from the market and also integrating. You are responsible for the security of the entire product as a whole and for its compliance with the CRA. But when it comes to the components that you source from third parties, it is of course much more difficult to have assurance about their security. And for those components, we've introduced a due diligence requirement. That means that as a manufacturer you will have to do your utmost to make sure that the components you integrate are secure. That can mean that you simply check the changelog: is this a component that is regularly maintained?
You check the vulnerability databases that are out there on the internet to see if the latest version contains any known vulnerabilities. And if it's a commercial product that is subject to the Cyber Resilience Act, you can also check whether it carries the CE marking. This is how you can achieve that the product as a whole is CRA compliant. So now to the conformity assessment; I mentioned it earlier, and this is the first time I'm going to mention open source more explicitly, because this is where it's explicitly mentioned in the text. For the vast majority of products, which we call the default category, manufacturers will have to undergo a self-assessment. That means that it's the manufacturers themselves who will check and ensure that the product is compliant. But then there are some products, explicitly listed in the annex of this regulation, that the co-legislators have considered as important or critical from a cybersecurity point of view, and they will have to undergo a more stringent type of conformity assessment. First we have the category of important products. Manufacturers in this category will have to apply at least a harmonized standard, the ones that I mentioned earlier, or in some instances they will even have to submit their product to a third party to have it checked that it's secure and compliant with the law. Products in this category are, for instance, operating systems, antivirus software, or firewalls. Then there are also critical products, also listed in the annex. These are products such as smart cards and secure elements that we consider to be even more important. By the way, these are only hardware products; there is no software, and nothing that is potentially open source, in that category. And for these products we may in the future even go a step further and require a certification of the products.
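The due diligence checks described above (changelog activity, public vulnerability databases, and CE marking for commercial components) could be sketched roughly like this. This is an illustration only, not legal guidance; all function and parameter names are my own invention, not from the CRA text:

```python
def component_due_diligence(recently_maintained, known_cves, ce_marked):
    """Illustrative sketch of the component due diligence checks the
    talk describes. Parameters are hypothetical stand-ins:
      recently_maintained -- does the changelog show ongoing maintenance?
      known_cves          -- list of known vulnerability IDs in the
                             latest version (from public databases)
      ce_marked           -- True/False for commercial CRA products,
                             None when the check does not apply
    Returns a list of concerns; an empty list means no red flags found.
    """
    concerns = []
    if not recently_maintained:
        concerns.append("no recent changelog activity: component may be unmaintained")
    if known_cves:
        concerns.append("known vulnerabilities in latest version: " + ", ".join(known_cves))
    if ce_marked is False:  # None means the CE check is not applicable
        concerns.append("commercial component lacks CE marking")
    return concerns

# A well-maintained, CE-marked commercial component raises no concerns:
print(component_due_diligence(True, [], True))  # []
# An unmaintained open source component with an open CVE raises two:
print(component_due_diligence(False, ["CVE-2024-0001"], None))
```

The point of the sketch is only that the requirement is a best-effort checklist ("do your utmost"), not a guarantee that every vulnerability is found.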
Now, when it comes to free and open source software, we have a special provision in the CRA that says: irrespective of whether your product is important or not, you will always be allowed to undergo a self-assessment. So you will not have to submit any free and open source software that is in the scope of the CRA to a third party. The reason behind that is that open source is a transparent product, and anyone, including the users or integrators, can check for themselves whether the product is secure. So you do not need a third party that vouches for the product. Now, we also try with the CRA to shift responsibility from the developers of open source components to their integrators, because so far integrators have often been free-riding on open source components and not giving enough back to the community in terms of fixing vulnerabilities in these products. So, coming back to the smartphone product that I presented earlier: imagine a smartphone product that integrates an open source component. Here is a silly open source component that prints fruit. So far it was a one-direction thing: the integrator would take the component and, I mean, not always, sometimes of course integrators also contribute a lot back, but in many cases they would just integrate the component into their own product, and that would be it. From now on the CRA will say: if you find a vulnerability in your component, you have to inform the developer of that component, so that developer can also provide a fix for that vulnerability. In addition to that, since as the manufacturer of the final product you are responsible for the product as a whole, in the absence of a fix from the upstream manufacturer you will also be required to provide a fix. Either you fix the vulnerability in that component, or you replace that component with a different component; you just have to make sure that your product is secure.
But if you do provide a fix, then you will also have to provide that fix to the upstream manufacturer so that the upstream manufacturer can integrate it. This is how we want to share the burden of security between the developers of final products and the developers of free and open source software. So, is your open source software project covered by the CRA? I think this is the question that you are all asking yourselves. I said the initial Commission proposal stated: if you are not commercial, you are out of scope. We have now fleshed this out in much more detail, and we've even introduced a new type of actor, the open source software steward, which I will also present to you in a moment. So: if you are merely contributing to someone else's project, you are definitely not a manufacturer. You are not subject to any obligations. That was a worry that was expressed several times, but here I can assure you: you can just keep contributing, and you do not need to worry about CRA compliance. Now, if you are providing the project and not merely contributing to it, the question is: are you developing in the course of a commercial activity? If you're not, if it's really just a hobby project, again, you're not in the scope of the CRA. Now, if it is in the course of a commercial activity, the next question is: are you directly monetizing the product? Because we know that many open source projects do not directly monetize, but are still in a wider commercial setting; many companies coming together to jointly develop a component that they will use for their own products is a wider commercial setting. But we only look here at the direct monetization of the project. If you're directly monetizing it, then you are a manufacturer, and then you are subject to the security-by-design requirements of the CRA.
If you're not directly monetizing the project, but it's still taking place in this wider commercial context, then this is where the new type of actor we introduced comes in: the open source software steward. These are essentially foundations, not-for-profits, and so forth. Here we've invented a new, very light-touch regime. If you are a legal person that provides support to specific FOSS projects on a sustained basis, and those projects are intended for commercial activities, then you will have to comply with the light-touch regime of the CRA that applies to the open source software steward. But if you're just a collaborative project, with no governance framework to speak of and no direct monetization, then again you're not in the scope of the CRA. That means the vast majority of open source projects will not be in the scope of the CRA. So, I don't know how much time we still have. I can maybe quickly explain what the open source software steward will be. I already gave some examples: foundations, not-for-profits, also companies that build open source for themselves, for their own monetization or integration into their own products, but then make it available to the public. They will all be open source software stewards. And I already said it's a light-touch approach. It's not going to be heavy, but the idea is to place some responsibilities on these types of actors, but only responsibilities that they can also bear, given the nature of their project and their organization. So there are basically three types of obligations. First, you have to put in place a cybersecurity policy. The CRA is not very prescriptive about what that cybersecurity policy should look like. It provides some basic elements that need to be covered, such as supporting the community in providing information about security vulnerabilities, describing how you will mitigate vulnerabilities, and so forth.
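The sequence of scope questions the speaker walks through (mere contributor, commercial activity, direct monetization, sustained steward support) amounts to a small decision tree. As a purely illustrative sketch of that tree as described in the talk, and absolutely not as legal advice, it might look like this (the function and its parameter names are my own):

```python
def cra_role(merely_contributing, commercial_activity,
             directly_monetized, sustained_steward_support):
    """Rough sketch of the CRA scope decision tree for open source
    projects, as described in the talk. Simplified and illustrative
    only; the real determination depends on the legal text and the
    forthcoming Commission guidance.
    """
    if merely_contributing:
        return "out of scope: mere contributor, no obligations"
    if not commercial_activity:
        return "out of scope: hobby project"
    if directly_monetized:
        return "manufacturer: full security-by-design obligations"
    if sustained_steward_support:
        return "open source software steward: light-touch regime"
    return "out of scope: loose collaborative project"

print(cra_role(True, True, True, True))    # contributors are always out of scope
print(cra_role(False, True, False, True))  # e.g. a foundation-backed project
```

The noteworthy shape of the tree is that three of its five leaves are out of scope, which matches the speaker's claim that the vast majority of open source projects will not be covered.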
Secondly, you will be required to cooperate with market surveillance authorities, just like any other actor in the Cyber Resilience Act. And thirdly, you will also be required to report incidents and vulnerabilities, but only to the extent that you are involved in the development. So if you're not involved in the development and you know nothing about the project and the vulnerabilities, then you will not be required to report vulnerabilities. Okay, so this was a high level overview of the CRA. Just maybe very briefly, what are the next steps? We are hoping to conclude the CRA very quickly in the coming months. The entry into force, I cannot be sure, but it will be roughly around the middle of 2024, maybe a little bit later. And then there's going to be a three-year transition period. During that three-year transition period, the European standardization organizations are going to develop the standards. We as commission are going to develop guidance. For this, we will need you, because of course the CRA is a high level legislation. Many of the concepts need to be fleshed out through the guidance. So I'm actually looking forward a lot to all your questions, because these questions will help us determine what is relevant for the guidance. Yes, and then in three years' time from maybe June this year, so maybe in June 2027, the CRA will enter into force. Thank you very much for your attention. Thank you very much. Still got to turn it on. There we go. Thank you very much for all that. Now, Gaël Blondelle is one of the leaders of the Eclipse Foundation. Eclipse has been speaking up frequently for the open source community in this legislative process. They've had two staff working on it quite a lot of the time, Deb Bryant and Enzo over here, who you'll hear from later. Gaël, could you come and tell us how the Eclipse Foundation feels about the state of the CRA now? Yes, thank you very much Simon, and thank you Benjamin for the presentation. 
Well, thank you for coming. You see, that went well. That was okay, so far. One first point is that I think we have always said that we agree with the goal of the CRA. That was in the first blog that was published by Mike on the topic. We agree on the goal, but initially it was very scary. And I think that last year, that was the conclusion of your presentation: hey, come on, what are you doing? How can you put us on the spot like that? Because putting CE marking on all the open source projects was just not an option. One thing, and that's very important, is that we know we have lots of open source developers that are volunteers. And even when they are paid to do open source development, what they focus on is the features of their project, not non-functional topics like security, etc. I think that as a community, as an ecosystem, we know we have to take care of that, because we had lots of issues in the past. But having that coming through a regulation was something completely new to us. And yeah, even if there is a legislative process that is kind of obscure for most of us, I think what's interesting is to see that during the year, we managed to establish enough connections that the co-legislators listened to the open source community. So from your presentation: today's obligation to push corrections, to push fixes upstream, also the fact that contributors are not responsible, have no obligations, etc. And also, from my perspective, the introduction of a new kind of organization: that's the first time there is a regulation talking about open source foundations or those kinds of organizations as something specific. Those are very interesting aspects. But to conclude, and maybe that's an opening for the conversation after, that's just the beginning, because we have mostly three years in front of us. And in those three years, you will write guidelines. 
And hopefully we can collaborate well on writing guidelines. But there will also be the standards. And maybe, from the point of view of the open source community, it's fair to say that the standards organizations have not been the best friends of the open source community. So how do we deal with that? I think that when you say harmonized standards, I guess a few people in the room think, hmm, it's unlikely we will like something like a harmonized standard. So that's something we need to keep on our radar. And the fact that the regulators are coming, that's the title of the panel, I think that's a good thing, because it's also a sign that open source has won and is present everywhere. We used to be under the radar. And now I see several faces from the European Commission in the attendance. You are here to explain to us, and we have established some connections. So those are good things. And yeah, the conversation continues tomorrow in the panel, in the EU policy devroom. And that's it. Thank you. So that's the CRA. Now, the CRA sets the rules for the market surveillance authorities. It says how countries are going to make sure their citizens are safe from the products that are being sold in those markets. When it turns out those products aren't safe, Europe's Product Liability Directive gives citizens recourse to have justice brought into their lives. And the Product Liability Directive has been in place for many years in Europe, but it doesn't assign any liability to software producers. And so, within certain boundaries, the European Commission is going to do something about that. Those big, bold, lettered disclaimers at the end of your software licenses do not apply in Europe anymore. And that's because the Product Liability Directive is being updated to give software producers liability towards consumers. 
And to tell us about that, we've got the legal and policy officer from DG GROW, Omar Anagi, who was one of the primary authors of the PLD, and he's going to tell us what's in it. So Omar, please. Thank you Simon, and good afternoon everyone. It's a pleasure to be here again this year. Same as the CRA: we are a year after, and we now have more than just a proposal. We have a legislation that still needs to go through adoption by the parliament. But just as a small introduction: whatever has been said just before, let's try to forget it for the next 12 minutes, because it is not applicable in our case here. When we speak about the PLD, basically it applies to any type of product. The only element that is necessary is whether it is made available on the market, and made available on the market basically means any supply or distribution for use, whether in return for payment or free of charge. And the most important element is actually the commercial activity. I know that everyone always asks, especially here last year, the question: what is a commercial activity? Unfortunately, I cannot tell you exactly if your own product or your own software is in a commercial activity. This is an assessment that is done by the judge. There are elements, the number of supplies of the product, the amount of use of the product, but this cannot be determined beforehand for the PLD, because of its nature as a safety net. So the assessment will be done for each individual product. Even if, let's take a more traditional product like a bottle, you will have to look at the specific bottle and not the series of bottles to determine whether it is in a commercial activity or not. And that is the scope, but then we arrive at the product itself. 
Any product: the definition is really legalistic, so you don't need to really get that, but basically it's everything, and we have clarified that software, raw materials, and digital manufacturing files are also products under the PLD. There is no definition of what software is, because, as you probably know, software 20 years ago is not the same as today. So the idea was to leave it as open as possible, to ensure the future-proof, safety-net nature of the PLD. You asked me if SaaS is covered: yes, it is covered. The PLD disregards how the product is supplied, how the product is bought, how the product is used, what the business model of the product is, where it is stored. All of this is totally disregarded. Any software is covered by the PLD: algorithms, operating systems, AI systems, apps, whatever you want, all of them are covered under the word software. As Simon said, the PLD does not kick in on its own. I mean, you do your job, and in the PLD we're not telling you how to do your job. The only thing that we are telling you is: know the risk profile of your product, because if something wrong happens, and maybe none of you will ever experience the PLD in your life, if something wrong happens, someone has to get compensated for the damage. The damages are pretty straightforward. It's basically death and personal injury, including psychological health, the destruction of property, and the last one is destruction or corruption of data. Those are the three main categories of damages that need to be compensated. If there is a single one of these, you would then have to compensate basically everything that is related to that. As I said, you will not have a case if there is no damage, and you will not have to face the PLD itself. Except in certain situations, you might have liability even if the damage has not yet occurred. Let's take a pacemaker. You know that the pacemaker has an issue. You will not wait for the person to die because of it. 
You will preemptively get the compensation, namely the costs of going back again for surgery, etc. I use the pacemaker because they are part of the wider range of medical devices, and medical devices also sometimes include software. This is a specific situation in itself. When we talk about the liability, the question is: for how long? The main rule is ten years. This is the general rule. Namely, if you place your product on the market, you may have it available on the market from the first day. This is when the time starts running. But as you know, software might evolve, AI systems for example as well. Considering that software that was placed on the market 15 years ago may have been changed through a lot of updates, for instance, it would be kind of limiting to apply only ten years, because it would mean that someone who bought the software ten or eleven years ago would not be able to recover damages in case something wrong happens, although the software has been updated. So we have also included a new starting period, which is when the product is substantially modified. I'm not going to go into detail. We're not explaining exactly what a substantial modification is. In most of the legislation, you will find what a substantial modification is, and roughly for software, I don't know if the CRA has a substantial modification definition, but for instance you would have to go to the CRA to see what a substantial modification is in the case of cyber vulnerabilities. What we say is basically: if you update your software and the update is such that it changes the risk profile of your software, it is a new product, it is a new software, and the time limitation starts running from that moment again. So each time you change your software to that point, you will restart the clock in that sense. If it doesn't change the risk profile, then it doesn't, and your ten years remain ten years. 
The extension of the liability has also been put to 25 years in a specific situation, which is latent health injuries. That shouldn't concern you that much, but just for you to know, it's basically pharmaceuticals. That's the easiest example: you realize that you have some damages because of a product, but they took more than ten years to appear. So this is a specific situation, but for software, the ten years is what you need to know. We talked about time limitation, and then we also need to talk about the exemptions. Exemption means that even though your product caused the damage, one of the three types, you might be able to be exempted from your liability. There is a full list of exemptions, I'm not going to go into details, but maybe two are important for you, and I will explain the first one a bit later. One: you did not place your product on the market, but it was placed by someone else. Two: the development risk defence, what we call the state of the art, which I think in your field is the most relevant one. And just to be clear, it's not the knowledge of the developer, it's the knowledge of the community, of the science around. And it's not about the known unknowns, it's only about the unknown unknowns. Only in those cases will you be able to be exempted from your liability. So just to take an example for you, and maybe to make it as clear as possible: the PLD does not apply to any product when it is supplied outside of a commercial activity. This is the same for free and open source software. If your free and open source software is developed or supplied outside of a commercial activity, but someone decides to integrate it into another product, and the product is then sold to a person and causes harm, the liability is pretty clear. The person will only be able to go against the integrator of the software, but not against the developer of the free and open source software that was supplied outside of a commercial activity. 
That's a bit of clarity that is now in the text, which was not there before, but just for you to really understand how it will work. And the very last point is about the clauses I know you have in your licenses. The PLD is pretty simple: no matter your clause, you cannot use it against a natural person that is claiming compensation. So there is no leeway for avoiding liability. If a natural person, so me, you, anyone else, comes against you, has a damage, asks for compensation, brings you to court, you cannot say that you had a clause in your license saying that you will not be held liable. That will not be accepted by a court. That's a general principle that works for everyone, and for any type of product; it is to avoid that the weakest party, namely the consumer, suffers from an imposed contract. But what we have clarified in the legislation is basically: if you, a small company, a very small company, decide that, okay, you sell your software to another company to integrate it, but you do not want to take over the liability, if this is your case, you can then have a clause in your license or in your contract. And in that case, the manufacturer of the overall product, the integrator of the software, will not be able to come against you once he has compensated the natural person. What happens usually is the natural person goes against the manufacturer of the final product, and then it is that manufacturer of the overall product that will go against the other component manufacturers to get part of the compensation. This would then not be possible if you have such a clause. So that's a bit of a small panorama of the PLD. I leave you on that, and I hope you enjoyed it. Thank you. Thank you very much. Perfect. I'll come get it from you now. And to respond for the defence, we have Dirk-Willem van Gulik from the Apache Software Foundation. Thanks, Simon. 
So, yeah, I think in many ways what's happening here is that software is becoming very grown up, just like a phone charger or an electric drill: we're being put under the same rules. Now, I think the positive news here is that in this process, the open source side, the development side, and also the micro enterprises are largely out of scope. However, what I want to stress, and also want to stress about the CRA, is that it is a massive change for our industry. We as open source developers, we're not alone. We're actually part of that IT industry, and the PLD and the CRA will probably, or will absolutely, affect our industry way more than they do open source, because the industry has to come to the table. The industry is squarely in the view of the CRA and squarely in the view of the PLD. So I think one thing we can be positive about and celebrate is that all the worries we had last year around the CRA, and especially about the PLD, didn't really come to fruition. I mean, we've got a fair balance now, I think. But at the same time, as an IT community, we've got some massive challenges left. And I think some of your questions may well be in that area. Thanks. Thank you. Okay. And so we're going to move to an audience Q&A. If you've got a question that you would like to ask the panel, or particularly the guys from the commission, then if you would like to raise a hand: there's a hand raised down here, and Omar is going to moderate for us. I'm Enzo, not Omar. Sorry, Enzo is going to moderate. My brain is gone. Yeah, go ahead. Please, yeah. So we're glad that a lot of the concerns of the open source community were heard. We can't hear you. Yeah, okay. 
So we're glad that a lot of the concerns of the open source community were heard. But for Linux distributions, like for example Debian: we will be exempt, because we don't do anything commercially. But we are worried about our downstream users, which of course use Debian commercially. So for example, a lot of very, very small local IT providers sell computers with Debian, or do other business using Debian and integrating it into their products. And we are worried about how they will be able to comply with the CRA obligations, because they are so small that they can't do it themselves. It would be really hard for them. And also, the margins in the computer industry are not so big that they can just say, okay, I'm going to employ somebody who's doing that. That's not possible for most of them. So that's what we want to have guidance for. And also, it's really difficult for them to understand all these regulations and what this means concretely, in practice, for somebody who's, for example, just selling computers with Debian. Thank you very much. I think it's a very good question. I guess, Benjamin, it's pretty obvious that this question is for you, if you want to answer real quick. Yeah, thanks. It's a great question, I think. So indeed, if you are selling a laptop, for instance, with an operating system installed, if you're building that laptop, if you're the manufacturer of that laptop with the operating system, you will be in the scope of the CRA. And the due diligence requirements as regards the integration of the operating system will also apply to you. I mean, I explained before what due diligence means, right? So there are a lot of ways in which you can do due diligence. The CRA is on purpose not very prescriptive, because we want to give a lot of flexibility to the integrators. But one thing is for sure: it doesn't mean that you can only integrate CE marked products. You can integrate any open source component that you like. 
And there is a myriad of ways in which you can demonstrate that the components that you integrate are secure. I think in a case like this one, where the upstream provider, so the Debian project, is such a massive undertaking, it would be extremely helpful for your integrators if you provide them with useful documentation on how Debian as a piece of software addresses the various security requirements of the CRA. I mean, just because the CRA doesn't apply to you doesn't mean that you shouldn't take security seriously, and I'm sure you do, right? So I'm sure many of the things that the CRA requires, such as access control and so forth, I mean, obviously modern operating systems like Debian do that. So if you document in a transparent manner how you are actually complying with security by design principles, you're essentially doing the work for your integrators, and then they can just recycle that work for their own documentation. So their documentation doesn't need to be heavy anymore. Thank you very much, Benjamin. Is there another question, here, over there? Thank you. Yeah, this is a question for the Eclipse and Apache foundations. Aren't you afraid that you have kind of doomed the software foundations in shielding the developers? Because when I look at this, the first thing that jumped out at me was: okay, I have to make sure that I'm not going to be a software steward. So if somebody wants to pay me for work, then the best thing I can do is dump the project into one of the foundations and make myself just a contributor. Thank you very much. Dirk, maybe first, or Gaël? Right, so I think the question is really: what do I do as a small developer, right? And does this force me to dump my projects into one of the foundations? And I think it's useful perhaps to turn this around. I mean, what is happening here is that society is asking the software developers to start producing good, secure software, to basically use industry best practices. 
Now in open source, we by and large do that. In fact, we pretty much set every industry best practice around security. And it's our downstream people in the commercial markets who are often not updating. I mean, we update log4j within 24 hours, and now, years later, it's still not being done universally. So I think to a large extent the answer to that question is that as developers, basically, we'll have to get more systematic and more explicit about documenting the good things we're doing. And I fully expect that a year from now, two years from now, we will basically all, more or less, have documented that in the same way. Because, I mean, at Apache we've documented some of these things, at Eclipse, at Python, we're basically all doing the same thing. So yes, of course, we're going to steal each other's documents, right? It's open source. I mean, that's just the easiest way of doing it. And then indeed, basically, that foundation style, all those things which are part of being an open source steward, like being sustained in the market, being responsible about these things, simply then becomes much more widely available. Thank you, Dirk. Yeah, just maybe to add something. I hear your point that, okay, if there is some constraint due to the fact that there is an open source steward, I absolutely want to avoid being in this situation. But I don't think that people or organizations bring their projects to a foundation just to avoid the CRA or something like that. The main point is more likely to set up collaborations, or to have a vendor neutral governance, or things like that. I think our main point, in my opinion, is that we help create consortia, and the open source steward is a good way to implement the requirements of the CRA in a context where that makes sense.
Privacy-respecting usage metrics for free software projects
Hello, hello, everyone. Welcome to FOSDEM 2024. Our next speaker is Will Thompson. Welcome him. Yeah. Hello, can you hear me? Great. So, how about this? Is that better? We'll see how we go. Cool. Hi, everyone. Thanks for coming today. I've seen a lot of really great talks in this room over the years. It's a real privilege to be on this side of the auditorium for the first time. So, a little bit about me. I'm an engineer at the Endless OS Foundation, where I've been for seven or so years. And I've been working on GNOME and GNOME-adjacent stuff for longer than that. And today I want to talk about why it's useful for free software projects to collect usage data. I want to talk about how this can be done in a privacy respecting fashion. I'll talk about the Endless OS system for this as an example, maybe an existence proof. I'm not necessarily suggesting that other projects should take what we've built and use it directly, though of course you can. But I hope to encourage other free software projects, on the desktop or otherwise, to consider adopting similar techniques so we can better understand how our software is used. I mentioned Endless; what's that? So I work for the Endless OS Foundation. We are a nonprofit organization. Our vision is simple: the whole world is empowered. And access to the digital tools of the modern world is a prerequisite for being empowered. So we strive to ensure access to these tools and create opportunities for underserved and under-resourced communities around the world. We do a lot of things which are not Endless OS, even though Endless OS is in our name. But it's Endless OS I'll be referring to today. So I'll talk briefly about what Endless OS is and what it's for. In brief, it's an immutable Linux desktop distro. Visually it's GNOME with some modest customizations to suit our target users. The groups we work with typically have little to no previous computing experience, but they have probably used a smartphone. 
You can download Endless OS from our website, and in some parts of the world you can buy it pre-installed on OEM systems. But we as an organization are more focused on working with other nonprofits and with companies aligned with our mission to bring computing to underserved communities. So this might be partnering with another foundation to set up a computer lab in a disconnected rural village, or we might work with microfinance organizations to make computers affordable to low-income families, and so on. And in these contexts, there's often limited or intermittent internet connectivity. So part of the point of Endless OS is that we pre-install lots of apps and lots of offline learning resources, and we make sure the whole system is fully usable offline. So what do I mean when I say the word metrics? I'm going to use the words telemetry, metrics, analytics, usage data and so on interchangeably. Sorry if there are technical nuances to those words. But I'm referring to the concept of end user software, so software that runs on a device in your hands, collecting data about how it is used and then periodically sending this to its developers. You might be saying that sounds a lot like spying. Please hear me out. I'm not talking about that. The other part of the title was privacy respecting. So you might be skeptical, because when people talk about usage data, they're often talking about slurping up all kinds of personal data about each user, building profiles of each individual and then selling them to advertisers. So the easiest way to explain what I mean by privacy respecting is: the opposite of that. Now, the easiest privacy respecting thing to do is to do nothing. You don't collect any usage data. You don't have to write any code. You don't have to think about the ethical or legal issues with the data collection, because you're not doing it. So maybe for a lot of projects, that's fine. And you might ask: why? Why would you do this? 
Well, software is not made in a vacuum. Normally you're trying to help some group of people do something they couldn't do before. And so in order to build good software, it's useful to know how your software is being used. Is it being used at all? What hardware is it being used on? Which features are used? Which features are not used? And so on. And if we have this information, we can make informed decisions about how to build the software, rather than basing it on assumption and guesswork and vision alone. The other strand to this is that a lot of people are developing free software at work. I work for a non-profit, and I would like us to continue to do the work that we do, to advance our mission and also to contribute to the open source commons. And part of doing that is to demonstrate that the work that we're doing has the impact that we are trying to have on the world. And the organizations we work with have similar needs. They need to justify to themselves and to their own sponsors that it's worth putting their time and resources into working with us. So having quantitative data helps to support the case for the impact we're making. And you might say, okay, that's fine, but why don't you just ask your users: run some interviews, do some surveys, some usability testing and so on. Wouldn't that be ideal? And yes, of course, there's no substitute for actually talking to the humans who are using our software. But it's quite rare, particularly in free software projects, to have the resources to scale that. And for some things, users are not consciously aware of the ways they're using the software. There are also limits to what you can learn from a half hour or one hour testing session, as opposed to usage over time as part of doing your day-to-day work or life. It's very useful to find volunteer testers from the community. You can learn very interesting things from that. But those groups tend to also be quite self-selecting. 
So this will skew the results towards people who have a higher motivation to tell you what they would like you to do with your software. So ideally, you want both, I think. You want to talk to end users to explain the why behind what you can find in the data that you have. And in the other direction, having data about how the software is used can drive the kinds of questions that you want to ask your end users. And essentially every website, online store, app, and mainstream OS provides something like this. I'm not arguing that we should do something just because everyone else does it. And hearing that a big tech company does something might often be a reason to do the opposite thing. But there are non-evil reasons to want to do this. And I think it's reasonable to assume that the people who are developing software, free or non-free, typically want it to be good and useful. And other projects have similar requirements and constraints to what I've just discussed. So even with more resources, you can't constantly interview your users. And we're often at a disadvantage compared to commercially backed software. The big ones are in people and time and money. All of these things are, of course, related. And I think that rejecting the idea of collecting usage data outright just creates more unnecessary disadvantages for ourselves. We should want to have the information that we need to focus the limited time and resources that we do have. And we have the opportunity to use the structure and the transparency of free software projects to do something that's actually better than the status quo in the wider industry. We want to respect our users and preserve their privacy, while still being able to make better decisions and make our software serve them better. The kind of axiomatic thing here is: we do not want to collect personal data. We don't want to track individuals. We don't want to sell that data, or worse, have it stolen through some database hack. 
We don't want to serve targeted advertising, and so on. Of course, handling personal data comes with legal responsibilities as well, so if you can just not collect personal data, it's much better for everybody. So if you want to hold a phrase in mind, think tally, not surveillance. An analogy I owe to Cassidy, who's here, is to think about a library. So near me, our local library is run by volunteers. And you might imagine that one day you go to the library and there's someone at the door holding one of these little tally clickers. And for each person that goes through the door, they click it once. And this helps them to get some kind of measure of how well used the library is. Maybe they can collect a similar tally on different days of the week or at different times. And this can help them decide how they staff the library, advocate for more funding from the local government, and so on. The other end of the scale is if you imagine someone kind of following you around in the library, and they're going to look over your shoulder and say: okay, you've gone to the computer book section; you've gone to the children's book section, you probably have a child; okay, watching what you're reading. Obviously this is hyperbole, but this is really not what we're talking about here. So sometimes you can get this kind of tally information from some kind of service that you control. FlatHub is the de facto standard Flatpak repo. We recently announced that it has reached one million users. So how do we measure this? Well, it's measured by a proxy. There is a runtime which, we claim, most users of Flatpak have at least one app installed which uses. And due to the way that Flatpak downloads updates, you can tell the difference between an update and a fresh install. 
So when a new point release of that runtime is made, you simply count how many downloads there are of updates for that runtime in a given period of time, say, a week after you've released the runtime. And this gives you a pretty reasonable lower bound on how many installations of that runtime there must be. And there was no identifier needed. We didn't need to look at IP addresses or machine IDs or anything, just having some knowledge of the ecosystem and how the Flatpak client behaves. And there are other places where this idea is used. In Fedora, there's this thing called Count Me. Endless OS has something similar. DNF, the package manager for Fedora, has to periodically update the list of packages that are available. And the approach here is that in one random request per week, and these requests would be happening anyway, an extra parameter is added: countme. And it has a value which refers to how long it's been since you first installed your system, which gives you some indication of what retention is like for the system. Then from the user agent, it's possible to infer what the distro version is, what variation of Fedora it is, the architecture, and so on. It's a clever idea to piggyback on the metalink request. And again, this lets users be counted without personal data, because there's a fixed frequency. They publish the aggregate data, which I've doctored a little bit to fit on the slide. So here, again, there's the fixed frequency, which meant that no identifier was needed. And there are also these kind of statically determined segments of the user base, which don't identify any individual; they identify a massive group of systems. So the three main ideas for doing something else here: we want to generalize this approach to finer-grained data, but data that we wouldn't otherwise have, because we don't get anything as a side effect of stuff that's happening entirely on your local device. 
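The Count Me mechanism can be sketched as below. The age-bucket boundaries here are illustrative rather than a claim about Fedora's exact scheme, and the request shape is a plain dict standing in for the real metalink query parameters:

```python
from datetime import date

def countme_bucket(install_date, today):
    """Coarse system-age bucket. The value is deliberately coarse so it
    can't identify a machine; boundaries here are illustrative only."""
    age_days = (today - install_date).days
    if age_days < 7:
        return 1   # first week
    if age_days < 30:
        return 2   # first month
    if age_days < 180:
        return 3   # first six months
    return 4       # long-lived install

def add_countme(params, install_date, today, already_sent_this_week):
    """Piggyback the tally on at most one metadata request per week.
    The fixed weekly frequency is what makes per-device IDs unnecessary:
    within one week, each system contributes at most one count."""
    if not already_sent_this_week:
        params = dict(params, countme=countme_bucket(install_date, today))
    return params

print(add_countme({"repo": "updates"}, date(2024, 1, 1), date(2024, 2, 3), False))
# {'repo': 'updates', 'countme': 3}
```

Because the bucket only encodes retention class, the server can report "systems older than six months" without ever knowing which systems those are.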
In the library analogy, they do have this information, I suppose, of which books people are borrowing. It doesn't matter who's borrowing them; it's just, in general terms, what's popular. So, the three ideas here. The first one, which I've mentioned already, is recording on fixed frequencies. If you record information on a daily basis, this means that you can be sure, when you look at the data, that if two different events on the same day appear, they must have come from two different systems. But you don't have to identify which particular system they came from, and you can't tell whether the events from one week came from the same systems as the events from the next. On the other axis, we're not interested in individual events; we're interested in patterns of usage data. You generally want to be able to compare those patterns between different groups of your users. Maybe it's by software version, maybe it's by locale, maybe it's by hardware. It depends on what you're trying to learn. But these are determined ahead of time. They're static, and they are common across a large group of users, or devices, rather, I should say. The third piece is client-side aggregation. Some kinds of data are instantaneous, or collected on a timer; those are easy. But some things are continuous data, which this doesn't work for. For example, app usage: you might want to understand which apps are used the most in terms of time. This is something where you might, on a given day, open and close an app several times. You need to do some kind of client-side aggregation to turn this continuous value into a single data point on a fixed frequency that you can record by itself. So the Endless OS metrics system, you'll be shocked to hear, works as I've just described. It breaks down into a few components, which I'll go through, kind of following the direction of the arrows in this diagram. 
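The client-side aggregation idea just described can be sketched as follows: multiple open/close sessions of an app collapse into one seconds-per-day total, and a session that runs over midnight is split between the two days. This is a minimal model of the behaviour, not Endless's actual daemon code:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def daily_usage(sessions):
    """Turn (start, stop) usage sessions into seconds-per-day totals.
    Several sessions on one day coalesce into a single data point, so the
    recorded value no longer reveals individual open/close events."""
    totals = defaultdict(float)
    for start, stop in sessions:
        cursor = start
        while cursor < stop:
            # End of the current chunk: either the session stop or midnight.
            midnight = datetime.combine(
                cursor.date() + timedelta(days=1), datetime.min.time()
            )
            chunk_end = min(stop, midnight)
            totals[cursor.date().isoformat()] += (chunk_end - cursor).total_seconds()
            cursor = chunk_end
    return dict(totals)

sessions = [
    (datetime(2024, 1, 30, 10, 0), datetime(2024, 1, 30, 10, 30)),  # 30 min
    (datetime(2024, 1, 30, 23, 0), datetime(2024, 1, 31, 1, 0)),    # crosses midnight
]
print(daily_usage(sessions))
# {'2024-01-30': 5400.0, '2024-01-31': 3600.0}
```

The same loop handles month boundaries if you key the totals by month instead of by day.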
We'll talk about what happens on the end user's device, then how that's transmitted to a server, and then what happens once the data reaches the server. So for local event recording, we have a daemon which runs system-wide. It's a D-Bus service, and applications on the system use a D-Bus API to talk to the daemon to record when certain events happen locally. Some of these components are just regular system components doing the things they normally do. Our updater, for example, which you can see in the red box in the bottom left, records an event when an update has failed. There's also one extra daemon, this metrics instrumentation thing, which is for capturing just general stuff about the system: CPU information, disk usage, and so on. We actually also have a mediocre crash reporting system using this mechanism. It's not ideal, but it's better than nothing. And as we'll see, it works for a system which is intermittently connected. So each of these events has some kind of payload associated with it. Let's zoom in on the red event from the updater. When an update fails, we capture some information here. We capture the time at which the update failed. We capture the OS version that it occurred on. We have this UUID. Now, this is not specific to this event that happened on this one machine. This ID is the same for all updater failures. It identifies the category of event that occurred. And then we have a payload, which in this case is just a human-readable, localized error message. And that's kind of gross. We have some nasty pattern matching to untranslate the string in some cases and take out the values that vary, just to narrow this down. We transmit the raw event because it was the only practical thing to do given the way error handling works in the updater. But it's still very useful. From this we can determine the most common reason updates fail: the disk is full. 
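That "nasty pattern matching" to untranslate localized error strings might look something like the sketch below. The message templates, locales, and error codes here are invented for illustration; they are not the updater's real strings.

```python
import re

# Map known message templates (in whatever locales we ship) to a stable
# category code, discarding the values that vary (paths, sizes, ...).
# These patterns are illustrative only.
ERROR_PATTERNS = [
    (re.compile(r"No space left on device|Espace insuffisant"), "disk-full"),
    (re.compile(r"Network is unreachable|réseau est inaccessible", re.IGNORECASE),
     "network-unreachable"),
]

def normalize_error(localized_message):
    """Reduce a human-readable, localized error to a category string,
    so aggregation works across locales and across varying details."""
    for pattern, code in ERROR_PATTERNS:
        if pattern.search(localized_message):
            return code
    return "unknown"

print(normalize_error("Erreur : Espace insuffisant sur le périphérique"))  # disk-full
```

Anything that doesn't match a known template falls through to "unknown", which is itself a useful signal that a new failure mode has appeared.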
The updater runs in the background, so it's unlikely that people will be actively checking it. So it's useful for us to know: are there fixable errors that we can sort out somehow? I also talked about app usage. We've patched GNOME Shell on Endless OS to record how long particular apps are used for. And this is one which gets aggregated in the way I described earlier, where you coalesce this continuous variable and slice it by day and by month. Here I'm showing by day. And it's actually the metrics daemon which does this. The shell tells the daemon: start recording an event with this UUID and this payload. And then sometime later, when you close the app, it says stop. And the daemon takes care of coalescing multiple instances of that into one in any given time period, and slicing it if it runs over midnight or over the end of a month. Okay, so now we've got a load of events buffered in this daemon. We have an in-memory and an on-disk buffer with a size limit, so we just delete old stuff if we run out of space. And then, if and when you're online, the daemon periodically reports these to our server and then deletes the local copy of the events. This is an HTTP request. You might be saying: you said there's no device identifier, but yes, there's an IP address. We'll come to that; that's an artifact of the internet. And this upload contains as many of the cached events as we can fit in a single request, plus a timestamp. Actually, there's more than one timestamp; there's a clever algorithm to correct for incorrect clocks. And a channel. What's a channel? This is the kind of static segmentation that I referred to earlier. On Endless OS we have just a couple of things here. There are some flags for: is it a standalone install, a dual boot, or a live system? Interesting. But the main thing is this image identifier. And this is an artifact of how we build and distribute the OS. 
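The local buffering just described can be sketched with a bounded queue; the class name and sizes are made up, and a real daemon would also persist this to disk:

```python
from collections import deque

class EventBuffer:
    """Size-limited local event store: when the buffer is full, the
    oldest events are silently dropped, and events are deleted locally
    once a batch has been handed off for upload."""

    def __init__(self, max_events):
        self._events = deque(maxlen=max_events)  # old entries fall off the left

    def record(self, event):
        self._events.append(event)

    def take_batch(self, batch_size):
        """Remove and return up to batch_size events for one upload.
        The local copies are gone once this returns."""
        batch = []
        while self._events and len(batch) < batch_size:
            batch.append(self._events.popleft())
        return batch

buf = EventBuffer(max_events=3)
for n in range(5):
    buf.record(f"event-{n}")
print(buf.take_batch(2))  # ['event-2', 'event-3'] -- events 0 and 1 were dropped
```

Dropping the oldest events first is one policy choice; dropping the newest, or dropping by category, would be equally easy to implement here.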
When you install Endless OS, you're taking a disk image, which has been pre-built with a load of apps in it, and you install that by just dd'ing it directly to your disk. You image the disk with the same image. And we have customized these in various dimensions. There's this product ID, which is how this came to end up on a computer. Was it a download version? Was it an OEM partnership? Or was it another organization we're working with? Or is it a custom-built image that someone has built using the tools we provide? There's some other stuff about the original OS branch that was installed and the hardware architecture. And then there's this personality, which again is an artifact of the way the OS works. If you're pre-installing lots of learning resources, you want them to be in the language that the user speaks. So we have different variations for different locales. And we have a basic one which doesn't contain all of the massive reference apps. And when we work with partners or on particular projects, we often make a customized version for that. And that identifier ends up in this personality field. So if you go to the website today, or in fact at any point since the third of January, and you download the French version, you will get this image. It has this OS product, which is what we refer to as the download version, some attributes about the branches, the timestamp of when that image was built, and the personality. And so any system installed having chosen French will end up with exactly the same identifier ever since the start of the year. So this is what's on my laptop. And I happen to know that there are only two other users of this one, and one of them is over there. That's a unique case, because we built this specially for a bunch of laptops in the UK Endless team, and we never published this image. That's an edge case; in general, the same OS image is used by many different systems. 
So we have submitted a batch of events, together with the channel, to the server. What happens? Well, first of all, we discard the IP address. We don't want that. The HTTP endpoint adds yet more timestamps to this bundle of events and puts it in a Redis queue. Then something totally separate, which has no idea where this bundle of events came from, pulls from the Redis queue, splits the events apart, and stashes them into a SQL database. There's one table in that database for each category of event. So I talked earlier about the daily app usage event. This table has a field for the day, a field for the app, and a field for the duration. In this example, of course, in the real database, there'd be many more rows. But just by way of example, you can see there were two different GNOME Terminal events on the 30th of January. So we do know that there are two different systems. We don't know if the Chromium user on the same day was either one of those two users or a third user. The next day, there's an event for GNOME Terminal with two and a half hours' usage. We don't know if that was any of the two or three users we've already talked about, or a fourth user. We also have this aggregation by calendar month, which has higher latency, but it tends to be less noisy. And these tables are not linked to a device identifier. They're linked to the channel that was associated with the event. And that has this image identifier, which is shared between many systems. And so we can't match up which different events came from the same system. We can't even identify which different instances of the same event came from the same system. Of course, there's an element of trust in this: the server could be behaving not in the way I described. The best answer we have for this is that we're not doing that, and the server is all open source. So you can go and take a look: what's on our GitHub is what we run. 
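The per-category tables just described can be sketched with SQLite. The schema below mirrors the daily app usage example from the talk; the channel values and row data are made up. The crucial property is that there is no device-identifier column, so rows can only be grouped by day, app, and channel.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_app_usage (
        day     TEXT    NOT NULL,  -- calendar day the usage happened on
        app     TEXT    NOT NULL,  -- application identifier
        seconds INTEGER NOT NULL,  -- aggregated usage for that day
        channel TEXT    NOT NULL   -- image identifier shared by many systems
        -- deliberately no device or user identifier column
    )
""")
conn.executemany(
    "INSERT INTO daily_app_usage VALUES (?, ?, ?, ?)",
    [
        ("2024-01-30", "org.gnome.Terminal", 3600, "fr-download"),
        ("2024-01-30", "org.gnome.Terminal", 1800, "fr-download"),
        ("2024-01-30", "org.chromium.Chromium", 900, "fr-download"),
        ("2024-01-31", "org.gnome.Terminal", 9000, "fr-download"),
    ],
)

# Each system records at most one row per app per day, so two Terminal
# rows on 2024-01-30 prove at least two distinct systems used Terminal
# that day -- but nothing links the Chromium row to either of them.
(distinct_systems,) = conn.execute(
    "SELECT COUNT(*) FROM daily_app_usage "
    "WHERE day = '2024-01-30' AND app = 'org.gnome.Terminal'"
).fetchone()
print(distinct_systems)  # 2
```

Queries like "total Terminal seconds per channel per day" work fine on this schema; "did the same system use Terminal and Chromium" simply cannot be expressed, which is the point.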
And the system is on by default, because we've designed it to be privacy-respecting. When you first install Endless OS, like many GNOME systems, you get an initial setup wizard, which takes you through some steps to set up your system. This is actually from the development branch; it looks a little different in the released version. There's a toggle for enabling or disabling this feature. The toggle is enabled by default, but nothing will be submitted until the user setting up their system has gone past this page and continued to the end of initial setup. If you set the switch to off, then nothing will be captured, and anything that's already been buffered but not submitted will be deleted. Of course, you can control this later. Once the events have been submitted to our server, there's no way for us to delete the events for a particular system, because we don't know which events came from which system. And defaults are very powerful. The overwhelming majority of systems leave the default enabled. You might say, well, of course they would. Everyone likes defaults, right? The point of this is to get more representative data about a large body of systems. The system collects no personal data; it's designed not to be invasive. Being on by default keeps us honest about that. We really have to be sure that we're not collecting anything questionable. And some people, you can see some number here, may prefer that we don't do this. Of course we allow that, but we don't force everyone to make a choice. Decision fatigue is real, particularly during first boot. We've seen that people get scared off by the number of questions that are asked. What's a keyboard layout? So adding more questions which people don't have the context to answer is not necessarily helpful. I acknowledge that not everyone agrees. There are other opinions. This is what we do for now. So, what have we learnt? Some people may have read a blog post that I wrote six months ago with some examples of what we learnt. 
For those who have read it, everything here is new. Parental controls. Some time ago, we developed a feature in Endless OS to allow parents to disable access to certain apps which are installed on the system, to control whether their child using the system can install new apps, and to set age rating thresholds on those. As part of integrating this into GNOME, which is now upstream (this screenshot is from GNOME OS), we added this to the initial setup flow, so it's more easily discoverable. When you create a new user, as well as choosing their name and the username, you can tick a box, which is a little out of focus in the screenshot. The box at the bottom says: set up parental controls for this user. It's unticked by default, but some people tick it. If you tick it, three things happen. The user you create is a standard user, not an administrator. A separate administrator user is created with a separate password. And then, on the very next page, you're offered the option to choose which parental controls you want to apply to this child. Now, in this screenshot, if you squint at it, no controls have actually been applied. The default is that you have to actively choose which things you want to restrict. Do you want to restrict access to web browsers? Do you want to turn off certain apps? Do you want to set an age limit on which apps people can install from GNOME Software? We instrumented this, and a large minority just left the defaults. So 40-something percent of people who chose parental controls didn't actually enable anything. That doesn't tell us why they didn't do that. I mean, you can come up with some good theories, but it tells us that there's research to do in this area, and it can help guide what we do next with this feature. The tour. GNOME 40 introduced a tour that's offered when you first log in, whether you've previously used an older version of GNOME on the system or this is a fresh installation. 
Endless OS 5 was the first release to include GNOME 40, and it looked, as I showed you earlier, very GNOME-y, which is rather different to what previous versions of Endless OS looked like. So we inherited this tour. When you first log in, you get this prompt, and if you choose to take the tour, you get a tour which just briefly walks you through how to use the desktop. I was curious whether people actually take it, so I added a very quick patch to instrument this. This isn't really a show-me-the-code kind of talk, but just as an example, this is what you need on the client side. It's legible. The top line is where we define a constant for the UUID; we just generated an ID. Then you have the two lines where you create the payload, which is a Boolean that is true if they chose to take the tour and false otherwise. And then we call this method on the event recorder class to record the event. That's all you need on the client side. This is a small C library around a small D-Bus API, and there's GObject Introspection around it, so you can access it from JavaScript and Python and other things. Then the server. This is using SQLAlchemy as the ORM. You define a table like this, which has the name of the table and the same UUID (again, this is for all events in this category), the payload, and how to turn the payload into an instance, or a row, of this table. It's a little annoying that you have to do database migrations to add or remove events on the server. That's the downside of having the data in these nice structured tables, but there's an upside in that we can generate the documentation, which is on Read the Docs, of which events the server understands. So, the results are in. We captured this bit of information from 35,000 systems, and across those 35,000 systems about 19% chose to take the tour. 
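The server-side mapping just described — one table per event category, keyed by the category UUID, with a rule for turning a payload into a row — can be sketched without SQLAlchemy as a simple dispatch. All the identifiers, field names, and the UUID below are hypothetical, not the real Endless schema:

```python
import uuid

# One handler per event-category UUID, converting a raw payload into a
# row dict for that category's table. Everything here is made up for
# illustration; the real server maps payloads onto SQLAlchemy models.
TOUR_TAKEN = str(uuid.uuid5(uuid.NAMESPACE_DNS, "tour-taken.example"))

HANDLERS = {}

def event(event_uuid, table):
    """Register a payload-to-row converter for one event category."""
    def register(fn):
        HANDLERS[event_uuid] = (table, fn)
        return fn
    return register

@event(TOUR_TAKEN, "tour_taken")
def tour_taken_row(payload, channel, day):
    # Payload is the Boolean from the client: did they take the tour?
    return {"day": day, "channel": channel, "taken": bool(payload)}

def store(event_uuid, payload, channel, day):
    """Look up the handler for this category and produce (table, row)."""
    table, fn = HANDLERS[event_uuid]
    return table, fn(payload, channel, day)

table, row = store(TOUR_TAKEN, True, "fr-download", "2024-02-03")
print(table, row)
```

Adding a new event category means registering one more handler (and, in the real system, a matching database migration), which is the mild friction the talk mentions.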
My assumption was that more people who were upgrading would take the tour than new installs, because if you're upgrading, you're surprised: this looks a bit different, what's going on? Actually, it's the reverse. In the top row we see users who are fresh installations, and 32% of those users took the tour, out of 5,000 total in the period we sampled. Whereas for upgrades from Endless OS 4, it was just 15% who chose to take the tour, out of a total of 29,000. This is just a snapshot, because now that we've answered the question, we've deleted this data. We've erased the data from the database. We added the UUID to a list of events which get discarded as soon as they're received from old clients. We've also updated the OS to remove the three lines of JavaScript I showed you, so we no longer collect this data on up-to-date systems, and we discard it if we receive it from old ones. This is the part where I talk about all the things that are subpar about this system and what we might do in future. The big one is that it's actually really annoying to have the data split up in this way. All the app usages are atomized, and we can't answer questions like: does someone who uses app X also use app Y? Is there any correlation between groups of apps that people use? We could, of course, submit one event which contains all of the apps that a given user uses, but maybe that starts to get a bit too fingerprinty. It would be nice to find some way to answer questions like that without implicitly fingerprinting users. It's also hard in general to slice this in new dimensions that you haven't already chosen to slice by. One question might be whether parentally controlled accounts behave differently in some ways from accounts that do not have parental controls enabled. The parental controls flag is not part of the channel, so we can't see, for any other event, whether it came from a parentally controlled user or not. This is all just a consequence of what you choose to slice the data by. 
I think the trade-off is worth it, but I need to acknowledge that it is annoying not to have an identifier. There's also some kind of indeterminate upload latency. The problem here is: how do you know when you have basically all of the data for the last time period? It's particularly bad for monthly events. Today is the third of February. Let's say I left my desktop at home and I switched it off on the 31st of January. We can't submit any data for January until February has started, because otherwise we might have to add a bit more to the tally after the fact, and you can't do that on the server. Now my computer at home is switched off while I'm here. I'll switch it back on on Monday. That's the fifth. That's a five-day lag. Is that typical? Maybe we could look at the timestamps when we receive the events, but we can't do that, because we don't store the received timestamp for each event; if we stored that, we could figure out which events came together. You can probably imagine ways to solve this by reducing the precision of timestamps, and I think that's true in general. There are some cases where we have more precise timestamps than we might like, largely for historical reasons. There are some complications if you can't assume that the local clock is accurate. Of course NTP exists, but many Endless OS systems are used mostly offline, and it's also quite common, we found, for the real-time clock battery to have run flat. So it's not that unusual for people's laptops to have a totally incorrect time until they connect to the internet; and then, when they go offline and run out of power, it goes back to some time in the past. There's a lot of research into how to randomize the data that's submitted: randomized response, differential privacy. I'm sure there are people here who know more about this than me. We haven't really explored this, but the basic intuition is that you add noise to the data you record. 
Suppose you're recording a coin flip, maybe the parental controls one as an example. In 50% of cases you just always say true, and in the other 50% of cases you submit the true value. That of course changes the results you get, but once you aggregate it, you know that of the 100 responses you get, you expect to see 50 trues just from the coin flip, and so then you can look at the rest of the batch of events to figure out the true ratio, without actually having to know whether any individual data point which says true is really true. This might be a way to allow collecting more interesting facts without getting into personal data. There are lots more questions we might like to ask about the software we ship. There are questions like: are most desktop Linux systems single-user, or do people have multiple different Unix users on the system? What are the common monitor configurations? How common is it to have an external monitor most of the time? Do people change this around? Do people have their screens arranged horizontally or vertically or in a cool circle shape? Do people use workspaces? How do they use them? Which GNOME Shell extensions are in use? I could go on for an hour; I won't do this. I think this data would be much more interesting if we had comparable data from other GNOME distributions. I'm using GNOME as an example just because that's what we ship; insert project name here. Every distro reaches a different group of people. Those groups will have different behavior. For example, I would claim that the typical Fedora user is probably quite different, geographically, perhaps economically, perhaps in terms of technical skills, from the typical Endless OS user. If we had a common structure of data that was shared between all users of a given project, we could compare how the same upstream software is used in different contexts. Other organizations who do this kind of telemetry have public dashboards of the aggregate data. I showed you Fedora's published data from their repo servers. 
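The coin-flip scheme described at the start of this section can be sketched as below. This is the simple variant from the talk — half the time report true unconditionally, otherwise report the real value — not a full differential-privacy mechanism, and the sample size and rates are invented:

```python
import random

def randomized_response(truth, rng):
    """With probability 1/2 just report True; otherwise report the real
    value. No single response reveals the underlying truth."""
    if rng.random() < 0.5:
        return True
    return truth

def estimate_true_rate(responses):
    """E[observed] = 0.5 * 1 + 0.5 * p, so p is about 2 * observed - 1.
    (On small samples the estimate can fall outside [0, 1].)"""
    observed = sum(responses) / len(responses)
    return 2 * observed - 1

rng = random.Random(42)
true_rate = 0.4  # the population rate we pretend we don't know
responses = [
    randomized_response(rng.random() < true_rate, rng) for _ in range(100_000)
]
print(round(estimate_true_rate(responses), 2))  # close to 0.4
```

With 100,000 responses the estimator's noise is tiny, which is exactly the trade: individual records become meaningless while aggregates stay accurate.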
Mozilla has this great Firefox Public Data Report, which gives you daily active users, monthly active users, version statistics, locales, top add-ons, and you can slice this by country as well as looking at it globally. Steam has this very interesting hardware survey. They've made a different choice: it's opt-in, with a pop-up dialogue, and still anonymous. It's very interesting. The median gamer is probably quite a different user to the median desktop Linux user. Kind of, you know, a little tongue-in-cheek, haha, only serious. In December, Spotify publishes this thing where you can open your Spotify app, and it tells you, in really garishly bad images, if you are, like, one of the top 100 listeners to some artist. And you see a lot of people remarking when they do this that this is kind of creepy; they have all this data. It's very cynically free marketing for Spotify. Now, of course, that's true. It is free marketing for Spotify. Other streaming services are available. But it's also fun and sociable. I've had conversations off the back of this that I wouldn't have had otherwise. And maybe we can have free marketing too. But we could do this differently. The central entity doesn't know anything about any individual, but we could potentially publish percentile distributions. And then, on the local device, you could fetch this and determine: oh, right, you are actually in the top 5% for some activity. Maybe this is a bad idea. I don't know. Anyway, just to wrap up, I hope to have made the case that telemetry doesn't have to be creepy. There are ways that you can gather data about how your software is used without being invasive and building profiles of your users. And in an industry where I think not enough thought is given to this, I think we in the free software community can lead by example. 
We can build something that is better and allows us to improve our software while showing a better way is possible to the broader industry. And if we do that, we can make decisions based on the combination of data and vision. The two work together to make something that's really great. Tomorrow morning, there's a telemetry BoF in room AW1.121. I hope to see other interested parties there, and for people to tell me about all the prior art that we didn't know about. That would be great. Hope to see you there. Otherwise, that's all I've got. There are some various links. If you follow me on Mastodon, don't expect too much discussion of this, but you're welcome to come. My blog has an older write-up on the same topic, which has some more and some fewer details. And the source code is on GitHub under the endlessm organization: the server, and the event recorder, which is the service that buffers and submits the events. endlessos.org is the place to go for more information about the Endless OS Foundation and our work. Thank you. Does anyone have some questions? Please raise your hand. Oh, okay. I was wondering, you showed us that 10% opted out of sharing metrics, but how do you know that? So, in case the question didn't come across the PA: I think, if I'm right, the question is, I said that 10% of people opted out; how can I know this? I mentioned that we have a system similar to Fedora's Count Me system. It sends a daily ping with a retention counter, with no other identifying information, plus a Boolean which says: is the full-fat metrics system on or not? Okay. Thanks. Does anyone have another question? Oh, I see you. Thank you. Hello. Hello. So, your talk has mainly been focused on how to get metrics. Sorry, I can't quite hear the question. Sorry, I couldn't quite hear what you were saying. Is this coming through? Yeah, okay. So, your talk was mainly focused on anonymous metrics, effectively, making them as unidentifiable as possible. 
And you did say that one of your problems is, if you wanted to aggregate, if you wanted to correlate these metrics to figure out: okay, if person X uses this app, do they also use the other one? Have you given any thought internally to how you might do this in a way which wouldn't impact privacy? You mentioned fingerprinting as being one concern. Have you elaborated on that at all? I didn't catch all of that, but I think you're saying that I mentioned that we would be interested in knowing this. The talk is mainly focused on an anonymous system, and this is one of the reasons we can't answer the question: who uses both app A and app B? And, if I understand the question right, it's: do we have any ideas for how we could do this effectively? Yes. Okay, there are a few ways you could do this, right? One idea that we haven't explored, but I think would be interesting, is to layer onto this an opt-in system. So you could prompt people to be part of a time-limited study, and you could temporarily add something extra to the channel which identifies them specifically for a fixed period of time. Then we'd turn it off on the client side, analyze it on the server, and then delete it. And I think it's easier to add more stuff to the channel than to remove it. The other way to do this would be to look at some of these differential privacy techniques, and then submit a single event containing aggregate app usage for all apps on the system in any given week, let's say, but add artificial noise to that. So with some probability, change the numbers, replace the names of the apps, remove apps from it, in a more systematic way than just shuffling it around. And there are techniques you can use to add noise while keeping the distribution of the data the same. We haven't had an opportunity to go into that, but I think that's probably, in the general case, the way to address those points. Maybe there are other ideas. 
I'd love to hear more. Thanks. Any questions? If you have any questions, you can raise your hand. Hello. We still have 10 minutes for questions. 10 minutes left. Any questions? You can raise your hand. Okay. A round of applause for the speaker. Thank you very much.
Learning from disaster response teams to save the internet
Hi, everyone. It is great to be here. Thank you for coming to this talk. If you're here for the magic show, I'm afraid you have 30 minutes to wait. I'm here to guide us in an exploration of what we, as a community, as open source practitioners, can learn from some of the most finely tuned and highly performant teams in the world: first responders. Through the interdisciplinary lens of social network science. So perhaps there is some magic in this talk: the magic of people working together. My name is Hannah Aubry. I lead Fast Forward at Fastly. Let's save the internet. In a past life, I was lucky to serve as a study coordinator at SONIC. No, not the one with the roller skates and the hamburgers. Dang, I knew that joke wouldn't play in the EU. The Science of Networks in Communities research group. SONIC advances social network theories, methods, and tools to better understand and meet the needs of diverse communities. They develop cutting-edge techniques to study and improve social and knowledge networks in distributed working groups, online communities, virtual teams, and other large communities like the one we're all in. I am thrilled, and a little bit starstruck, to share that the director of SONIC, Professor Noshir Contractor, is here in the audience today. Thank you for coming, Nosh. And my dear friends, if you have any tough questions, please direct them at him. Let's start with a history reminder. Our earliest ancestors not only had to contend with the same natural disasters we experience today, they also had to adapt and survive nature itself. First, we became bipedal, freeing our hands to reach and to grasp, and also to communicate simply with each other. Next, we developed complex brains with prefrontal cortexes, our personality centers, which enabled us to make split-second decisions based not only on external stimuli, but also on our past experiences. 
Then we developed symbolic language to communicate complex ideas, and then, finally, tools to take control of and shape our surroundings. So you see, what makes us uniquely human, what has actually brought us here together today, the abilities to ponder, convene, reflect, build, collaborate, and coordinate, are not only what make us so special, but also so successful. Then our tools got a lot better. The first fire pump was invented in Alexandria in the third century BCE. Unfortunately, it could not save the library, but I digress. As societies and civilizations began to form, the blast radius of disasters grew. We settled into towns that could burn down and buildings that earthquakes could topple. And so those smart brains of ours formed teams whose sole purpose was to patrol and respond to natural and man-made disasters, in the form of firefighters and police forces. Then societies became more complex, and with that came more complex disasters: not only fire and flood, but we created monetary systems and banks that could collapse, and food systems that were prone to mass famine, not always for lack of food, but sometimes for lack of transportation or poor planning. Our close proximity to each other in cities, and long-distance cultural exchange made possible by ships, brought diseases, colonization, and war, which ravaged human populations. We think of these ages as dark or undeveloped, but their responses to such crises were surprisingly neither. In fact, we begin to see thoughtful and multifaceted disaster response: not only search and rescue or medical aid, but tax relief, temporary infrastructure, even what we now call refugee camps, providing long-term food and shelter for displaced peoples. In 1493, the Knights Hospitaller shipped doctors and surgeons to the Greek island of Kos after an earthquake. And so we see some of the first evidence of multiple different groups or organizations coordinating across disciplines and borders to respond to a disaster. 
In the intervening years, we've continued to hone our disaster response strategies. Humanity's impact on this planet has required us to do so. And besides, those prefrontal cortexes of ours have a lot more data to lean on than our friends the ancient Alexandrians had. If they knew then what we know now, maybe they could have saved that library. I should pull it together. Anyway, today we have entire organizations, governmental bodies, NGOs, and community groups dedicated to such activities. We have laws, by country and internationally, to enshrine basic human rights and ideal responses in crises. And now we're building a new frontier, a new form of transit. We're creating massive new civilizations, hosted on smallish, inscrutable, blinky boxes. In this new world, we can't even really see the threats, the crises. We're throwing people together in a way that's affecting global social structures and people's everyday lives. Like every form of infrastructure, like most every place where humans gather to live, to work, to learn, to play, the internet has grown up in an unplanned way. And we're still scrambling to understand it, to learn from our mistakes, to apply those lessons, to build the best internet, to build systems that protect people and systems that react when people are harmed. But don't worry too much. We'll survive these dark ages. Our species has survived every disaster it's encountered, at least so far. A common organizational structure found in groups undertaking large-scale operations to solve big, big problems is called a multi-team system: a system comprised of multiple teams working towards a shared goal. These structures can be found throughout all sorts of industries, working on all sorts of problems: disaster response, space exploration, governing humans, building stuff. If you're part of a business with multiple departments, you're in one. If you attend or work at a university, you're in one. 
And if you maintain, contribute to, support, or care about an open-source project, you're also in such a system. Because no matter what corner of the internet you occupy or which technology you contribute to, you're working in service of our shared mission to keep the internet open and free. So what makes up a multi-team system? Within the superordinate team, the entire system, we have local teams working on local or proximal goals, which may even be split further into component teams. And directing the subordinate teams is the leader, or perhaps the team of leaders, which shares a global or system goal. And when you examine these teams using social network analysis, you find common patterns between successful MTSs. There are many more patterns we could discuss, but let's focus on three: a plan for coordination paired with frequent, clear communication; highly performant and resilient local teams; and finally, empowered and effective leaders who are willing to sacrifice their local goal in service of the global goal. So before we explore each of these patterns, I want to share this diagram with you to underscore the importance of these patterns in disaster response. Because that term, disaster response, makes such activity sound reactive, doesn't it? But in reality, the most effective disaster responses begin long before the disaster happens, or, second best, right after a disaster occurs. So I ask you to bear that in mind through the rest of this talk. After all, the best time to plant a tree was 10 years ago, and the second best time is today. First, let's talk about planning, coordination, and communication. I don't think I need to talk about docs too much here. I think the OSS communities know this one quite well. And engineers know all about retrospectives. Like I mentioned, disaster response begins well before the disaster occurs. 
So in terms of coordination and communication, knowing where to turn for help or resources before a disaster occurs spares valuable time, energy, and mental load during a crisis. Effective communication prevents errors in the field, helps the even distribution of resources, and helps us learn from the mistakes we made last time so we don't make them again next time. During disasters, response teams crucially over-communicate. They share reports on the situation as it evolves. They communicate with stakeholders on the ground. And they report changes or progress to make the best decisions. Leadership and subordinate teams must have the most accurate and up-to-date information. Because knowledge sharing fosters a coordinated and collaborative environment. It reinforces the multi-team system as a single unit, not a set of separate teams. And because knowledge sharing makes it easier to be flexible and adaptable in rapidly changing environments. Interestingly, research has found that inter-team communication, communication between local teams, is more important to the success of the whole system than intra-team communication, communication within the local team. So in fact, there's actually a Goldilocks zone of inter- to intra-team communication. Local teams should communicate half as much between teams as they communicate within their own team. Any more inter-team communication than that and performance declines; any less than that and it declines too. When we talk about the viability of a team, we mean the success of the team. In moments of disaster or crisis, the stakes are life and death. And at the end of the day, disaster response teams, and open source maintainers too, they're people. They have feelings. So viable teams, successful teams, support each other. They lend a hand. They take emotions into account when making decisions. Viable teams engage in what are called disruption-buffering behaviors, which is to say change management. 
They try to anticipate changes that may occur and plan ahead in the event that some change or disruption occurs. And again, they support each other through those changes. Viable teams also try to balance performance and resilience, because when you work with people and you're so hell-bent on performance that the team's physical or mental health is at stake, the team becomes brittle and the team does not perform well, because people do not want to be a part of such a team, right? So I'll say that again: there's a difference between successful teams and teams that people want to be a part of. And in the long term, teams that strike the right balance are the ones that are the most successful. Finally, the most performant teams strike the right balance between boundary-reinforcing behaviors, which is to say reinforcing the identity or team spirit of the local team, and boundary-diminishing behaviors, which reinforce the local team as part of something larger, as part of the whole system. So a little bit of silo is actually good, but not to the extent that teams develop an us-versus-them mentality. Which brings us to our last assertion today: empowered and effective leaders. Strong leaders serve as an ambassador to the team and for the team. Internally, they help teams understand why the team has a certain goal or is performing some task. Within the system, they advocate for the team's priorities and points of view. Those are called boundary-spanning behaviors. They make sure that the team has the information it needs, not only the what but the why of a task or priority, so that they understand their own team's priorities. In a disaster response scenario, time is of the essence. Rapid decision making allows teams to quickly assess the situation, evaluate available options, and act promptly to address emerging challenges. 
Delays in decision making can lead to missed opportunities, increased risks, and further escalation of the situation. And as much as we're proud to be a part of our own team, we must recognize and understand other teams, respect and contribute to their priorities, and not be too selfish in our own focus. That's why a crucial feature of successful multi-team systems, of disaster response effectiveness, is that local leaders and teams are willing to sacrifice their local goal if it means more for the common good. So now that we've immersed ourselves in the theory of effective multi-team system performance, let's illustrate it with a real-world example. I recently discovered this amazing YouTube channel. It's called Brick Immortar. It's all about infrastructure disasters, ship sinkings, critical failures. It's fascinating. If you're into this kind of stuff, check it out. You'll never look at bridges or tall buildings the same again. The sinking of the ferry MV Sewol on April 16, 2014, off the southwestern coast of South Korea, was a disaster not only in and of itself, but also a disaster of multi-team system performance. Over 300 people paid the price for these failures with their lives. On what seemed to be a trip like any other, the ferry suddenly made a series of sharp turns. But as we know, a disaster such as this starts long before the immediate catalyst. Over the years, this ferry had been repurposed many times, and additions had been made that affected its balance point. For this trip in particular, the ship had taken on excessive cargo, which compromised the vessel's stability and made it more susceptible to capsizing. What's more, the ship's crew had drained the ballast, that's water kept in a ship to make sure it doesn't sink, to make sure it's properly balanced. They didn't want it to sit too low in the water; they wanted to be able to pass inspection, knowing they'd taken on way more weight than they were supposed to. So, the communication breakdowns. 
First, when the ship began to list, the captain refused to send a distress call during the crucial first moments, delaying rescue efforts as the ship began to sink. He told passengers to go to lower levels of the ship, after refusing to tell them anything about the impending disaster, during crucial moments when they should have been getting onto the deck, getting ready to be rescued. When he finally sent the distress call and rescue ships came, they quickly learned that the actual communication infrastructure, the radios the ship needed to call the disaster teams, were either malfunctioning or broken. Something had gone wrong with them. So despite the rescue teams trying to raise the ship's crew on the radio, vital communications failed during those crucial first moments. So you can see the ferry Sewol had no plan for intra-team communication in the event of a disaster. They coordinated poorly, not only within their local team but also with the rescuers. So they failed to inter-communicate with the other local teams. So the system, the global team, failed. For the sake of this section, let's quickly divide up the various local teams. The crew is a team, the rescuers are a team, the passengers are a team, and the South Korean government is a team. What were each of those teams' goals? Passengers wanted a safe trip. The crew should have wanted to get them there safely, but they just wanted to maximize profit. The rescuers wanted to make it to the site quickly and save as many passengers as possible. You would think the South Korean government would want to save their people and prevent such a disaster from happening again, but unfortunately that was not the case. Their true goal was to save face on the international stage. We'll talk more about that in a second. Now, each of these teams had goals that were in opposition to another team's goals. 
And as the circumstances evolved, none of these teams had the ability to shift their priorities, to manage this change, to negotiate their priorities and evolve. And each team in the system saw the other teams as a detriment to achieving its own goals, rather than as parts of a system, as allies, as individuals worthy of consideration. In fact, the crew had never received proper safety training. So even if their goals had been aligned, they were not properly equipped to perform. Now, the next example from this horrible tragedy is an example of leadership failure and boundary reinforcing. When rescuers arrived on site, the assembled parties included the Japanese Coast Guard and the US Navy. When a ship sinks, often there will exist air pockets within the ship. If passengers can find them, they can survive for seven days, as long as they have food, or water, pardon me. The US Navy and the Japanese Coast Guard, and private citizens too, were on site and had the equipment necessary to conduct such a rescue. But due to South Korea's rigid hierarchical culture and their government's desire to save face, the teams that had the necessary equipment were not allowed to perform the rescue. It's an example of unwillingness to sacrifice the local goal, and a lack of emotional, and really life, support to the passengers, who just wanted to survive. In fact, throughout the crucial hours, then days, when those high school children trapped in that ship could have been saved, the South Korean government lied to the parents who had assembled to wait for news about their kids. They said that all the kids had been saved, despite that being quite far from the truth. So what do I hope the open source community will take from this line of scientific inquiry, from the lessons of the MV Sewol? Because folks, this ship is sinking. Our planet's ecosystem is failing. The climate is changing. 
I hope when projects, and especially leaders, see someone building something similar to what they're doing, they start to think: that other project is an ally. That other project is an ally, not a competitor. They think, how can we help each other? Not, how can I win? Or worse yet, how can I sabotage them? I hope maintainers who make the commitment to serve their community understand the commitment they're making and live up to that responsibility. Because remember, it's not a commitment you have to make. You can make something and choose not to maintain it, choose not to accept issues, not to change anything about it. But if you make that choice, I hope you live up to it. And I hope you respect your community and listen to what they need. I hope BDFLs, benevolent dictators for life, focus more on the benevolent part and less on the dictator part. I hope we can take better care of each other. So many maintainers and contributors out there and in this room are carrying so much weight and holding so much space for all of us. I hope we can do more to help them, or at the very least, I hope we can spare them kind words. I'm under no illusions here. I don't expect what I've said here today to do all that much. People have said a lot of what I said here many times before. But maybe, just maybe, I've touched one heart or one mind, and maybe that heart or mind will go out there and make a different choice because of what I said here today. Or maybe they'll speak up and share what touched them today with the next person when they see something wrong. Maybe, like our very first ancestor who looked up and reached, maybe we can make a little difference now that will make a really big difference for the people who come after us. Because the last 10 years, the platformification of the web, the enshittification of those platforms, that was not a new normal. That was a glance at a future that doesn't have to be. 
Our power as a community is in our principles and in our numbers. If we can convene, if we can coordinate, if we can collaborate, if we can take good care of each other and choose kindness every day, if our leaders stay humble and choose the greater good over their own enrichment, ego, or fame, we can change the course of this information age. We can change the course of history. But it will take all of us working together, and it will be damn hard work. The wonderful organizers of FOSDEM have given me this stage, so to close this talk, I will now issue a challenge, as if all of that wasn't already a challenge. From my perspective, and I'm speaking especially to our leaders, we must focus our collaborative energy and kindness on the following three areas. We must make the internet more efficient. We must make our code bases smaller. We need to reduce storage usage and duplicated requests, and reduce the distance data needs to travel. We are in the midst of an energy and environmental crisis. Half our world is drowning and the other half is on fire. And as the diaspora of people across digital social spaces continues, we must collaborate across the internet community to protect disadvantaged, disenfranchised, and marginalized people. When diversity and inclusion suffer, we all suffer. Our pursuit of knowledge, societal progress, and the advancement of humanity only succeeds when we are inclusive of all walks of life, of all creeds, of all religions, of all races, of all colors, of all communities, barring those who promote violence or enable hate. And we must protect science and knowledge. We must stand for the truth, not only from a geopolitical and societal perspective, but also on an individual level. We need to protect people and the systems through which we organize into collectives. We have to make the truth resilient. Whether you recognize it, or choose to identify as part of it, you are part of a movement. 
Whether you're doing this in your spare time as a passion or as a hobby, or if you're one of the lucky people who has found a company to pay you to do this, you are part of a movement. You have experience and passion, and you're smart as heck. We need you. And I believe in us. Thank you.
An engineer's guide to Linux Kernel upgrades
Thank you, everyone, for coming to my talk. My name is Ignat. I work for Cloudflare. Who here has heard about Cloudflare? Who's using Cloudflare? Should be more hands, by the way, because even if you haven't heard about Cloudflare, you're probably using Cloudflare one way or another. This is my first time at FOSDEM, so thank you very much for exchanging your lunchtime for my talk. I hope it will be really exciting. And today we're going to talk about Linux kernel upgrades: how you should do them, and, most likely, how you should not do them. So a little bit about myself. I do Linux at Cloudflare. I enjoy system security and performance, and I'm passionate about low-level programming: the Linux kernel, drivers, bootloaders, and other stuff written in unsafe programming languages. Okay, before we start, a little show of hands. What would you do in this case? Imagine you're working away on your laptop, you're doing stuff, and suddenly this pop-up comes in: updates available. What would you do? Install now? Who's for install now? Oh, nice. And who's for remind me later? 50-50. So, those people who raised their hands for install now: what if instead it wasn't your computer but a production system? Who would press install now? No, very few. But yeah, you probably like Bitcoin, right? Risky. And usually it's something like that for a production system, right? It's a difficult choice between remind me later and don't remind me at all, please don't install. And this is natural, I think, because it's connected to how we perceive software updates, especially for production systems. Well, we don't perceive them really well, right? We perceive software updates as these monsters: they come in, they're nasty, they're bugging you, an update can break your stuff. With the traditional engineering motto of, if it works, don't touch it, why would we need to install an update, right? Yeah. 
But the thing is, with regular software updates, we perceive them as monsters, but they're not really scary. They're kind of annoying and pesky, but not that much. When it comes to Linux kernel upgrades, however, it's more like this big monster trying to destroy the universe, right? And why is that? Again, it's natural, because we know how to deal with regular software updates. You have a service, it crashes once a week in production, how do we fix it? Well, if you use something like systemd, you just set a policy for it to be restarted, and the job is done, you can go home. Sure, you'll be restarting a service once a week and your service will be in a slightly degraded state, but you'll buy yourself some time to investigate and fix it later. When the Linux kernel crashes, however, well, technically, this is you. It's the end of the world, because you don't have any systemd to restart it. You don't have any metrics or understanding of why it happened. Your service is not reachable. No SSH to debug, nothing. It's indeed the end of the universe. And that's why we're usually scared of software updates, but when it comes to Linux kernel updates, we're scared even more. And this is why people avoid updating their Linux kernel for the most part, especially on production systems. But there are real risks if you don't apply software updates regularly, especially for the Linux kernel. The first one of them is that your bugs are not getting fixed. And here's some statistics. I will be talking about the Linux kernel release cycles a little later to introduce you to them, but this is basically a snapshot of all the bugfix releases of the stable kernel branch 6.1. The latest Linux LTS kernel is 6.6, but because it doesn't have as many releases yet, you don't get pretty graphs, so I decided to go with the previous one, 6.1. 
And what this graph shows you is the number of commits in each bugfix release on the 6.1 stable kernel. Again, I'll be talking about release types later in this talk, but at this point you should know that these bugfix releases happen roughly every week. And these bugfix releases are what the name says: they're only bug fixes. There are no new features and no subsystem rewrites, only fixes for bugs and security vulnerabilities. And as you can see, so far the 6.1 stable kernel has had 76 releases, and out of those 76 releases, there are 50 releases with more than 100 commits in them. So that means 100 bug fixes every week, in almost every release, really, like 80% or something, if I'm doing the math right. 20 releases, so 25-ish percent, every fourth release, so roughly every month, have more than 200 commits, and therefore 200 potential bug fixes. And there are these five mega releases with more than 500 commits in them. Actually, if you look at the graph, it's actually seven, but the last two barely made it to 500. But yeah, these are the mega releases with a lot of commits. So if you don't upgrade your kernel regularly, your system runs with all these potential bugs, and every week you delay, you're missing out on at least 100 bug fixes in your kernel. The second thing you'll be missing out on is potential performance improvements. This is a snapshot from Cloudflare production systems. At the time, we were using the 5.4 stable kernel and we started to evaluate the 5.10 kernel. So we did a half-and-half deployment to a set of identical servers: one half with 5.4, one half with 5.10. And this graph shows the average memory consumption per server, and you can see that on 5.10 we have much less memory consumption. And people were like, what did we break? What happened? And nothing bad happened, actually. So that was 5.4 versus 5.10. 
So we saved something around 5 gigs of RAM per server. At first we thought something broke, but when you dig into the mailing list later, you see that some other folks, in this case Facebook, now Meta, did some improvements in the kernel code and improved the memory management subsystem. And now you consume less memory for the same workload, with the same performance. So it's almost like downloading RAM from the internet. And you basically get it for free if you just apply an update; it's open source, right? And in recent news, for example, the latest LTS kernel is 6.6, and it's rumored to have a new scheduler in it. And there is a Phoronix article that says that if you're using Nginx with that scheduler, it will be much, much more performant. So you'll potentially get that for free as well if you move to 6.6. I don't have any pretty graphs here because it didn't make things better for us, but maybe for you it will. And looking forward a little to the next talk after mine, there will be some discussion, I hope, regarding security improvements with TPMs and the Linux kernel, and it will probably involve some code, and you only get it if you upgrade. So let's look at the same data, but from the point of view of the accumulated change delta. This is basically the same data, the number of commits per release, but accumulated: it shows the number of commits since the initial release. And on this graph, you can easily calculate a change delta. For example, if you're on the 6.1.10 bugfix release and you want to upgrade to 6.1.20, the change delta is 1,762 commits. 
And if you assume, which would be natural, that the number of changes is proportional to risk, then these are 1,762 bug fixes you're running without, so the amount of risk you're taking by not upgrading is proportional to that number. Now let's say you wanted to upgrade, but for some reason you decided to delay. Maybe it's the end of the quarter, or you had a big incident, or your company just got a big contract, so you decided not to change anything to be more stable for the time being, and you postpone the upgrade. When you actually decide to upgrade now, you're upgrading from 6.1.10 to 6.1.30, which means you just extended your not-upgrading time twice. And you might naturally think that your risk grew 2x, but if you calculate the difference here, you may see that in some cases, with a 2x postponement, 2x the time not upgrading, your risk can actually grow higher: now your risk grew 2.21x. So the risk of not upgrading systems may sometimes grow faster than the time you spend not upgrading. So yeah, for a 2x delay in upgrading, we get 2.21x more risk of hitting a bug. And if you're not upgrading, security vulnerabilities are not getting patched. This is a similar graph, but it now shows only publicly known CVEs patched in each bugfix release. This data is actually crowdsourced, so it might be incomplete, but even from this you can see that out of the 71 releases for which data is available right now, 56 releases, again almost 80%, have at least one CVE patched. And there are 18 releases, again around 25%, with more than five CVEs patched. So again, if you're not upgrading your kernel regularly, you're running not only with security vulnerabilities, you're running with known, publicly known security vulnerabilities, for which most likely an exploit is available somewhere on the internet. 
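The change-delta arithmetic above can be sketched in a few lines. Note the per-release commit counts below are invented for illustration (chosen so the totals happen to match the 1,762 and 2.21x figures quoted in the talk); the real counts would come from the stable tree's git history.

```python
# Illustrative sketch of "risk grows faster than the delay itself".
# The commit counts per bugfix release are made-up example numbers.
commits_per_release = {
    # release index on the 6.1.y branch -> commits in that bugfix release
    11: 150, 12: 180, 13: 120, 14: 200, 15: 160,
    16: 170, 17: 190, 18: 210, 19: 182, 20: 200,   # 6.1.10 -> 6.1.20
    21: 180, 22: 200, 23: 190, 24: 220, 25: 210,
    26: 230, 27: 240, 28: 210, 29: 222, 30: 230,   # 6.1.20 -> 6.1.30
}

def change_delta(start: int, end: int) -> int:
    """Total commits skipped by staying on 6.1.<start> instead of 6.1.<end>."""
    return sum(commits_per_release[r] for r in range(start + 1, end + 1))

delta_short = change_delta(10, 20)   # wait for 10 releases
delta_long = change_delta(10, 30)    # wait twice as long
# Whenever later releases carry more commits than earlier ones,
# doubling the delay more than doubles the accumulated delta.
print(delta_short, delta_long, round(delta_long / delta_short, 2))
```

Running this prints `1762 3894 2.21`: a 2x delay, a 2.21x bigger delta.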
Not patching your security vulnerabilities also puts your compliance at risk. If your production systems are subject to some sort of compliance, you have a required time within which you should be patching these vulnerabilities. For example, PCI DSS compliance, which applies to most payment systems, says that critical or high security patches or updates should be applied within one month of release. So imagine there is a publicly known security vulnerability in the Linux kernel, and you have one month to fully roll the fix out to your production systems. Who here knows about Equifax, and what happened to it? A few hands. So it wasn't about the Linux kernel, but Equifax was running an old version of the Apache web server, unpatched, with known security vulnerabilities, and people used an exploit on their system and exfiltrated some data. And it was a big mess. It was really expensive for the company. It cost its reputation as well as a lot of money: compensation, a lot of lawsuits, so very, very, very bad. Which brings us to a not-so-fun fact. Remember in the old days, in the 2000s, when you'd go to admin forums and people were boasting about how stable their servers were, posting their uptime? Like, my uptime is two years, three years. Well, since applying Linux kernel updates requires a reboot, that's not cool anymore. If your uptime is more than 30 days, you're most likely vulnerable and not compliant with something. So now let's talk about anti-patterns for Linux kernel releases. If you're managing a production system, for most software updates there is some kind of change management process, or well-understood practices, which sysadmins, SREs, and engineers apply to manage change. But most of them, unfortunately, do not apply to the Linux kernel. So when you want to update your production system, oftentimes, for a software update, the change management process will ask you why. 
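The one-month patch window above is easy to turn into a concrete check. A minimal sketch, with hypothetical dates and a 30-day window standing in for "within one month of release":

```python
# Minimal sketch of a patch-window compliance check in the spirit of the
# PCI DSS rule quoted above. The dates are hypothetical examples.
from datetime import date

PATCH_WINDOW_DAYS = 30  # "within one month of release"

def days_out_of_compliance(released: date, today: date) -> int:
    """Days past the 30-day patch window (0 if still inside it)."""
    overdue = (today - released).days - PATCH_WINDOW_DAYS
    return max(0, overdue)

# A fix released on Jan 1 and still unpatched on Feb 15 is 15 days overdue.
print(days_out_of_compliance(date(2024, 1, 1), date(2024, 2, 15)))  # -> 15
```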
Why do you want to update, and which things from the changelog of this new version are applicable to us? Are we really fixing bugs that are hitting us? Are we really fixing CVEs that are applicable to us? Well, this doesn't apply here, just because of this graph: remember, these bugfix releases happen every week, with most releases having more than 100 commits, so it would mean that every week you should be going through all the commits and trying to understand if each particular fix is actually applicable to your system. Doing this is very expensive. You need a huge team of really good Linux kernel experts to understand whether, say, this off-by-one thing in the memory management subsystem is actually triggerable by your workload. So if you do go this way, mostly you'll be doing something like this: you will just be continuously stamping releases for no particular reason, with no real analysis. The same goes for security vulnerabilities. You say, we have five CVEs we need to patch due to compliance, and then somebody may ask the question: is the security vulnerability actually exploitable in our systems? Do we use that subsystem? Sometimes it's an easy answer: if it's in a driver for I2C and you're on a machine which doesn't have I2C, then you can say no. But most of the time it's much harder, and many successful exploits are not some kind of high-severity, big vulnerability. Sometimes attackers manage to chain smaller vulnerabilities properly to get an exploit. So going back to this question: if you really think about it, who can answer it? Technically, this question can be answered by the attacker, because if the attacker has the list of the CVEs present in the system, they're highly motivated to break into the system, and this is their bread and butter. They spend 24/7 designing and implementing successful exploits. 
But unfortunately, you're not asking this question of the attacker; you don't know who they are, right? You're asking a security patch reviewer. You go to some team of security people and ask, is this vulnerability applicable? And they're highly motivated to go home on time, right? They need to review several patches a day, not only from the Linux kernel but from many other subsystems, and do other stuff: security architecture, compliance, many things. So you're asking this person, and the quality of that answer will not be great. They will say, maybe yes, maybe no. So the best course of action is just not to ask this question, and to assume that every CVE is applicable to your workload and patch it. Well, one of the traditional approaches to upgrading stuff, especially the Linux kernel, is soaking. Let's put it in a canary somewhere and soak it for one month to ensure we don't hit anything. But this brings you back to the change delta: by soaking it in a subset of your production, you're not releasing everywhere else, so you start accumulating change delta, and therefore your risk of hitting a potential bug by not upgrading grows. Same with security vulnerabilities: if you're soaking it somewhere, you're not patching CVEs in the rest of your production, and you have the risk of being hacked. And with a one-month soak time somewhere in a canary, you're probably already violating some compliance requirement which dictates you have 30 days to roll out everywhere. But what does high soak time mean in practice? It usually means we just don't know what we're looking for. What it translates to is: we don't have any success metrics or observability into how our kernel performs. Is it performing the same way after the upgrade as it was performing before? We also don't know our workload. My team gets the same question from many teams, right? 
Will the kernel break my software? But for every team, the subsystem of interest is different. A database team is mostly focused on I/O and file system performance, but some image processing team mostly cares about CPU scheduling and CPU performance. The question should be: I'm interested in this particular subsystem, will it break my workload, an I/O-bound workload or a CPU-bound workload, or I'm interested in some hardware, or networking as well. And probably it indicates a lack of sufficient production kernel testing. For the Linux kernel, you can also ensure that an update doesn't break someone's workload if you write a particular unit test or integration test. The Linux kernel has this nice suite called kselftest, which is easily extendable. If you care about a particular feature in the Linux kernel, or a particular behavior, you can easily write a program which exercises that behavior and verifies that each upgrade keeps that behavior. Even though the kernel itself is written in C, you can write these tests in any programming language, even scripts. Sometimes you just get: yeah, whatever, the kernel is just too critical, let's have more approvals before we deploy. Regular software requires one approval and the Linux kernel should require two or three approvals. And again, this is related to the fact that we perceive the kernel as, you know, a big scary monster which can destroy the universe. But what if I told you that kernel deploys are inherently safer than any other software? Would you believe me? Who believes? You're in the matrix, yes. We learned it the hard way, actually, at Cloudflare. So this is a map of Cloudflare data centers around the world. It's maybe even outdated, but the gist is, yeah, we have a lot of data centers around the world. And with regular software, how do the updates happen, from a 1000-foot view perspective? 
So engineers update the software package and push it to our package registry. Then the config management picks it up and downloads the new package. The config management may also be configured to restart the service which uses the package. It can be graceful or non-graceful depending on the context; it doesn't matter. But the gist is: new code, bad or good, can propagate through all this network without proper safeguards in minutes. And Cloudflare learned this the hard way. We had several bad outages where we didn't have proper safeguards for staged rollouts of some software, so we almost caused global network outages, and these are described in these blog posts. On the contrary, how does a Linux kernel upgrade work? The gist is, it requires a reboot. So to reboot the server, what we do is drain traffic from the server, put it out of production, and actually reboot it. Then it comes up, it contacts our config management, we wait for it to be reconfigured, we run some basic acceptance tests, and we put the server back into production. And I mean, we would be crazy if we rebooted everything at once, so we don't. We have automation rebooting servers one by one or in batches. So what it means is, it's an inherently natural, slow-paced, gradual rollout with minimal impact if things go wrong. Did we release kernels with bugs? Yes. Some servers didn't come up properly, some servers started showing errors, but there were only a couple of servers. So we reverted the release and there was no visible impact. One reason why people are afraid of rolling out kernel releases is that they don't understand how the kernel release process works. So kernel versions are designated by three numbers separated by dots, for example 6.1.32. Who here knows about semantic versioning? Almost everyone. So the gist of this part of the talk is: this is not a semantic versioning scheme. 
Everyone confuses this with semantic versioning, and it's not. Instead, the first two numbers together mean the major version, not major and minor as in semantic versioning. And the rightmost number means bug and security fixes. When the rightmost number increments, you almost never get new features or major subsystem rewrites. It's only bug fixes and security fixes, nothing else, no new functionality. So how are these releases created? The main bleeding-edge source code is stored in a git repository managed by this person. Who knows this person? We call him the benevolent dictator, right? So, yeah. The features are developed in branches, subsystem branches. So for example, you have subsystems for drivers, memory management, and so on. And once in a while Linus pulls changes from these branches. This is probably where the term pull request came from. The original pull request was not like the fancy PRs that we have now, but an email saying: hey Linus, can you pull from my branch? This was a pull request. And it still is, actually, in the Linux kernel. So Linus pulls all these changes from subsystem branches. And once in a while, he branches the main branch out into stable branches, which designate a major stable kernel release. This happens roughly every nine to ten weeks. Eventually, when bug fixes get accumulated, you get a tagged version on a stable branch, which indicates a bug fix release. So for example, you get 6.2.1. But how do these bug fixes get propagated there? If you have a bug, you do not submit a fix directly to a stable branch. Instead, you actually have to go through the respective subsystem maintainer, to ensure this bug is fixed not only in the stable branch, but in the main branch and all other branches. So you actually commit your bug fix to the particular subsystem where the bug is, and it will eventually get propagated to the main branch. 
But once it's in the main branch, it's not just merged into the stable branch. These bug fix commits are specially marked, and the maintainers of the stable branches (the stable branches all have maintainers) basically cherry-pick these bug fixes. And when enough bug fixes have accumulated, they do another bug fix release, which happens roughly every week. So yeah, a new major stable kernel is released every nine to ten weeks, and there is the so-called merge window where new features get merged. The merge window is usually only two weeks, and the remaining seven weeks are for testing and bug fixing. So even the major version receives a lot of bug fixing and testing in the first place. And what you have to remember is that the leftmost number means nothing. At Cloudflare we had this problem where, at some point, when we upgraded from 4.9 to 4.20, it was fine. But when we wanted to upgrade from 4.20 to 5.0, people were like, oh, the leftmost number is changing, it's probably really scary. No, it's not. It can even have fewer features than the previous major release. Linus himself says that he just increments the leftmost number when he runs out of fingers on his hands and toes. But for whatever reason, sometimes he increments it when the middle number is 19, sometimes it's 21, and sometimes it's 20. So apparently he has a variable number of fingers. Yeah, and bug fix or patch releases come out roughly once a week. They are denoted by the rightmost version number. They're cherry-picked from the main Linux branch, and the rule is: no new features. Therefore, regressions are quite rare. They almost always contain critical security patches, and you almost always want to apply them. Well, the problem with major kernel upgrades is that a major stable branch is kept alive for around two or three months, and then it's abandoned. It's declared end of life, and no new bug fixes or security patches are backported to it. 
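As a recap of the numbering scheme just described, here is a small sketch (my own illustration, not an official tool) of the decision rule the talk suggests: if only the rightmost number moves, it's a bug fix release you apply without question; anything else is a major jump, regardless of whether the leftmost digit happens to change.

```python
def classify_upgrade(current, target):
    """'bugfix' if only the rightmost number moved, else 'major'.

    Kernel versions are NOT semver: the first two numbers together are
    the major stable release, the third is the weekly bug/security fix.
    """
    cur, tgt = current.split("."), target.split(".")
    cur += ["0"] * (3 - len(cur))   # treat 6.2 the same as 6.2.0
    tgt += ["0"] * (3 - len(tgt))
    if cur[:2] == tgt[:2]:
        return "bugfix"             # e.g. 6.1.31 -> 6.1.32: apply it
    return "major"                  # e.g. 4.20.x -> 5.0: the same kind of
                                    # jump as 4.19 -> 4.20, nothing special
```

Note that by this rule 4.20 to 5.0 and 4.19 to 4.20 come out the same, which is exactly the point the speaker makes about the leftmost number meaning nothing.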
And the assumption is that at this point you will have a new stable major version available, and you should just upgrade to that major version. But sometimes it's very costly to evaluate a major version, because you do get new features and potential regressions. For this, there are the so-called long-term stable releases, where bug fixes and security patches are backported for at least two years; it's usually the last stable release of the year. So an LTS release comes out once a year, and if you follow these, which we do, for example, it gives you enough time for a more rigorous evaluation of the next long-term release. And surprisingly, the releases are quite well described on the kernel.org website, slash releases. I was surprised how many people don't go beyond the main page of kernel.org to read stuff. So yeah, go and read it, it's quite interesting. Okay, so what do we do for safe and easy production kernel upgrades? First, don't create a dedicated deploy procedure for the Linux kernel, because kernel upgrades are usually less risky than other software. Who's been convinced today? Well, some hands, okay. A simple staged rollout is usually enough, and kernel upgrades are naturally slow-paced because they require a reboot. And because you probably won't reboot everything at once, there is a lot of headroom to abort the deploy if things look wrong. Do avoid justifying bug fix kernel upgrades; apply them with no questions asked. There is almost always something that is applicable to your workload, and they contain only bug fixes and security patches. And also minimize canary soak times and prefer a metrics-driven approach, so you can fit into this 30-day window of getting your production kernel everywhere. If you require a high soak time, think about it: what metrics or observability would give you more confidence to roll out this kernel faster? 
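The staged, abortable rollout and the metrics-driven canary idea described above could be sketched roughly like this. The server names, metric names, and the 5% tolerance are illustrative stand-ins of mine, not Cloudflare's actual tooling.

```python
def metrics_ok(baseline, candidate, tolerance=0.05):
    """True if every workload metric stays within `tolerance` of baseline."""
    return all(
        base == 0 or abs(candidate[name] - base) / base <= tolerance
        for name, base in baseline.items()
    )

def staged_rollout(servers, reboot_and_measure, baseline, batch_size=1):
    """Upgrade servers in batches; abort the release if metrics drift.

    `reboot_and_measure(server)` stands in for: drain traffic, reboot into
    the new kernel, run acceptance tests, and return post-upgrade metrics.
    Returns (upgraded_servers, aborted).
    """
    upgraded = []
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        for server in batch:
            if not metrics_ok(baseline, reboot_and_measure(server)):
                return upgraded, True   # abort: rest stays on the old kernel
        upgraded.extend(batch)
    return upgraded, False
```

The point of comparing metrics rather than waiting out a fixed soak period is that the rollout can promote the kernel as soon as the data says it behaves like the old one, instead of sitting in a canary for a month.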
Stay on the long-term branch if validating a major version is costly, if you have to do a lot of analysis and testing. You get at least two years of bug fixes and security patches, but don't wait for the full two years, of course. Better, what we do, for example, is start evaluating the next long-term release early, as soon as it's available. Apart from just being proactive, it gives us new features early and, most of the time, better performance and resource utilization. And we also don't accumulate too much change delta, as I described before. If you don't have it, implement and improve production testing for major version validation. Basically, safely upgrading the kernel requires you to understand what your workload is. If you're a web server or a database, which specific subsystems does your workload target? Because sometimes even a bug or an improvement in CPU scheduling does not apply to databases. Once you understand your workload, it's better to write tests which exercise the kernel subsystems and interfaces required by your workload. Having these tests also really helps with communicating issues to the upstream community, because at Cloudflare our team is quite small and we're not experts in everything, and I would highly doubt that anyone really experienced in the Linux kernel, including Linus himself, could be an expert in all the kernel subsystems. For example, at one point we had a bug in KVM, and we knew nothing about KVM, but we had a reproducible test which triggered the bug. We spent like two weeks trying to understand what was going on, and we couldn't, but since we had a reproducer, we just posted it to the upstream mailing list, and there's always a person saying, oh yeah, here's a fix, in ten minutes. But you have to create this reproducible, self-contained test for people to actually be able to help you. And yeah, make metrics-driven decisions about whether to upgrade, not time-based decisions. 
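A self-contained workload check in the spirit of kselftest, as described above, might look like the following script. The sysctl path and expected value are hypothetical examples of a behavior a team might depend on; the exit codes follow kselftest's pass/fail/skip convention (0, 1, 4).

```python
KSFT_PASS, KSFT_FAIL, KSFT_SKIP = 0, 1, 4   # kselftest exit-code convention

def check_sysctl(path, expected, read=None):
    """Return a kselftest-style exit code for one expected kernel setting."""
    read = read or (lambda p: open(p).read().strip())
    try:
        value = read(path)
    except FileNotFoundError:
        return KSFT_SKIP   # feature not present on this kernel: skip, not fail
    return KSFT_PASS if value == expected else KSFT_FAIL

# Hypothetical usage: insist that unprivileged BPF stays disabled after
# every upgrade, e.g.
#   sys.exit(check_sysctl("/proc/sys/kernel/unprivileged_bpf_disabled", "2"))
```

Running a handful of checks like this after each reboot is one concrete way to turn "will the kernel break my workload?" into something the rollout automation can answer.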
One thing that metrics and monitoring, and also automating your kernel releases, help with is human risk perception. Sometimes when new people join your team, they still have this mentality that Linux kernel upgrades are very risky, and if you require a human to approve and perform these upgrades, they will always be reluctant to do it. Automation really helps here to remove the human risk-perception factor, because these days, especially at Cloudflare, many teams are not even aware that kernel upgrades are happening. They're happening under the hood automatically, and people don't notice it, and you don't have to ask anyone whether you should upgrade, because you have this more or less, not perfect, but more or less data-driven approach. And I think that's everything I wanted to talk to you about today. So again: Linux kernel upgrades are not more risky than any other software. You need to patch early and patch often, your bug fix kernel releases should be applied with no questions asked, and understanding your workload, metrics, monitoring, and automation will allow your system to stay patched and secure in the long run. Thank you very much. May I ask something? I know where that fear... it's a fear that we all have, I guess, and it comes from things like, I can just tell one story. So you have like a 5.4, and it's working fine, and you have some kind of special, maybe, chipset, and the kernel doesn't support everything that chipset can offer, but it runs fine. So you upgrade to a newer 5.x or 6, and it starts to crash. And then you roll back, and then next time you will really think twice about whether you will upgrade to the next version, which will offer you more support for that chipset, but you still don't know. 
Then you wait for others to upgrade, to be sure that it's working fine now, and that's why you don't rush to upgrade really fast. Then it's: let me see if this one and that one did it, and it's running fine. These things build fear, you know, that's what can build fear, and that's why it's always good to wait a bit more until five of them do it, and then, okay, I can see it's running fine now, I will do it now. Well, I mean, based on our experience, I get this same question from our production engineering team many times: why do we rush to upgrade? Why don't we wait until all the bugs are fixed and then upgrade? And I guess it depends on your workload, but for us specifically, I sometimes describe Cloudflare as Linux as a service, because many of our products using Linux are stretching the Linux kernel to its edge. If there is a new feature in the Linux kernel like XDP or io_uring, people jump on it and adopt it almost immediately, and the result, because we use these edgy features which many people don't use, is that there is no one to fix these bugs for us: we're hitting them first. So we tried waiting, and while we were waiting, we were still hitting the bugs, because nobody else was using that feature in this way, and this is where you just can't wait. I guess it's the same with very specialized CPUs or hardware: if nobody else uses this hardware, you can't wait for the community, for someone else, to fix your bugs; you have to push them through yourself. Of course, when you see the bugs, it's always helpful to report them, and there will be some people on the mailing list who within a moment will send you a one-liner patch to try out, and usually it works out. But generally, if your workload is specific enough, or your hardware is specific enough, you can't just wait for all the bugs to be fixed, because they're applicable only to you. Okay, good day. 
I wanted just to emphasize your position that Linux is safer to upgrade than almost any other software, and to me the main reason is the strong commitment from this community to ensure that all the stable releases are safe to upgrade. And I know very few other pieces of software that take this contract with the users, to say you can upgrade safely. And I think this is a major point, and I think the Linux community should be recognized for this, because it puts a lot of work into ensuring that we are safe to upgrade. That's something very important. More than the rollout points you are leveraging, it's much more because of these strong contracts that ensure every stable release is safe to be used. Yes, you mean you're referring to the don't-break-user-space mentality? Or even: don't take a patch which is not already in mainline. I mean, if you get your patch into the stable tree, it's because it has been tested and proved to be safe, and so the sum of all these patches is supposed to be safe. And this strong commitment is very important, I think, for the users. Yes, yes. They can base their work on it. Yes, yes, yes. And many times when you submit patches, there are tons of external people, or systems, that run your patch in a kind of CI, and they will report if there is a regression. Yes, I guess you're right that we have to acknowledge that the community puts a lot of effort into these stable releases to make them actually stable. But also, the release process itself goes a long way. Technically, again, you have only two weeks to merge new features and then you're stuck with seven weeks of bug fixing. So yes, the emphasis on stability is a real win, I guess, for this community. And another thing: the sum of security issues is not only counting the CVEs. Greg made a great presentation around that. If there is a CVE, there's probably a security issue. But there are also fixes which are not tagged as CVEs, which could still be security issues. 
So, to evaluate the security risk of a given version, it's not only counting the CVEs; it's much more complex than that. Yeah, I agree with that. And this is what I partly mentioned: that data is crowdsourced and probably incomplete. It's kind of the minimal baseline of risk. But there is more, of course. These are the publicly known vulnerabilities which have been tagged on this project; there are a lot which land with no CVE attached, as well as a lot of unknown security vulnerabilities hiding in the system. So, yeah, definitely. Anyone? Hi. Here. I don't see. I'm here. Oh, okay. Hi. I have a question about livepatch. Do you use it in your company? Livepatch, we don't use livepatch. And my personal view on this: I don't fully see livepatch technology covering all the use cases. I think it is useful for patching vulnerabilities really fast. Yeah, yeah. But only a particular type of vulnerability. Yes, yes. With livepatch, you're basically replacing a piece of code in the kernel with another, patched piece of code. But we have to remember that the in-kernel API is not stable, and basically you can only do that if your patch doesn't require changing some kind of structure. It may fall apart if you're required to add a mutex into a structure because you have a race condition. And this is where livepatch fails. Moreover, implementing livepatch is very complicated, and you can crash the system as well, because you're messing with the kernel code. So, in my opinion, the effort is kind of not worth the return on investment. If you don't have a company, like an enterprise Linux distro, doing it for you and you're doing it yourself, you're putting in a lot of effort, you can't patch all the security vulnerabilities with it, and you don't get much benefit. 
If you instead just focus on building a system where you can reboot anything at any time, that gives you a much better long-term result, because you can just reboot with a new kernel and, you know, your system is resilient to that. And it takes about as much effort. Thank you. Hello. Thanks for your detailed explanations, and for outlining that the version numbering doesn't actually work the way we think it does. Now, I have questions. So, you mentioned that we usually install the rest of our software from some upstream that we don't have control over. And actually, I do that for everything, including the kernel: I don't usually compile it myself. So, the question is: should we be aware of particular tricks? Because this process is actually mediated by the distribution. Like, do the people who do the distributions know all the stuff you mentioned? Yes. And actually, the model which I described, following the LTS release and rolling out bug fix releases regularly, is what most distributions actually do. You might not see it because, for example, Debian versions the package differently, so you think you're always on the same version. But you may notice, if you're doing a regular apt-get upgrade, that when a new Linux kernel is installed, it actually installs a new bug fix version, which is hidden under the hood. So this is what most distributions do: they either follow LTS or they take a non-LTS branch and maintain it for longer. But when you upgrade your system, you just get bug fixes and security vulnerabilities patched via this bug fix release. Hello. I'm still not completely sure how the kernel process works. How about firmware that's just dropped into the kernel? Is that included in those bug fixes? And if so, how is that done? How are you ensuring that those binary blobs don't change something that breaks everything? 
So, in modern distributions, and within the Linux kernel upstream as well, the binary blobs are now managed separately. They're managed in a separate git repository, and in distributions there is a separate package for them, usually called linux-firmware. So basically, the code for the kernel and the binary blobs are upgraded at a different cadence and have different release procedures; they are not included in the kernel code upgrade these days. Hi. Over here. Yeah. So, you were talking about the fear of upgrading kernels, but to me, or when I'm looking at my team, sometimes it's more the tedium of the task: having to reboot or to migrate the service, and then, you know, doing it over and over like Groundhog Day. Now, my question is, what would you consider a reasonable cadence for that task? Do you see a need for the whole system to align on a specific kernel, or just having some routine monthly maintenance that jumps a few versions? What's your take on that? So, again, for bug and security releases, my preferred cadence is weekly. They're released every week; you have to compile one and roll it. I mean, not roll it out everywhere at once, but start its rollout on some set of production, then more and more. And again, basically, the more you delay, the more change delta you accumulate, and the more risk you're taking on. So if you do it as regularly as possible, your change delta is small. And technically, within a couple of bug fix releases, even if something breaks for your particular service, you can kind of bisect it and understand what's happening much more easily than if you have to go through, you know, thousands and thousands of commits. So, if it's hard, you have to think about how to make it easier and how to do it more often. 
It's like the gym: the more often you do it, you build that muscle, you build the tooling around it, you build the metrics and observability around it, and eventually you build your confidence, so that it becomes very fast and effortless to actually do it much more often. Yeah, my question is mainly about the time spent, the time that you spend, you know, managing that as part of your day-to-day. Well, again, it's a basic calculation of return on investment, right? If a kernel upgrade is too costly, in terms of you spending a lot of time doing it, think about whether you can invest this time into building some kind of automation. And that's what we basically did. When I joined the company eight years ago, it was very manual and time-consuming, and it required a huge team of SREs to actually do a kernel upgrade, but now they're not even involved anymore. It just happens. Thank you for the interesting talk and the nice presentation. Thank you. Thank you very much.
The D Programming Language for Modern Open Source Development
Hello. All right. Great to see everybody. I see some familiar folks here. Just a quick show of hands: how many folks have heard about the D programming language? Oh, wow, awesome. Keep your hand up if you've used the D programming language or tried it out. Okay. Yeah, I see you there, Dennis. Yeah. A few other folks here. Great. This is perfect. You're in the right space. We're going to have a lot of fun today, and I'm going to give you an introduction to the D programming language. I'm not going to show you everything, because D is a really large programming language, but hopefully enough to get you excited, and ultimately to show you some open source projects where you can get some inspiration. So let's go ahead and get into it. So, it's been six years since my last FOSDEM talk. I just want to thank the organizers for inviting me back and letting me talk again. The goal today is just to have fun. You can kind of sit back, relax, have a good time, and just learn about what I think is a really interesting programming language that's expanded my mind as far as how I think about programming. With that said, hopefully I'll come back sooner than every six years. A little bit about me: my primary role is teaching. I'm an associate teaching professor, so I love teaching stuff. I do teach the D programming language; I'll talk about that towards the end, or give you a reference for it. Otherwise, I'm really interested in all sorts of performance and systems stuff. Again, you folks are my crowd, so I'm really excited to be here with you. And with that said, here's the abstract, of course, that you read and that led you here, again, to get you excited about the D programming language. Any code that I have for the talk will be linked here; if it isn't already, I'll post it shortly after this talk. All right. 
So again, what I want to do today is get you curious about a really, really cool open source project. That open source project happens to be the D compiler. In fact, all the D compilers, as we're going to find out, have their source code available. So how cool is it that you can actually look at a programming language that's been around for quite some time and see some really awesome work by some really smart engineers? At the very least, I hope it's exciting for you that you will have some place where you can look, or send other people to look, and see how optimizations are done, or how code is written and organized. Again, I think that's in itself very interesting. And maybe one day you might find yourself contributing to this compiler, this ecosystem, or find inspiration elsewhere for using this programming language. And my secret dream for you, if I do a good job during this talk, is to get you excited enough to say: yeah, I'm going to contribute. There have been some awesome videos on how to do just that. Again, a lot of the open source projects that we've seen today and will see tomorrow have these resources, so I just want to point out that those are available as well. So again, it's really cool to look through the source code of the D compiler, which is a very, very, very fast compiler for the D programming language. Okay, so with that in mind, with my interest out there on what I want you to get out of this, or maybe get excited about, whether you're a student, a practitioner, or somebody in industry, we'll continue moving forward here. And as I'm talking about this, I do want you to know that I'm a bit of a programming language enthusiast myself. I love using different programming languages. This has been a problem for me since I started programming: always looking around and kind of moving to different languages, seeing what was new, what kinds of features there were. And honestly, I think there is some value in that. 
You get to see how different languages approach things. Actually, we were just at a previous talk, on the Hector script, talking about the actor model and immutability, how parallel processes are organized. I think there's a lot of value in taking away some of those core concepts from different languages. So what I've been doing lately is, every few days now at this point, I've been just turning on my camera for an hour and live streaming myself learning a programming language for the first hour or so. And you pick up interesting things from different languages. But just to be clear, the languages that I use professionally and teach most are C++ and the D programming language. I'm always kind of thinking in terms of: oh, you know, Golang does it this way with its defer statement and D has scope; or, oh, there's message passing in this language and this is how you do it in D. So it's been a really interesting sort of experiment going through this process. And you end up thinking in the language that you ultimately use; you kind of rewire your brain a little bit sometimes. So that could be something kind of curious: again, looking at new languages, looking at languages that are popular, looking at languages that are maybe not so mainstream as well. At the end of the day, what I hope one of your other takeaways will be is, you know, as we know, sometimes it doesn't matter what the language is. It's going to be what gives you a competitive advantage, what is fun for you to build software in, what is, you know, the tool that you can use to create something. So my goal today is not going to be to convince you that one programming language is better than another. Even as I look at those programming languages, I try not to do that. I'm smarter than that. I think I am. We'll see if I slip today. You know, we sort of like our programming languages and get used to them, right? We have our favorites. 
But again, I do want to share my enthusiasm for D, why it stands out, and why you might also have fun with it. So with that said, we're going to do that same little experiment that I've been doing: just turning the camera on for an hour, looking at a programming language for the first time, and investigating some interesting parts of it. I hope that will get you curious about the different parts of the D programming language and, again, get you excited. And maybe, just maybe, if I'm successful, and I looked around and saw everybody who raised their hand and who didn't, we'll see more hands raised, what was it, six years from now when I'm invited back. So anyway. All right. So I'll show you a few cool projects for inspiration. Most, if not all, are open source. The only ones that aren't are the scripts that I haven't put in my GitHub repo yet, so that will be true by the time this talk is posted. And all of them have something that you can learn from, a specific feature. I'm a big proponent, again, my background being in teaching and some industry, that we need to read more code as we're learning as well, because there are lots of smart engineers, you folks, writing that code, and I want to learn from you. So with that said, we'll look at these projects, all in the D programming language. So let's go ahead and begin. I'm going to start with something cool made in D; why not get some inspiration to start this talk off? And here it is, a project that's built in the D programming language: Tilix. How many folks have used this terminal emulator? Yeah, I'm seeing a few hands go up. This is something I like to do occasionally: download and try out different ones. But to my surprise, when I actually looked at the source code... one of my students actually told me Tilix is built in D. I didn't know that. So that was really cool, what you find sometimes in the wild. But again, oftentimes as a user, you don't really care. It's just a cool piece of software as an end user. 
But you get to see, as a practitioner, some of the cool tricks they do. So along with just showing you some different tools that have been built in the D programming language, I think it's important to say, well, why do we care to look at this closer? So with all these slides here, again, I'm not going to ask you to read these or click on all the links; the slides will be available. But what you might be curious about with this particular project, what's interesting to see, is that, well, it's something that's very visual. And if you dig into the source code, it's using the GTK libraries, and those are C-based libraries. So how does D interface with C code? Well, the answer is D actually does a really, really nice job interfacing with C code. So if you are C programmers or have been using C, you can basically call your C functions directly in the D compiler. Easy as that. Now, of course, there are bindings and wrappers and other things that folks do with the D programming language. But that's nice. You get a head start by being able to use some of your C code, or even C++ and Objective-C; there are ways to squeeze stuff in. So I thought that was very neat, just looking at the main app file from this particular program to see the different libraries that they were bringing in, and whether it was just straight C code or a library. Some other neat things, and I'm just going to trickle in some details about the D programming language as we go along here: there is something called ImportC, which is, well, effectively a C compiler built into D. So you can, on the command line, like you would with whatever your usual tool was, type in the compiler, dmd, your D source files, and your C source files as well. So that's kind of neat there. Again, it's just giving you a head start if you're going to consider migrating to a different programming language, which is a big decision to make if you already have some open source project. All right, so that's Tilix. That's kind of a fun one.
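To make that concrete, here's a minimal sketch of the kind of direct C interop being described, using the C standard library's `atoi` rather than anything from the Tilix/GTK codebase:

```d
// Declare the C prototype with extern(C) and call it directly;
// no binding generator needed, and libc is linked in by default.
extern (C) int atoi(const char* s);

void main()
{
    // D string literals are null-terminated and convert to const char*.
    assert(atoi("42") == 42);
}
```

In practice you'd usually import the ready-made declarations from `core.stdc`, but writing the prototype by hand shows how little ceremony the interop requires.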
I'm learning about how D can also play with C code. Okay, so let's just get a first impression of the D language. Again, pretend you're doing this experiment that I'm doing. You go into Google, you type in Dlang, and you go to the homepage, dlang.org, and what do we see here? We'll actually see something that looks like this. I'm going to give everybody a minute or so to just look at this piece of code. There's a sample code there. And then I'll ask for some participation; we can make this interactive in the afternoon. But just take a look at this and let me know what you think it does or what's interesting. I'll take hands and get some volunteers here. I'll give everyone a minute to think about that. What's popping out there, folks? Raise a hand and shout something out. Yeah. So the few things I see: on the first line, the import is local to main. On the second one, there is an object-style notation for a string. On the third one, there is an enum for an array, so that one I don't get. Then there is this immutable keyword, which is interesting because it's doing a mutable operation on A, but then B is immutable, I guess. And msg is apparently a pragma that you send to the compiler, so I suspect it emits something at the end of compilation. Okay. How much did I get? Yeah, so we got a good stab at it. I saw other hands going up here. There was one actually right behind, if you wanted to share. Yeah, it could be the same thing, or to add on: it really looks like a C plus plus plus plus, so I don't know why they gave it the name D; they could have just kept going with the pluses. In a way, it's really kind of easy to read. If you know anything of C or its family, you can easily jump in and just do it. Yeah. So immediately when we're looking at the programming language, just to recap, we see it's sort of a curly-brace, C-style, ALGOL-style language, right? So we can kind of read it if we know C or C++, Objective-C, whatever. And it does look like a C-plus-plus-plus kind of language.
We'll talk about that in a second. There's another hand here. Is that program manipulating types as values at compile time, like you would do in the Zig programming language? So, the question was about whether it is manipulating types here. Or, something's kind of interesting about the types here, certainly. So for instance, what's the type of B? So what's it doing with the types there? Okay, it's static. We sort of know static from C and such, something about memory storage. Immutable, some sort of qualifier; it turns out it's stronger than const. But what's the actual type? Well, there actually are some types being inferred here for us, like auto in other languages. Now I will let you know, and again, I'll repeat some of these details: D is statically typed, so at compile time, yeah, we do have to make a decision about what the actual type's gonna be and what's returned. Yeah, this is great. I'm gonna advance one slide forward here, and you'll see what the label is on the program here on the D language homepage. And it's sorting an array at compile time. And that's kind of cool. This is usually the first example that comes up here. And I've got a description of the stuff that you folks recapped very nicely. But let's actually, well, we'll run or look at a few code samples, but I think we should at least look at this basic one here. Let's make it a little bit bigger here. Just to get a feel, again, this is the same Hello World sort of program. Well, this is maybe even after Hello World, I would say. But interesting enough here. And let's just go ahead and compile it. So with dmd, again, I'm looking towards the bottom of my terminal here, I'm gonna compile it. This program I called compile_time_sort, .d for the extension. And the output file is going to be prog. And as soon as I hit enter, interesting here. It's finishing compilation here. And boy, I didn't run the program. I'll tell you, I didn't run it.
But while I was compiling it, yeah, there is something interesting going on here. It is called compile-time sort, so you might have guessed that. But interestingly, and this is one of the big "why should you care" things to look out for in languages that you care about: we can do computation at compile time. So this is a really powerful feature of the D programming language, the D compilers specifically, that we can take something like an enum, something that would maybe be a constant, right, in another language, set some values here, like an array, and then actually evaluate it with sort. But again, if you look at sort, this looks like a function that you might just call in your regular programming language, right? So there's nothing really different between the compile-time sort and the run-time sort. That's probably what we want, right? To be able to execute as much as possible at compile time and save our work for when we're actually running, if we're aiming for performance. Of course, there are always trade-offs for that. You might notice it takes a little longer to compile. Again, let's go ahead and compile it. Again, pretty fast. Actually, we're gonna talk about how fast the D compiler is later here. Now if I actually run the program here, prog, right, we just get "hello, FOSDEM", because that's the actual run-time computation that's going on, okay? This part here, this is the only thing we're really doing at run time. Now if we go on and later do something with B or print it out, we'll get our sorted array, but that's the point there. So, already kind of neat. This is kind of an attention-grabbing thing. And again, something that might be new depending on what programming languages you've looked at. And, again, one of the things that certainly caught my attention. All right? And I mean, there's some other interesting stuff here, like, I think it was mentioned before, the quoted string before writeln, the dot writeln.
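For reference, the homepage sample being walked through looks roughly like this. It's lightly adapted here: the sorted result is stored as a plain array so it's easy to inspect, and the "Hello, FOSDEM" string stands in for whatever the live demo printed:

```d
import std.algorithm : sort;
import std.stdio : writeln;

enum a = [3, 1, 2, 4, 0];                   // a compile-time constant
static immutable b = sort(a).release;       // sorted during compilation
pragma(msg, "Sorted while compiling: ", b); // printed by the compiler itself

void main()
{
    writeln("Hello, FOSDEM"); // the only run-time work
    writeln(b);               // already sorted; no run-time sort happens
}
```

The `pragma(msg, ...)` line is what makes the compiler emit the sorted array during compilation, before the program ever runs.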
Okay, we'll talk about this. It's called uniform function call syntax, but you know, some nice potential quality-of-life features for us. All right, so that was our pop quiz, the only pop quiz we have here. But I do invite folks to raise their hand high if they see something interesting as we move forward. All right, so again, the sample and why you might choose to care. Just to go back, we call this CTFE, or just Compile-Time Function Execution, this idea that we can do work at compile time. And we know a lot of other languages have templates or some sort of mechanism to do this, if you're coming from a C++ background, up to various extravagant levels of metaprogramming that you can do. Other languages might do this a little bit more explicitly otherwise, but that's the idea with the D compiler. So a big win in my mind. And a big win, how clean this syntax is. Okay, so a little bit about this D programming language. Somebody mentioned it kind of looks like C plus plus plus plus. Yeah, so a little bit of history here. Walter Bright, who's highlighted there with the arrow, that's him at DConf a few years ago, two years ago now. He was the initial creator of the D programming language. It was called the Digital Mars compiler originally. But folks kept saying, hey, it looks like C++ plus plus or whatever, and they just started calling it D, and that just sort of stuck. So that's what we got here. So a little bit about Walter: again, he's a compiler expert. He's worked on C compilers, hence why there's sort of a C compiler in the D language, and C++ compilers. And then of course, he thought about it for a while and said, well, I'd like to make something new, something that's fun and efficient to program in as well, and that's where D sort of came about. And then also, a major collaborator was Andrei Alexandrescu, who joined around 2006 or so, and then for the next 10-plus years was a very active contributor in building what we now use as D2.
And we actually have other audience members who are contributors. I don't know if you want to out yourselves; you can raise your hand, but you don't have to. So anyway, so there's a full history with the D programming language, and a really interesting article if you want to learn about the history and the origins, about how it evolved and the whys of doing things in the programming language. Again, that can be interesting sometimes: if you know the historical context for why things look a certain way, sometimes that helps you understand when or when not to use a feature. So anyways, that's just a little bit about the history of the D programming language here. So again, what is the D programming language? Still on the front page: it's a general-purpose programming language with static typing. So whether or not you see those types, they can be inferred. It's a systems-level programming language, so you have low-level access to things like pointers, for instance, and you get the C-like syntax. So it's relatively familiar, again, if you've used C or C++, right? I imagine pretty much everyone who raised their hand having heard of it knew it as, yeah, something like the next C or whatever. But the mantra with the D programming language, at least on the home page, is write fast, read fast, and run fast. So we'll try to see if it holds up to those things, and again, why it might be a good choice for playing around with, or maybe for your next open source project. So over the last 25 years now, there have been three compilers for D. There's the DMD compiler; that's the main one that Walter has and works on. And that compiler is completely open source. So you can dig into it, you can make a fork of it and modify it and play around with that DMD compiler. And it's a very, very fast compiler as far as compiling your code. So you can compile the actual D compiler, I want to say, in a matter of seconds, tens or hundreds of thousands of lines of code.
And that has in part to do with the module system, being able to do concurrent builds, and how many passes it does over the language. But it's very, very fast. Your edit-compile-run cycle is very quick as you're iterating and doing development, which I find important. There is also, equally as important, the GDC front end for the GCC compiler suite. I think it was around GCC 9 or 10 that it was added in officially. So you've got the front end there, with Iain Buclaw working on that, and LDC, worked on by Martin, which gives you all the LLVM infrastructure. So if you're trying to target lots of different platforms, for instance, the LDC, or LLVM-based, D compiler is available for that. So you've got three compilers, which is great. So you don't have to worry about it disappearing anytime soon. And it is very common for D programmers to take advantage of the very fast edit-compile cycle with DMD. And then when it comes time to build an optimized build, when you want to take advantage of all your GCC toolsets and infrastructure, or your LLVM infrastructure and all the optimization passes, you can use those compilers afterwards. So as far as downloading the tools, I don't need to spend too much time on it. But again, if you're on one of these platforms, you probably have a way to get the D compiler built for that platform, or otherwise there is a zip file or something on your package manager available. And with the D programming language, you get a package manager called dub, which will help you manage dependencies, bring in packages, and these types of things. It's also sort of a lightweight build tool as well. There are other tools that you might expect, like dfmt, which is being worked on and already exists for code formatting, and D-Scanner, which is like a linter. And if you're a VS Code user and want IntelliSense and these types of things, there's support for that, as well as for IntelliJ. Okay. So D, where is it being used right now?
Again, we've heard of this language. Maybe we've used some of the applications without realizing that they were written in D. Again, from the website, lots of different companies have used it internally. Again, folks like myself just use it for our own projects or research. But I think D has done a really nice job finding itself in various performance-based niches. From some of these various companies, there are different stories about how different tools were being used, which I'm happy to go into. So I want to go ahead and show a few. And this was another built-in-D tool. I tried to pronounce it correctly; I think it's Eilmer, but it's a compressible flow simulator. Okay, super cool. So they're doing computational simulation, something very expensive to do. So this tool is now 10-plus years old, being used by various PhDs and postdocs and researchers. But again, why should we care about this tool other than that it generates really pretty pictures? Their website has some really beautiful pictures. These are just the ones I sort of understand, so I could speak to them in case anyone asked a question. But again, it's a project that's been around for 10-plus years. Most of the code is in D, and it's showing off high performance. And I thought this was a great message to share from their GitHub, saying: our focus is on open source development to give a simple access point for doing gas dynamics research and teaching. So what a great place to start if, again, you're in this area and want to look at some open source D software. Okay, so that's a nice tool. Getting back to some of the D language features. I've sort of already thrown out one of the main big ones here, the compile-time function execution, which, again, we're starting to see in more modern languages, but that's sort of a staple of D and why I think it's really interesting. But the language itself has a lot of really nice quality-of-life features.
So these are things like: you get a bunch of built-in data structures without having to import anything. Dynamic arrays, associative arrays, or maps or dictionaries. They're bounds-checked, which you can enable or disable; there's always a path to performance here. You get things like your lambdas and delegates. The object-oriented and functional styles, generic programming, design by introspection, concurrency paradigms, all of that. Again, as I said, it's a really big tool. We can't cover all of it, but there's probably something interesting here for you, or a domain where you might expand. I personally found that I started doing more functional-style programming when I started using D, because it was very accessible in their standard library. The D language also is garbage collected by default. But you can turn that off if you want. You can malloc and free. You can do reference counting. You can implement your own strategy from scratch if you want. There is a question, and I'll repeat it. Yeah, so the question, just to repeat, and I'll break it into two, is how granular is this ability to turn off things like garbage collection, if you do need performance in a certain section of your code. It's as granular as putting an attribute on the function. You could put @nogc on it, and in practice, no garbage collections will happen in that section there. And there's more in the actual tools: you can do a GC.disable, which I think is similar to what Java and other languages have. So you get that granularity. That could be at a function-based level, saying this code, no GC, and being able to handle it. The array bounds checking, I know that is set as a compiler flag. For that one, I actually don't know the answer as to whether you could do it on a per-function level.
What I would say is, if you wanted an array that wasn't bounds-checked: there is, I think, in the standard library, one of the standard array containers that doesn't do allocations, so you don't have to worry about that container and garbage collections. But for the bounds checking, to be sure, you could implement your own dynamic array, no problem, just like you would in C, if you want that granularity. I will also show, what will I show here? Yeah, so does that answer the question? GC is as granular as garbage collection per function; you can enable and disable. And then for the array bounds checking, you can always implement your own, but there is a compiler flag for on or off. And typically, folks would use that, again, for that last little performance gain, if they're building a video game or something and are super certain there's not going to be any array access out of bounds, because typically you know the fixed-size allocation, so you would just turn that off. Perfect. All right, questions, or features that look exciting here? And there are lots and lots, and the point is you have control, which is really, really cool, for what you need. And we're gonna even dive a little bit further into this. There's some other cool stuff you can do if you only need a subset of these features. But let's continue getting inspired here. So we've got a standard library. So again, batteries included; like pretty much every other programming language these days, you have to have a standard library with containers or data structures, various algorithms, right? We've already seen sort in the very first example, but there are things like map and filter and fold and so on. There are various concurrency primitives and so on, and we'll take a look at some of those. So you have a pretty decent standard library here, and there's discussion about expanding and refactoring it and so on.
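A small sketch of the granularity being discussed: @nogc enforced per function by the compiler, plus GC.disable from core.memory for a whole section. The sum function here is just an illustration:

```d
import core.memory : GC;

// @nogc is verified by the compiler: this function may not allocate
// on the garbage-collected heap, or the program won't compile.
@nogc int sum(const(int)[] a)
{
    int total = 0;
    foreach (x; a)
        total += x;
    return total;
}

void main()
{
    GC.disable();             // pause collections for a hot section
    scope (exit) GC.enable(); // restore on the way out

    int[3] data = [1, 2, 3];  // stack storage, no GC involved
    assert(sum(data[]) == 6);
}
```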
So most of the common stuff you would need: handling JSON, CSV, files, and so on. So that brings me to another "built in D, why do we care" here. I'll get into my code, so be easy on me. So here's just the type of way that I started using the D programming language: writing these little scripts, 50-line, 100-line throwaway code to automate some tasks that I'm doing at my desk. I found myself doing a lot of queries to YouTube to gather data about what videos have been published in a channel or what videos are in my playlist, these types of things. So what was really nice was just to find that there is std.net.curl in the D standard library. And then I could just build a query string and then effectively make a query and retrieve my data from that curl request. And then I have std.json, and then, again, if I'm retrieving JSON data from some API, again, a common format, I can work with that data as needed here. And then you've got other sorts of quality-of-life things like range-based loops. So you can go through the keys, or you can pull the keys and the values here if you wanted to iterate through them as well. So it's a nice little script; you end up writing a few of these here. So there's one example with YouTube. I do this a lot for GitHub too, for, again, pulling repos, looking at them, pushing code to students. So again, it's the same sort of pattern that I'm always using with any REST-based API where I'm pulling data in. One little interesting thing here, looking at line 53: we can start to see that if you want to set various event handlers, again, here's just a little example of a lambda function here. You can have anonymous functions. You can have delegates and these types of things in the D language. So nice little quality-of-life things here. Okay, so this is kind of interesting here. My little scripts, and I'm sure many of you folks have your shell scripts or Python scripts or whatever. And again, that's what happened to me.
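The JSON side of such a script can be sketched like this. In the real script the string would come back from a std.net.curl get() call; a canned response is used here to keep the sketch self-contained, and the field names are made up rather than the YouTube API's:

```d
import std.json : parseJSON;

void main()
{
    // Pretend this string came back from std.net.curl's get(queryUrl).
    auto j = parseJSON(`{"items": [{"title": "Intro to D"}]}`);

    // Range-based loop over the parsed array, as described above.
    foreach (item; j["items"].array)
        assert(item["title"].str == "Intro to D");
}
```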
I had a bunch of shell scripts. Mostly I had scripts in Python. And then I just started translating them to D, because again, I liked it; it was a little bit less cognitive overload for me. Again, if I'm working in C++ and D, they're pretty similar in how I can think about some of those things. But it's sort of interesting that when I'm using D, I'm still effectively executing my scripts like I do in Python. Okay, so let me go ahead and explain this here. And what do I mean by that? Yeah, question first. Let's see, line 54. Maybe a bug or something there, line 54. Sorry, I didn't hear. "Unrecieved." Oh, the E and the I backwards, uh-oh, okay. I knew I shouldn't have put my code here. Good catch, I'll fix it in the post, yeah. Gotta do some fixing tonight. But the good news is, right, we can iterate quickly. So I'm gonna give you an even faster tool that I use to iterate and run these scripts. It's just a little helper tool called rdmd, run-dmd, that basically just does on-the-fly compilation. And it compiles, you know, as fast as your D compiler, dmd, basically does, but then it'll just execute your program immediately. And the advantage of this is that you can then use D like a shell scripting language, right? You can actually, if I can read down here, I'll try to highlight my cursor, I know it's a little bit small here, but you can just put in the pound-bang sign, #!/usr/bin/env rdmd, chmod it executable or whatever, and then you just run your program, just like a regular script. So again, that's a really nice way to, if you need to, transition your scripting language to something that's statically typed, or you can just think in the D programming language rather than in multiple languages. I found that a nice quality-of-life improvement. Again, I understand I'm the enthusiast here, but I found that a really big win for me. So rdmd is available.
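So a complete D "script" can be as small as this sketch; save it, mark it executable, and run it directly thanks to the shebang line that the D toolchain understands:

```d
#!/usr/bin/env rdmd
// chmod +x hello.d && ./hello.d
// rdmd compiles on the fly with dmd, then runs the result immediately,
// so this behaves just like a shell or Python script.
import std.stdio : writeln;

void main()
{
    writeln("Hello from a D script");
}
```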
With the LDC compilers, you also have this available as well, with ldmd2. I haven't checked the GDC one, actually. So that was really cool. So, you know, generally speaking, my experience, because I was somebody who runs little scripts, was: if you use a compiled language, generally, and you gotta be careful talking about performance, you get better performance than with an interpreted scripting language. So again, a big win for me and my projects. But there is still more to this performance story beyond just switching to a compiled language here. Because I started stumbling upon other really cool things in the D programming language that the community pointed me to. I started doing this in my scripts here. So you'll see here highlighted, let me draw your attention towards the top: .parallel here. So I just kind of stick that on the end of some collection or some array. And basically what I get is the equivalent of, for those of you who've done OpenMP, a parallel for loop here, right? We're able to launch multiple threads here. That's a small change that you can make, right? If you don't have any dependencies on the data in between; you still have to think about it, certainly, to make sure you get correct code. But imagine just going through all of your range-based for loops and adding .parallel. And if you're doing separate tasks, getting a performance boost, right? Use your CPU. You paid a lot of money for it, so put it to work. So again, a quality-of-life feature there. Now, does it make things faster? Again, you have to profile. You always gotta check these things out. So, you know, maybe a better use case: another open source project from a D conference, just a standard, you know, hello-world ray tracer project where I used standard parallelism.
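That .parallel pattern can be sketched with std.parallelism from the standard library; the squaring loop here is just a stand-in for independent per-element work:

```d
import std.parallelism : parallel;
import std.range : iota;

void main()
{
    auto results = new long[1_000];

    // Each iteration writes only its own slot, so there are no data
    // dependencies between threads and .parallel can be bolted on.
    foreach (i; iota(results.length).parallel)
        results[i] = cast(long) i * i;

    assert(results[10] == 100);
    assert(results[999] == 998_001);
}
```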
And again, if you're working per pixel or doing something graphical, right, you have a lot of pixels, however wide your resolution is, a thousand pixels, something of that nature. You can try .parallel on it and see if it speeds things up. And of course, my performance wizards will have opinions here: you're launching too many threads, or what's going on, you know. So, you know, does it make things faster? I'll get to that in one slide here. Because I also see something interesting that I've touched on but haven't explained. What's going on in this foreach loop? Foreach Y and foreach X, okay, those must be like the pixels going across and up and down. Okay, so there are a lot of them. But this next part's kind of interesting. Okay, I've got a camera dot getScreenHeight dot iota, which is like a range, and then that dot parallel. Well, what this is, is an example of that uniform function call syntax, this idea that we can sort of chain functions together with a dot. Again, maybe you've seen this in other programming languages. Maybe you've implemented design patterns that allow you to do this. But it's a really nice quality-of-life feature if you just compare the camera dot getScreenHeight dot iota dot parallel version versus, you know, trying to figure out how do I nest these things: parallel, okay, iota, and then you're counting your parentheses, or, you know, you're hoping the editor counts them correctly for you. Again, just a little quality-of-life thing, more readable code, and you can actually think and sometimes see, like, oh yeah, I see that is just a range there. Maybe I can parallelize it. Maybe there is some data-independent thing there. So anyways, that's just, you know, following up on that.
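The chaining-versus-nesting point can be shown with a tiny standard-library example (not the ray tracer's actual code):

```d
import std.algorithm : map, sum;
import std.range : iota;

void main()
{
    // Nested calls read inside-out, and you count parentheses:
    auto nested = sum(map!(x => x * x)(iota(4)));

    // UFCS chains the same calls left to right with dots:
    auto chained = 4.iota.map!(x => x * x).sum;

    assert(nested == 14 && chained == 14); // 0 + 1 + 4 + 9
}
```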
And then as a little aside, and you can look a little bit more, there is a built-in profiler in the D compiler for seeing how many times a function executes and how much time you spend in it, and there's also a memory profiler so you can see how many garbage collections you're doing, if you're using the garbage collector. Okay, so they're built into the compiler; you don't have to search for them. You know, I do use other tools like perf, and choose your favorite tools, but it's nice that it's there, okay? It's an easy tool that you could build into a continuous integration system or whatever. Okay, so, speaking of graphics projects, again, that's sort of one of my passions, and it turns out that D is a great language for building graphics projects. So, you know, the must-have pretty-picture slide, and there are actually games and physics, you know, if you click into this. The cool D language project Dagon here is a game engine, so, you know, something sufficiently complex. Why do we care about this, though, other than it's pretty in a slideshow? Very, very beautiful. Lots of hard work there. But again, just to see a substantial project by engine and graphics developers: you can see how it's laid out, how different core systems are laid out. Again, it might be interesting for you to, again, think about, if you're going to use D for building games, how you organize different components and game objects and these types of things. And you can kind of look through the directory structure. D uses a sort of directory structure for packages, like Java or other languages, and that's kind of interesting. And there's also just a fun comparison to C++ here if you want to see the video. It's not really to say anything definitive. Both these applications are very GPU-bound, so that's sort of the point, right? Use the language you want, and if you're GPU-bound, that's all on the GPU anyway, so, you know, you can think about those tradeoffs.
So there's one game engine. Another one, Dash; this is a cool one, I think it started off as a student project, and then it gained some steam with several folks. So there's a little game they made. Why do you care about this? Well, you know, I spent just a few minutes looking at the code to see how things were structured. And very interestingly, they were using this idea of mixins in their code. How many folks, just as a survey, have heard of a mixin, by a show of hands? Okay, we've got about 40% or so, around there. But that's the idea that you're literally just taking a string and pasting it in as your code, and it should be valid D code that gets compiled. Sounds trivial, sounds like, kind of, why would you do this? But it makes sense in use cases where you've got graphics code, where you can just import or paste in some shader code and do a mixin. Or maybe you can use other compile-time techniques to build out a string at compile time and then generate code. It's a very simple idea with which you can compose and generate some really cool graphics things. I think it tends to work well in the use case that the game showed. Another later project here, Hipreme Engine. So they built, you know, some nice stuff. Why do you care about it? Why should we look at it? Well, Hipreme is very active in the community, so a good person to know, for one. But it's a really interesting example of just seeing how to support multiple platforms. So again, Hipreme can build a D project on PlayStation Vita, Xbox, Mac, iOS, Android, et cetera. Just to see that that's accessible, I think that's a project worth studying, to see, you know, how did they get there? Okay. All right. So there are lots of other graphics resources. Mike Parker, who's a member of the community, has done a great job with common libraries and graphics stuff, sort of an FYI.
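The string-mixin idea being described boils down to a few lines; the generated function here is a toy of my own, not anything from Dash:

```d
// Build a string at compile time, then paste it in as ordinary D code.
enum name = "square";
enum code = "int " ~ name ~ "(int x) { return x * x; }";
mixin(code); // compiles exactly as if we had written the function by hand

void main()
{
    assert(square(5) == 25);
}
```

The same mechanism is what lets a game engine assemble shader boilerplate or repetitive declarations at compile time instead of by hand.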
We're talking about open source today, so I'm going to sort of ignore the commercial game projects done in D, but there are a few interesting talks, again, if that's your sort of domain. And, okay, so talking about a few of the other D language things of interest: the paradigms, okay? Because again, I said when I started using D, I started doing things more functionally. I started thinking more about concurrency. I started thinking about object-oriented programming, I think, in the right way. At least, you know, how message passing is supposed to be one of those pillars of object-oriented programming that kind of gets forgotten sometimes. At least that's what I think of with object-oriented programming. But anyways, just a couple of examples. You can take a peek at these again after the talk, but I've got the range-based loop here, and then I've got sort of the mantra of no raw loops: get rid of those raw loops and just use functions like filter or, you know, these types of components here. So again, very nice, often easy to substitute; often you find instances where you can just do a .parallel much more easily. And on the right here is just a classic: you've got an interface, and you want to create a type of dog, a husky, golden retriever, you know, your favorite dog, Belgian Shepherd, et cetera. Okay, and then I can't leave D without giving a hello world of metaprogramming, because that's really, again, one of the strengths here, right? We talked about stuff that you could do at compile time. So just a sort of simple function here. It's called printData. So I'll draw your attention towards line 38. T is the template parameter, so there are no angle brackets; you just put the template parameters right after here. So T, whatever the data type would be, and I've got another T for whatever that type is, and then the struct. Okay, what is the struct, and why do we care about it?
Well, we care about this struct only if it has members, right, attributes called memory and elements. Okay, so memory might be a chunk of the, you know, I don't know, some attribute, and elements is maybe, again, an array of the data. So what's sort of interesting is, one, you can think about this as a sort of template constraint, or a concept, again, depending on what language you're coming from, that has to be adhered to. So I can only use this templated function on structs to print their data if they have memory and elements. Well, I think that's kind of a nice constraint to think about, or to have that ability to do it. So that's kind of interesting here. If we have time at the end, I'll flash some of the examples that I'm going to put in the GitHub repository for other introspection things you can do. You've got a traits library, so you can see, you know, what member functions you have. Is this thing a unit test? Does it have some attribute on it, like no GC or whatever? A question? And the question was, is there static if? There is static if. There's static foreach. And the follow-up question was why it's not static here. Here, it's, I guess I could make this static. I don't know if it's implicit actually here. I need to think if it is or not. Yeah, I guess, yeah, we don't need it because technically we wouldn't generate this template if it wasn't valid, since that's happening at compile time. Okay. So, you know, here's, you know, leading us towards the end. So, I know, I've gone through this. I've tried not to make it a sales pitch, just to show you things that I'm excited about. But if you're not ready to try D, there's still yet other interesting things in the compiler. There's something called Better C, which is a subset of the D language. And basically what this does is it gets rid of or sort of removes a lot of the language runtime.
So, this is if you want to do some more like bare metal programming, for instance, and you don't want to carry the standard library, Phobos, or you don't need some of these other features. You get most of the quality-of-life things, like the bounds checking with arrays, you get slices for working with them, you know, delegates, lambdas, all those nice things, all the compile-time execution, but you can sort of just use it as a better C language. Some of the stuff you're starting to see in C23, for instance. And there's a really nice talk introducing that on kernels and how they're using Better C for kernel development. So, again, getting into a little low-level stuff here. So, as far as learning more about the language, again, great tour on the website. The good news is, you know, anybody who's written a book on the D programming language, and there's seven or eight, I think, they're all good books, right? They're all written by enthusiasts, reviewed by the community. These are the first two I'm going to recommend that folks who are beginners take a look at. They're more, you know, for an audience who knows how to program, and they'll get you started here. Forums and Discord, otherwise, are very active as well. YouTube, that's me. And then teaching the D language. So, you can hear it from my perspective again, but even better if you hear it from the students, right? Their unbiased thoughts on what the value was, if it was useful for them. And the last sort of resource, as we're kind of wrapping up here: again, from Andrei, he wrote this really nice piece here called The Case for D. This was in 2009. I think a lot of it still holds in a way, but, you know, basically, he summarizes it as a high-level systems language where you can be productive and enjoy coding. That's what I found. You know, maybe you'll find that too. Okay, again, that's up to you to decide. I hope I just shared some cool stuff for you to get excited about otherwise.
So, again, why do we care, maybe, as open-source developers? You know, you've got a readable, writable, performant language that hopefully gives you a lot of quality-of-life features like fast iteration time. You know, I think there's a competitive advantage here with any project. I found it with my students. Again, that's something you'll have to test, but that's what I found. My students get further using D than other programming languages. And there's three compilers available. You don't have to worry about it disappearing or, you know, other stuff, you know, going on here. All right, what's next for me? Well, I talked a whole lot about graphics. That's my passion. That's what I've worked in. But I'm now working on learning a web framework called Vibe, which is super cool. If you're more on the web side, there's a great book about it to get you started on building, you know, scalable and performant web applications. Alrighty, so we learned a bunch of things. Here's sort of a summary slide on some of our takeaways. Again, I'm going to leave that wall of text for you because I want you to leave excited and not tired from reading. I just want to go ahead and close off by thanking you. I'm going to be around so you can ask me any questions now or after as well. Thank you. Thank you. A question? What would you say: why Rust, or why not Rust and why D? That's a good question. I don't want to pit languages against each other, so it's why Rust or why D? What I would say, because that's a hot question I get asked a lot, is that D's code is very plastic, the plasticity is high. Meaning I can mold it and change it, which I very much like. In a way that, again, I'm not as much a Rust expert, I've used it a little bit, but D's plasticity is very good. It writes how I want to write the code. It's got the memory safety with the garbage collection itself. I find it very, very productive.
I find if you're going to write an application, again, I'm in games and so on where there's lots of mutable state, D's a perfect fit for that, for writing safe and maintainable code that I can change later. Yeah. So the comment was, coming from C, this was sort of easier code to see and to read. Yep, yeah. That's the other draw. It's easy to get into. Yeah. Another question? The UFCS looks very cool, but how do I know if it's like a free function or if it's a method of the object that I'm calling? Because it was all the same color in your Vim and I was like, oh no. Yeah, when you're doing the dot, so a few nice things that the D language does when you're working with pointers and classes: one, you know, if you're coming from C or C++, there's no arrow. So, you know, you don't have to worry about that. Everything's a dot. But then the idea of, is it a variable that I did a dot on, or a function call? Usually parentheses are not required if the function doesn't have any parameters, you can leave them off. I usually just put the parentheses after. Otherwise, this is something the language server protocol and your text editor make easy enough. It's not usually a problem. Yeah. Alrighty, thank you.
First Aid Kit for C/C++ Server Performance
And today I will show you some of the most common performance issues which I have seen so far in my career, how to fix them, and the benchmarks which show the numbers, what kind of performance increase you can get when you fix the stuff. My talk will follow the plan on the slide. So I will first present some issue categories where you typically lose most of the performance, at least in my experience, like I said. And then for each category we will go through specific topics, what you can optimize and how, and what kind of numbers you can get when you optimize. And then some sort of conclusions. The topics are on the slide, we just go through them one by one. The QR code right now is not working, it will be working after the talk. Everything will be online, clickable, you can just walk through it again to repeat the recipes if you need. So, backend performance: at least in my area of work, backend performance usually means one of those three things, latency of your requests, CPU and memory usage on your machine, and your throughput, which is how many requests you can process per time frame, which is usually expressed as per second. So requests per second, RPS, and we want to improve this stuff. And there are those bad places where you can lose performance in those three categories, which are inefficient, suboptimal heap usage, unnecessary expensive thread contention on critical paths, and inefficient networking, like inefficient network IO or inefficient socket scheduling and things around this stuff. And like I said, we just go through each and see specific cases. Starting with the heap: to understand what you can lose here, you have to understand how the heap is working. It's enough to understand the basics, you don't need to know specific implementations. But the basics are that this heap thing, it's a data structure, like some sort of tree, hash table, whatever, it's global in your process, and it is used by all the threads.
When you call new or malloc, they go into the heap and fetch a free block of the specified size, return it to you, and you use it. When you call free or delete, the block is placed back into the heap. And this operation of finding a free block of the needed size, or placing a free block back, takes time. So this lookup thing in the heap, it's not free and it's not constant time. It's some lookup time, which depends on how big the heap is, for example. So if we imagine that this is a tree which stores blocks sorted by size, then lookup time will be something logarithmic or so; it doesn't have to be a tree. But the point is: the bigger the heap, the more expensive the lookups in the heap are. Also, like I said, this is a global thing in the process, usually by default, which means that you will get thread contention on this thing if you use it extensively. For example, if multiple threads are allocating blocks of the same size very frequently, you will have thread contention. The heap does have mutexes inside, and you can even see them sometimes in the flame graphs. To make it worse, if you are writing in C++ and you're happily using those nice fancy containers, list, vector, queue, stack, the unordered containers, forward list, all this nice, easy-to-use stuff, you have to realize that even if you don't use the heap explicitly, it is used inside those containers. Vector is basically a dynamic array, map is basically a red-black tree where nodes are heap-allocated, list allocates link containers for every item, and so on. So you use the heap even if you don't do it explicitly. To make it even worse, after that you have to remember that allocations affect each other. Like I said, the more allocations you do, the slower the next allocations and freeings will be. The used heap becomes bigger, it becomes more fragmented, less optimal, and it gets more and more expensive to use. What can we do about this stuff?
Firstly, you can try not to use this stuff. You can just not use the heap when you don't need to. For example, when you can just allocate stuff on the stack: when some object or array is small enough and its size is known at compile time, just declare it on the stack and use it, if it doesn't have to be something long-living. Another frequent case which I see is when we have a class or struct and we store something in there by pointer, and the lifetime of this object is equal to the class where it's stored, right? Then just store it by value. You will reduce the number of heap allocations. When you cannot get rid of the heap allocation and you have it in some critical path which is very frequently used on your server and you see it in the flame graphs, you can still do something about it, you can optimize it. And there are some easy ways how you can quickly regain some performance back. We will start with the object pooling thing, which is not as simple as it sounds. A typical, very widespread use case in the backend: we have this server, requests are coming to the server. Each request is read from the network, parsed, allocated into something like struct request or class request. It can be big, one kilobyte, five kilobytes of different data, different members, attributes. Then you place it into your business logic pipeline. It is processed; in the end, it is deleted. And this process is repeated again and again for every request. And if the request is big enough, like one kilobyte, and you do it frequently enough, like 100,000 times per second or a million times per second, then you will get heap issues here, because this heap allocation and freeing will get expensive for such a big object of one kilobyte or more. And you can see it in your flame graphs sometimes, if you are building them at all. Example of the code: so we have this class request with many members. Some of them can be indirect members.
For example, we could inherit this from a base request, which inherits from another base request, and so on, and it can pile up. So in my current project, the size of this thing is two kilobytes from those many, many, many small members. And then we have this business logic pipeline, like process request, and it allocates the request object, fills it with data, and when the request is complete, asynchronously somewhere it is deleted. This thing, those two lines, will get expensive if done frequently enough and if the request is big enough. The effects of the heap here can be mitigated quite easily. If, instead of using the heap all the time for allocating and freeing stuff, we just allocate it once, then we can reuse it again and again. So we use the heap just once and then we don't use it. And we avoid the heap issues. This is called object pooling: you allocate stuff once, store it in some sort of pool, and then you take it from the pool and place it back, bypassing the heap. Even though the first time you do allocate it on the heap. Then what you get from this is that, firstly, you do not pay for the lookup time in the heap. If you remember, the heap is storing those blocks of different sizes, sorted somehow, so it needs to be something like a tree or hash or whatever. But here all the objects are of the same size. It can be just a list or a stack, right? Allocation and freeing can be done in constant time. We do not pay for lookup time anymore. Secondly, you can deal with concurrency in a more efficient way than the standard library. I mean, you can of course switch the heap to something like jemalloc or tcmalloc, right? We heard about it. It can make stuff actually faster.
But if you do not want to, or you have to have more control in your code over those things, and you have this pooling thing, you can implement the concurrency yourself, and you have to agree that doing a concurrent stack or concurrent list is obviously much simpler than doing a concurrent tree or concurrent hash table or something, right? It can be done much simpler. Let's try. This is how I tried the first time. It's a good first try, right? Kind of. It's simple. That's why it's good. Sometimes it's even good enough, right? We don't need to over-engineer things. But in this case it doesn't make much sense, because if your code is very hot and you suffer from heap contention and you change it to this, then it will get even worse, because you will exchange heap contention for mutex contention. And secondly, you are still using the heap, because if you are storing in an STL container, any of them, you will be using the heap, and we don't want to use the heap. So it cannot be done this way. But it can be improved. It's not a dead end, right? This is how it can be improved. And the alternative is to add local pooling. So what we have is, instead of a single global pool for everything, we have one global pool, and also in each thread we have a thread-local pool of limited size, in addition to the global pool. When threads are allocating something with new or malloc or whatever, they take objects from the local pool, not from the global one. And when they free objects, they place them back into the local pool. And this local pool can be done very, very simply. It can be just a list, an intrusive list, and that's it. It doesn't need mutexes or anything, because each of those local pools is used exclusively by one thread. But when the pool inside some thread becomes empty and it wants to allocate more, it will take a batch of objects from the global storage and will reuse this batch until it also ends, and so on.
On the other side, when they are freeing stuff and the local pool becomes too big, because it's limited in size and cannot grow infinitely, they will move it back into the global storage so that other threads can reuse it. This way we get, firstly, that the heap is used rarely. It is used in bulk, when it is used, to allocate many objects at once, not a few, like 64 or 128. Also, it will not be used at all after some point, when all the pools get saturated. And finally, there is no contention on the single global pool. This global storage can be protected with a mutex, but it is used so rarely that this mutex contention will not be visible. It will be used at most every, like, 64 allocations. So it's 64 times less contention, which means it will be basically almost zero, negligible. If the explanation was too bulky, I prepared an example of how it works, like a real-life example, how it could look. Imagine that we have those three threads and an empty global pool. All is empty in the beginning. The first thread wants to allocate something. It will take a look at the global storage. There is nothing, so it has to allocate a new batch. New batches are allocated on the heap. But then when it allocates objects, they will be taken from this batch. No more heap allocations. Just one heap allocation. And then from the allocated batch, we take objects one by one. Then the second thread. That's the same. Its local pool is empty, nothing in the global storage, so it had to allocate a second batch. They keep using the objects from the local pools. So far, we only did two heap allocations. But then something happens which happens very frequently in backend code: those objects, they migrate into another thread. It happens when you have dedicated threads for networking; they read data from the network, they parse it, create this struct request, push it into some sort of queue. And this queue is consumed by other threads doing business logic, and they will delete the request.
So most often, it happens that you have one thread allocating requests and other threads deleting requests. Objects will migrate. So here they migrated. And this other thread completed them somehow and tries to free them. They do not fit into its local pool. It is limited in our example to four. So the fifth item didn't fit. And to fit more, it will have to migrate this pool into the global storage. And then it can keep freeing stuff. Now, a little bit more random work happens. Some more migrations. And then we are in a situation when the second thread wants to allocate something. But it doesn't have anything locally, so it will go to the global pool. And this time, we have a free batch. So we take it. And we use this batch. So far, during this entire demonstration, we have only used the heap two times for all those allocations. And at some point, after some more work, we will not use the heap at all. It will be all saturated. Work continues like that. How could it look in the code? Yeah, visible, good. I have this benchmark, the link is on the slide. Everything is already open. You can reproduce it yourself. I have this value type whose size is configurable at compile time via templates in C++. And I'm testing it with sizes one byte, half a kilobyte, and one kilobyte. And I also have the same value, but thread-pooled by the algorithm which I just explained before. And in C++, no matter how much we can argue whether it's good or bad, full of unnecessary stuff, templates are sometimes very nice. In this case, I implemented the pooling in templates just once. And what I have to do is simply inherit this magic thread-pooled class, and my class becomes thread-pooled. I can simply apply it in as many places as I need, and all the types will become thread-pooled with their own independent pools. So I'm comparing value versus value pooled. The comparison itself is that I have many threads. Each thread allocates many, many values, and then frees them in random order.
And then again, and then again. And I am testing how fast this freeing and allocation is, depending on the number of threads and so on. And those are the results, which were surprising for me, to be frank: even for the single-byte case, I got a speedup. Normally, the heap is very fast for small allocations like those few bytes; the standard heap is actually extremely fast. But somehow my pooled version was even faster than that in the single-byte case. And my most interesting, relevant case was twice as fast, which was good enough. And it can actually be quite visible in the final RPS. So of course, you have to benchmark everything. You shouldn't just blindly make everything thread-pooled, thinking this stuff will get faster. It probably will not. You have to apply it case by case, measure performance, see how much it helps. I have seen in my experience that this can help and can be observable in the final RPS. This simple thing. What else can we do with the heap? Intrusive containers. The problem mostly comes from STL again, from STL containers, those list, map, unordered things, forward list: the thing which unites them all is that they are not intrusive. And to show the point, let's have a look at the list. Lists are the most popular type of container, at least in my type of work, in the stuff which I'm coding. I very frequently use lists. And the problem with the list is, when you push something into the list, it will not be directly saved there. It will be copied and saved into a link container object, this gray cube. Even if you store pointers, this pointer, those eight bytes, will be copied. Not your object, but something will be copied, and it's unavoidable. And it will be copied into this link container thing, allocated on the heap every time you push into the list. And when you pop from the list, it will be deleted. So every operation with the list costs you heap operations.
Secondly, which is not so obvious but also has a performance cost, is that when you store pointers in an STL list, iteration of the list becomes slower. Because when you store pointers and you want to get into your object, to read some member in your struct, for example, you will first have to dereference the link container and then dereference your pointer to get to the member. You have two memory access operations. And they are not free. This arrow thing costs something. So we have an additional memory lookup simply because of how an STL list is implemented. What can be done about this? An intrusive list. So what we do, the basic idea, is that we add those links, next and previous links, which link the items together, into our object directly, like in the old C times. When you ask a student to implement a list, they do this. Probably they are doing it right, because we will not have heap usage here on every push and pop, since we don't need intermediate link container objects to allocate and delete. And secondly, we don't have this additional memory lookup, because to get your data, you just dereference your pointer and directly get to the data. No intermediate objects for this. The only problem with those intrusive containers is that they are quite bulky, at least in C. This is a huge pain: maintaining those next and previous pointers, head and tail of the list, and you do this every time for every type that you have and want to store in a list. This looks quite bad, and it's quite hard to reuse such code without C++ templates. With C++ templates, you can actually implement intrusive lists just once and then reuse them. On the slide, there are links to a forward list and a doubly-linked list implemented by me. On the left side, you can see how the API looks for the forward list. And on the right side, how it's used. So I have this object, something.
I simply add this next pointer in any place of my object, and I can instantly use it in intrusive lists, with the intrusive list implemented just once using templates. And this name of the link member is customizable, so you can change the name of the member as well. Then, what you can get in performance if you apply intrusiveness is shown in this benchmark, link on the slide as usual. I'm comparing a list of pointers with an intrusive list. It's a list of pointers because usually, just like I said, in my code I prefer to manage the lifetime of my objects myself. And when I have an object, I push it into the list. So I have the object before that. And when I pop it from the list, I usually keep using it for a while after that. So I don't want to copy the entire object for storing it in the list. That's why I usually store pointers. And an intrusive list stores pointers by design. So I'm comparing kind of similar cases. And what the benchmark does is measure the time of list population, how fast I push items into the list, and list walking, how much this additional memory lookup costs me. It's interesting, right? Whether this small arrow thing is even visible in any measurements. This is what you get when you switch. At least in this benchmark, right? You might not get this speedup in your case, but in this benchmark, indeed. And in my experience, it also sometimes does. I got almost three times the speedup for list population, because I no longer allocate those link containers, firstly. Secondly, you see this walking speedup is 7%, very small, almost noise. But it's not noise. It's reproducible. Every time you run this benchmark, you will see this difference, and it comes, it's not much, but it comes from this additional arrow thing. And it's not much, but that doesn't mean that you should just leave it, right? Why have this performance loss if you can avoid it? Those small things, they pile up into something bigger.
That was all the easy stuff with the heap for which we have time. We can also have a look at thread contention things. What is thread contention? It appears when you have multiple threads which try to access certain critical sections at the same time, like mutex-protected data or something like that. And when this happens too frequently, it can cripple your performance, cripple the parallelism of your code. So your code will not be as parallel as it could be. And the result could be something like: you have a 64-core machine, you enter it, you type htop, and you see two cores used, right? It's not a good situation, paying so much money and then getting this. You are not utilizing all the resources when you have thread contention, or you are utilizing them on the contention itself, not on something useful. And what can we do about this quickly? It's a first aid kit, right? So it should be something easy and quick. First thing: false sharing. Let's start with the case when you think: this is easy stuff, I know this, I am a master of contention, I don't have it, this is how I protect from contention. I placed this link on the slide with the benchmark, and the example is that I have this object with two members. One member is always accessed in one thread, the other member is always accessed in another thread. And it seems like I don't have contention, because I am not sharing any data between the threads. And I have this benchmark which does some amount of work for 10 seconds or so, which looks good enough. But if I do it like this, I get five times the speedup. By adding these 64 bytes of unused data between the members of this struct. What is the link? I increased the size of this struct and got a five-times speedup. Should I just make all my structs bigger, the bigger the better, and they will get faster? To understand the reasons behind this, you have to understand how the CPU works with the memory.
The thing is that the CPU cores in your CPU don't access main memory, the RAM, the bus, directly. They do it through this proxy thing called the CPU cache, which is, to put it simply, basically one cache per core, right? Not to dive into too much detail. And this cache thing is basically accessing the main memory for the CPU, and the CPU is reading the cache transparently. And the cache has those blocks of fixed size, which are copies of small, small parts of the main memory. And those small blocks of fixed size, 64 bytes or 128, we call them cache lines. And all works fine and fast until we get the case when multiple CPU cores for some reason start reading and writing the same cache lines. For example, at the same address, one thread is doing writes, other threads are doing reads. Then we get contention, and the CPU has to perform this very expensive synchronization of the different cores so that they store the same data for the same address. Otherwise, for the same address, different threads would see different values, right? That shouldn't happen. And this synchronization of the cores is very expensive. This is where the slowdown happens. And what could happen, and did happen in our case, is that data which was seemingly unrelated, different bytes, by bad luck just happened to be in the same cache line. And we've got contention on the cache line on the hardware level, not on the application logic level. Simply because when you work with memory, you always work with, basically, a minimal size of a single cache line. Even when you access a single bit, the entire cache line of 64 bytes containing this bit will be used by the CPU, by the cache. My fix was as simple as just adding this padding to split my data into separate cache lines. And now I no longer have contention. This is how I got the five-times speedup. This is measurable in the final RPS as well. It can be visible when you fix it. Just when you're fixing it, make sure that it makes sense.
Like I said, don't just add the 64 bytes of padding everywhere where you think you're sharing data. Add it, test if it makes sense. If it doesn't change anything, then just don't add it. It's as simple as that. What else can we do with thread contention? Have a look at memory ordering. If you have a highly loaded multi-threaded application, it's very, very likely that you also have those atomic operations in your code. Like std::atomic in C++ and the __sync and __atomic builtins in C compilers, which all do basically the same. Besides their arguments, they also take this mysterious memory order thing. There are plenty of those orders. And what they do is regulate how much freedom the CPU has about executing this instruction and the instructions around this one. Without explicit ordering, the CPU can execute your instructions in any order it wants. Even if you turn off all of the compiler optimizations and your machine code looks absolutely linear, even if you have a single thread, still those instructions inside a single thread can be completed in random order. It doesn't matter in which order you wrote them in C or C++ or whatever you're using. Example on the slide: we have those three variables, A, B, C, starting at zero, and we have one thread assigning them one, two, three, in order A, B, C. And then another thread reading them in a different order, C, B, A. It looks impossible by all logic, but it is in theory possible on some CPUs that you will get printed 3 for C and 0 for B. It looks impossible because if the second thread sees C assigned to 3, it means that it should also see B assigned, right? Because B was assigned before C. But it could happen that it will not see this, because, for example, the read of B in the second thread could be completed before the read of C. Or the write of B in thread one could be completed after the write of the variable C.
We don't have any guarantees from the hardware when we are talking about the observability of one thread's state from another thread's point of view. And if you think you are safe on x86, that you can just avoid ARM and ignore the problem, the bad news is that you still have reordering on x86. There is an example on the slide, at this link, which you can compile and run, and even on x86 it will demonstrate reordering: some instructions will complete in a logically impossible order, without any tricks, in completely predictable machine code. It happens even on x86. We won't dive into the details of every possible memory order, there isn't time, but I will show you what kind of speedup you can get if you study the memory orderings and use the correct ones. Benchmark on the slide, link as usual, and the benchmark is very simple. I have this loop, single thread. I'm not even using multiple threads here; I'm using atomics in a single thread to demonstrate the point. The loop uses std::atomic and runs in two versions. The first is the default std::atomic operation with the sequential-consistency order, memory_order_seq_cst, on the right side. It is the default when you use std::atomic and don't specify a memory order, because it is the safest and strictest order: the code works like it looks. That's why it's the default; otherwise people would have to bother even when they don't care. But it is overkill in this case; it's too expensive, and in my case the relaxed order is enough. It is actually enough in most cases: in shared pointers, for example, relaxed order is enough. And I'm just comparing this loop with relaxed versus sequential consistency. Just think of a number: how much do you think the speedup would be? Probably you're thinking zero, because if you know x86, you will tell me it will render the same machine code, that on x86 writes carry the same guarantees, that sequential consistency versus relaxed doesn't matter, that x86 is safe, right?
But I got a 16-times speedup here using the relaxed order. It was x86, it was a modern compiler, the loop was not optimized out, it was -O3, so top optimizations, and still I got a 16-times speedup of this loop. What happened here exactly? If I open the machine code, the assembly, I see that the relaxed order was compiled into a single mov instruction, while sequential consistency was compiled into this xchg instruction. The reason is that on x86 there is only one possible kind of reordering, and the sequential-consistency order protects from that type of reordering using the xchg instruction, which gives more guarantees than a mov. The problem is that in this case it wasn't needed. I simply requested stricter guarantees than I needed, and I paid a 16-times slowdown for it. In fact, at least in my entire career, I have never seen a case where the sequential order was needed. It is needed in such extreme, weird cases that I have only seen artificial examples; I have never seen it needed in actual production code. The only orders I have ever needed were relaxed, or acquire plus release, nothing else. So this is the kind of speedup you can get. For fun, go to Godbolt and try to compile the same code with a recent version of Clang; it will be even more interesting. The amount of machine code Clang produces for this simple loop simply didn't fit on the slide, which is why I didn't put it here. What else can we do about thread contention? Look at lock-free queues. In backend code it's very, very frequent that you need some sort of queue, sometimes in multiple places of your application. The usual use case is that you have multiple threads producing something for the queue, like requests: they read from the network, allocate a request, validate it, and push it into the queue. Other threads do, for example, the business logic: they take objects from the queue, process them, and then delete them, like on the slide.
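The shape of the benchmark above can be sketched like this (my reconstruction, not the speaker's exact code from the slide): the same single-threaded loop of atomic stores, once with the default seq_cst order and once relaxed. On x86 the first compiles to xchg, the second to a plain mov, which is where the large difference comes from.

```cpp
#include <atomic>
#include <cstdint>

// Same loop, two memory orders. Timing is omitted; the point is the
// requested guarantee, which the compiler turns into different instructions.

uint64_t store_seq_cst(uint64_t iters) {
    std::atomic<uint64_t> v{0};
    for (uint64_t i = 1; i <= iters; ++i)
        v.store(i);                                 // memory_order_seq_cst by default: xchg on x86
    return v.load();
}

uint64_t store_relaxed(uint64_t iters) {
    std::atomic<uint64_t> v{0};
    for (uint64_t i = 1; i <= iters; ++i)
        v.store(i, std::memory_order_relaxed);      // no ordering requested: plain mov on x86
    return v.load(std::memory_order_relaxed);
}
```

Both versions compute the same final value; only the ordering guarantee, and therefore the generated code, differs.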
How can we do this? We start simple again. If we don't have much load, then this solution is actually just fine: the queue is just a mutex-protected container from the STL, and it works fine. But if you have hundreds of thousands or millions of RPS on this queue, then you will get mutex contention here, guaranteed. What you can do about it is get rid of the mutex, and there are solutions for that, called lock-free queues, which let you have a thread-safe queue without a mutex. The problem with those queues is that there is no one queue which is best for all cases; the implementation very much depends on what kind of queue you need exactly. There are four types of queues, depending on how many threads produce into the queue and how many threads take objects from it. You also have to know whether the queue should be bounded in size in your case, and what happens when the size limit is reached. When you understand your case, you can choose one of the implementations. There are many, many implementations; I just placed a few of them on the slide, covering all the queue types. Two of them are mine. One of them is the very popular, judging by GitHub stars, cameron314 concurrentqueue. And there is also this very nice website, 1024cores.net. Who knows it? It's a very nice website which not only contains source code for various queue types, but actually explains them in simple language. You can go there and educate yourself about how those queues work and why, what lock-free is, what wait-free is, and what all those memory-ordering types are. It's all explained on that site, very understandable stuff. And like I said, don't just use a multi-producer, multi-consumer queue for everything. If you have, for instance, a single-producer, single-consumer case, that queue can be made much faster than the former.
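To make the "match the queue to the case" point concrete, here is a minimal single-producer/single-consumer bounded queue sketch, assuming exactly one pushing thread and one popping thread (the class name and capacity are made up for this example; it is not any of the implementations on the slide). Because each index is written by only one thread, acquire/release atomics replace the mutex entirely.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Lock-free SPSC ring buffer: tail_ is written only by the producer,
// head_ only by the consumer; each side reads the other's index with
// acquire and publishes its own with release.
template <typename T, size_t Capacity>
class SpscQueue {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");
    T buf_[Capacity];
    alignas(64) std::atomic<size_t> head_{0};  // consumer index (own cache line)
    alignas(64) std::atomic<size_t> tail_{0};  // producer index (own cache line)
public:
    bool push(const T& v) {                    // producer thread only
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == Capacity)
            return false;                      // full
        buf_[t & (Capacity - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish the element
        return true;
    }
    std::optional<T> pop() {                   // consumer thread only
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt;               // empty
        T v = buf_[h & (Capacity - 1)];
        head_.store(h + 1, std::memory_order_release);  // free the slot
        return v;
    }
};
```

Note this also uses only the relaxed and acquire/release orders mentioned earlier, and pads the two indices onto separate cache lines to avoid the false sharing from the first part of the talk.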
So just be careful what you choose. This is the kind of speedup you can get when you simply switch from a mutex-protected to a lock-free queue. This is a benchmark of my two queues, with multiple producer and multiple consumer threads, and those are the numbers. So this can also be visible in the final RPS of your application. Just make sure you test it before you apply it. For all the stuff I'm mentioning today, it makes sense to test first to make sure you actually need it. What else can we optimize quickly, first-aid-kit style? Networking. In backend performance, very often something like 90% of the performance comes down to how efficient your networking is, how efficiently your data is received and sent. And in cases like one connection per request, this HTTP stuff, it also matters how quickly you can accept and close clients; socket creation and closure also matter in those scenarios. A quick thing we can fix here is, for example, scatter-gather I/O; link to the benchmark on the slide. The use case is this: imagine you have multiple buffers that you want to send. Each buffer can be a separate message, or each buffer can be part of a single message, like a chunked response or something, and you want to send multiple piled-up buffers into the socket. How do you do this? The simple way: you just run a loop where you call send on every buffer, right? It works, obviously. On this benchmark, I got a speed of two and a half gigabytes per second on a local socket pair, without real networking, sending 16 one-kilobyte buffers on every send call. All works fine. But if I do it like this, I suddenly get a two-and-a-half-times speedup, and what I changed is that instead of a loop of send calls, I did a single sendmsg call. Even though the code on the left side looks bigger, it was this much faster. In practice, I have seen this switch make code 10 times faster.
It just depends on how many buffers you are trying to send at one time and of which size. In this case, 16 buffers, each one kilobyte in size, on local sockets, I got this speedup, but it can be better. Where is the speedup coming from? The thing is that on the right side I did 16 send calls, and on the left side I did a single sendmsg call. And send and sendmsg are in fact system calls, which are very, very expensive: you switch into the kernel context when you call them, and that is extremely expensive. Every system call is expensive stuff, and you should make as few of them as you can. In this case, the speedup comes exactly from that: I simply made fewer system calls, sending multiple buffers into the kernel at once. Even a single system call is many orders of magnitude more expensive than just filling this iovec array, even with something like 128 buffers. Sending more than that doesn't make sense anyway; 128, as far as I remember, is the limit in the kernel, it will not accept more. Funny thing: when you try to send more, sometimes the kernel returns an error, sometimes it just does a partial send. It can return an error like "too many buffers". Anyway, this is what I observed, at least. So the solution here: if you have multiple buffers to send, simply use sendmsg and recvmsg instead of looping over send and recv calls. And of course it only matters if you have more than one buffer; if you have just one buffer and you switch from send to sendmsg, absolutely nothing will change. Some people might already be thinking: why didn't I use the readv and writev calls? They look simpler: I don't need to fill in this message-header object, I can just pass the array of iovecs directly, right? And they work at the same speed.
The problem with those system calls, readv and writev, is that when you use them, they are accounted in the kernel as disk operations, even if you use them on a socket. I don't know why, but it is a fact. So when you use read/write calls on a socket and you check the /proc/<pid>/io file, its counters will grow even though you called those functions on sockets; they are accounted as disk operations. If you don't care about those statistics, you can use those functions; but if you care, use sendmsg and recvmsg. They are portable, available on all the Unix-like systems. So, good stuff. What else can we optimize? Event queues. It depends on the application very heavily, of course, but backend servers can be quite loaded: we can easily have tens of thousands of clients, and the same number of sockets, in one server process. And the sockets generate events: a socket can become readable or writable, receive out-of-band data from TCP, receive errors, or custom events. We need to handle all those sockets somehow, at once. There are three ways to do it, leaving aside ridiculous solutions like one thread per socket or one process per socket, which don't scale at this size. The solutions are periodic polling, reactive polling, and event queues. Those are made-up names; I just invented them myself, it's not like you can find them somewhere. We'll go through each. Periodic polling is the simplest approach. It's as simple as a loop where you iterate through all the sockets, try to read and write each one, then sleep, and then repeat. This way you don't spin in a busy loop, and you still handle all the sockets. The problem with this solution is, firstly, that you add latency, because imagine socket number N becomes readable: to get to that socket, you first have to try to read the N minus one sockets before it.
That will cost you time if you have thousands of sockets. Secondly, you lose latency because imagine a socket became readable just as you started a 100-millisecond sleep: you waste 100 milliseconds of latency for absolutely no reason. And thirdly, you waste CPU, because you are doing lots and lots of unnecessary system calls: if a socket is not readable and you do a recv on it, you just wasted a system call and some CPU time. This can be easily fixed with a couple of solutions, one of which I am presenting only so that you don't use it, because the select thing is deprecated. It gives undefined behavior on Linux if you have more than 1,024 sockets, or even if just one of the descriptors is bigger than 1,024 by value. Even the documentation advises against using it. But there is an alternative, poll, which works quite fine even these days. It takes an array of descriptors, each with an events field for the events you want to listen for. When you call poll, it blocks you until any event happens on any of those sockets, and when it returns, it has filled in the revents field of all the descriptors with the events which are available for each socket. This is approximately how it looks in code: you call poll on all your descriptors; when it returns, you have events, and you scan all the sockets, checking which socket has which events. Then you don't make those unnecessary system calls: you only read when a socket is readable and write when a socket is writable. I have this benchmark, link on the slide, where I have 5,000 clients and they send 5 million messages in total, and only a part of the clients is active at any time, which is realistic; it's not like all the sockets are active all the time. This is the kind of speedup I get when I switch from periodic polling to poll.
I got a 25% speedup instantly, and I made zero system calls that ended with EWOULDBLOCK, while periodic polling made 120 million of those system calls that were not needed. And periodic polling wasted a huge amount of CPU time because it was spinning in a busy loop. I didn't even have sleeps in this case; if I added a sleep to periodic polling here, it would get even slower. So here there were no sleeps, and it was still slower, wasting a huge amount of time on those unnecessary system calls. This is not the end; we can optimize it further with one last optimization: event queues. The idea is that instead of keeping the socket array in user space, we keep it in kernel space, and the kernel monitors our sockets all the time and notifies us when something happens. This is epoll on Linux, kqueue on Mac and the BSDs, and I/O completion ports on Windows. So the idea is that you create this event queue, you add sockets into it one by one for monitoring, specifying which events you want to monitor, and then you call this epoll_wait thing to fetch the events. When it returns, you handle the events. It's as simple as that. So instead of passing all 10,000 sockets into the kernel on each call, we just call epoll_wait and get the events. This is how it looks in code: we call epoll_wait on our queue, we get some events in return, and we handle just those events, without a full scan of the entire socket array. For example, if you have 10,000 sockets and 10 of them got events, you iterate 10 times here, not 10,000 times. The rest is the same as with poll: we read where readable, write where writable. As simple as that. If I apply this on top of poll, I get another 30% speedup, even with a single thread. You can of course optimize further, but those were the simple optimizations.
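The epoll flow described above can be sketched like this (Linux-only; the function name is made up for the example): register a descriptor once, then epoll_wait hands back only the descriptors that actually have events, so there is no full scan of the socket array.

```cpp
#include <sys/epoll.h>
#include <unistd.h>

// Register one descriptor with an epoll event queue and fetch its
// pending events without blocking.
int count_ready(int fd) {
    int ep = epoll_create1(0);                 // create the kernel-side event queue
    epoll_event ev{};
    ev.events  = EPOLLIN;                      // we want readability events
    ev.data.fd = fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);     // register once, not on every wait

    epoll_event ready[64];
    int n = epoll_wait(ep, ready, 64, 0);      // fetch pending events (timeout 0: no block)
    for (int i = 0; i < n; ++i) {
        if (ready[i].events & EPOLLIN) { /* read from ready[i].data.fd */ }
    }
    close(ep);
    return n;                                  // only the ready descriptors are reported
}
```

In a real server the epoll_ctl registration happens at accept time and epoll_wait runs in the main loop with a blocking timeout; this sketch compresses both into one function so the event count is observable.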
That was all the stuff we had time for, but there is also some additional content with eight other small, simple things you can apply in your code. They're all clickable; you can click on them after the talk, ask about them as questions right now if you like, or ask me afterwards outside. And with that, this is the end. Thanks for your attention. If anyone has any questions, then I believe we have time to take a couple. Thank you, amazing talk. Thanks. You mentioned flame graphs a few times for the cache-sharing issue; what kind of tooling do you recommend to detect those? For cache misses? The false sharing of variables through caches? Yeah, perf, for example, on Linux is able to measure that stuff. perf is also able to build nice flame graphs, by the way. Any other questions? I have a question about the first example, or I guess the second one, but still in the first chapter of your talk, about the intrusive list. It's my understanding that the standard list in C++ is also an intrusive list, so I don't think that change should do anything. std::list is intrusive? Yes, I just checked it, so it can go in the presentation. It's not intrusive. For example, when you have this std::list, right, the sign of an intrusive list is that you have the link inside your objects. For example, if you store pointers and you have a pointer to your object, you should be able to just unlink this object from the list directly, right? Just link the previous element with the next element instead of you, so you can pop the item from the list in constant time. That's when it's intrusive. With std::list, you first have to locate it: you have to iterate the list, find your pointer there, and erase it by the iterator. Is std::forward_list also not intrusive? Unfortunately, yes; in the STL we don't have it. Maybe we have it in Boost, I don't know. What's in Boost? Okay. Good stuff. Hello, thanks.
What do you think about io_uring? I haven't tried it myself in a real-life use case yet, but I have heard that it can be faster than epoll. Basically, the io_uring idea, as far as I understand, is similar to I/O completion ports on Windows: you send data directly from your buffers without copying, and I guess with io_uring it is possible to make even fewer system calls. Yeah, perhaps; it could be great. But the idea falls into the same family of event processing, so we don't do a full scan of the socket array or anything alike. Those are, by the way, obviously not cross-platform solutions in networking; I don't think we have anything cross-platform enough besides maybe poll, right? For the rest, there are things like Boost.Asio; that stuff works everywhere. If there are no other questions, then that should be it. Thank you all very much for coming.
20 Years of Open Source building XWiki and CryptPad
Okay, so hello everybody, and thank you for coming so early. For those that were not here before: since you came so early, there are a few free t-shirts if you want to take them. I'm going to talk about the story of our company, XWiki SAS, building the XWiki and CryptPad open source software over the last 20 years. First, a bit of background about myself. I discovered technology in 1984, using an Apple II; then I moved to PCs, I even moved to Windows 95, and then I graduated from a good school. And actually in that good school, at some point we had a speech telling us: you are soldiers of economic war. Of course that resonated in a young person, but later on it's like, wait, why are we doing war? That doesn't make any sense. We're not fighting other countries; we should work together with other countries. Then in 1995, I became really interested in the internet. I saw people using Mozilla and Mosaic browsers at school, and I just wanted to work on internet technology. So I took one job about the internet at Capgemini, but after a few months I was recruited by Netscape, because somebody from the team had left for Netscape, and I ended up working there three years. Who knows Netscape here? Okay, not so bad. So I was a consultant, and I also became their Mozilla fan. I even wanted to work for Mozilla.org inside the company when they launched it. They didn't take me, so I ended up working in a French startup; I wanted to stay in Europe. That startup raised money and went IPO; I actually was even a virtual millionaire, and then the internet bubble crashed, and I was not a millionaire at all anymore, just like any other IT guy.
That company was bought by a US company, and in that company I used wikis, and that's how I found wikis amazing in terms of how they bring people together and help people share knowledge. That's how in 2004 I created XWiki. I was a bit accustomed to open source through what Netscape was doing: Netscape had a highly transparent organization and a way of sharing things, it was really pushing internet protocols and standards, and then it made Mozilla open source. And I was a user of open source in my company as a CTO, installing and using Apache Software Foundation code and so on. So with wikis, we were purely coming from the open source world; it was very natural, when I wanted to create a company and create software, to create it as open source. But I was not as aware of the political aspects of open source; I was really looking at open source from the technical point of view. So that's how I started XWiki, and I'll continue from there in this presentation. Our company is now also a member of APELL, an organization of companies that do open source in Europe; in France we have the CNLL, the Hub Open Source, and OW2. I'm also on the board of Open Food Facts, a great association working on open data for food. And I'm a small shareholder of Murena, which is making an open source phone; I invite you to look at it, I find it amazing work. So what is XWiki SAS? XWiki SAS is a French and European independent company. Independent means we've been self-funded, I still own majority ownership, and the very large majority of the shares is owned by employees and some ex-employees. So it means we control the company, which is actually something that's not so easy to achieve in tech companies.
We are at around 4 million in revenue in 2023; we did 50% growth, which has been really very nice. We have 60 people, mostly in France and Romania, but also some people in Germany and even two people in Brazil. We make two open source software products, XWiki and CryptPad: one is the one we started with, that I created, and CryptPad was created inside the company in 2016. We have an international community, and we are very engaged for digital sovereignty: we think open source is very important for gaining control of software, both for states and for individuals. And we have a business model that allows us to have revenue for that software so that we can build it. This is done through services, support, and training, like anything a software company does, but trying to do it in a way that allows us to fund the open source software. So we have employees in all these countries, and what we're trying to do is enable freedom, both with the code and with the products that we make; I'll come back to that. So, two software products. XWiki is about knowledge management, sharing information. What's really interesting with wikis is that they really allow people to share and make knowledge available. We all know Wikipedia, but we do it for organizations. In that area we have competitors such as Confluence or Notion, or even the wikis of Microsoft Teams. And this is part of it: the competition, in the end, for any open source software has a high impact on how you can actually fund your work. Depending on how you compare to the competition, it can be more or less difficult to find money. We have more than 7,000 installs and more than 400 clients, and XWiki is now part of the openDesk project. If you don't know openDesk, think about looking at it; Google it. We have also done CryptPad since 2016. CryptPad is an end-to-end encrypted document editing platform. Who knows CryptPad here? Okay, good.
So I'd say its competitors are things like Google Docs, and the real point is that it protects people's privacy. So I'll go on: how did we start? The big question is, why be an entrepreneur in the end? In this talk I will try to focus more on the open source aspect of what we did, but when you talk about your company, it's also about entrepreneurship and the difficulties of just running a company. I had this wish to create things and make them happen, and that was a bit at the core of being an entrepreneur. But one really important thing was to try to do something that's useful for people and have some impact. I also wanted to do it in Europe. I had been to Silicon Valley and I didn't like the mood. The technology was great, but I didn't feel as good about the fact that people were just talking about money and how they would become rich all the time. That really made me think: okay, I don't want to spend my life in a place where that's the goal; I want to be in a place where we talk about culture or whatever. Another aspect: as an employee in companies, you sometimes feel your managers are not doing what you want them to do, or they are not fair, or the company makes decisions that you don't understand. In the end, you can complain, stay as is, and keep complaining about what the people around you are telling you. But my feeling was: instead of complaining, just try to do better. That's also a reason to become a manager or own a company, and be the one that has to take responsibility for what's happening in the group. A big aspect is believing in the product and in the purpose of the product. One of the really important things that motivated us at XWiki for 20 years is the feeling that our products are missing, or not present enough in the world, and that they're useful; they serve an important purpose.
When I started XWiki, I was a big user of wikis and a big user of task management tools. I said, okay, we could do task management tools, we could do wikis. And I directed myself towards wikis because they help share knowledge. Task management tools are a lot about efficiency, being more efficient in companies, and I felt knowledge is what we are missing more: the fact that we spread knowledge and that we educate people. In the end, this has stayed with us for 20 years. So we have a lot of wikis inside companies that help people get more knowledgeable about what they do and about the work in their own company. But we also have a few public wikis. We have the Dictionary of the History of Switzerland, a publicly funded Swiss project about the knowledge of Switzerland. We also have a wiki about a rare disease, and you don't want to look at that website too much, because it's sometimes really hard to see what the parents of the kids that have this disease live through; but it's highly useful for that community of parents living with their kids' disease. We also have wikis for public services in France, and so on. From my point of view, if you want to stay motivated about software for 20 years, you also need to really believe that your software is useful. And in 2016 we created CryptPad. We had created the technology, but we decided to make it a product because we really thought it was doing something that was missing: protecting people's privacy. Too many software products expose the data, are not built to protect the data, and CryptPad is a product that is built to protect the data. So now the problem is: if you want to make good software, and you are interested in doing it as open source, how do you fund it? There are different ways. You can just raise money; that works.
There is a lot of open source software that is built by companies that have raised money. Even now, the modern way of raising money is doing some crypto thing, launching a token and getting millions of dollars. So it can work; I'll come back to why I didn't feel it was a good approach for us. You can be an open source volunteer, and that's great. What I tried to do in this graph is measure the sustainability of each approach and how much impact it can have, and on the other side how fast you can develop things, but also the comfort of doing it. Because in the end, if you want to do this for many years: are you doing it under stress, or are you doing it having good days and being able to have a good life aside? So: being an open source volunteer, living on donations, being an independent professional, like a freelancer getting paid for doing services around open source; those are really good ways. And bootstrapping a company, which is what we did. And I feel, and this is what I want to show in this presentation, that it's a good way. You have a decent level of comfort, and you can have speed, because if you hire people, you do more. There is the saying that you can go fast alone, but if you want to go far, you need to go as a group. That's what the company allows: being a group that is funded, that has some money, and can go further together. You can also see in this presentation the acceleration that we had over time between the beginning and now. So, investors: why not? I want to take a little more time on investing; it took us a little time to realize it was not what we wanted. I came from a company that had raised VC money, and I saw that you can create momentum: you can have money, hire very highly skilled people, and build things fast. But the real thing you need to think about is that the day you take VC money, who is the real boss, and who holds the keys to the decisions in the future?
Whenever I had discussions with investors, beyond the fact that they tend to like the salespeople or the business people more than the tech people, which might be a reason for them not to give us money, for us the problem was: okay, do we agree on where we want to go? In the end, investors are in it for a return on investment, making more money with the money they put in. And as an entrepreneur, that's not what I was in it for. I was in it for the human relationship with the employees, running a project over the long term, and creating open source. When you discussed open source in France at the time, it was also quite simple: they didn't understand open source, so you had to explain it. Today it might be better: oh, open source, great, let's do open source AI in France. They love it right now and they tell you it's great. But what is their goal with it? Do they want to sustain that open source AI, for example, or do they just want to make a play to take a piece of the market, cash in at some point, and close the work? That can still create good open source, and that's fine. But if your goal is, for example, to be good to your community, to not lie to them, to not tell them you're doing open source while having a hidden agenda about how you're going to make money, that's going to be difficult. And I felt that as a CEO, if I raised money, I would start lying to my customers about what our real goal is with this open source project. Being independent allowed us not to be that. It's much slower; it was much slower. But in the end, it's more important to do it like that. In the end, money is a means, not a goal. That's really a thing to think about. So what was bootstrapping about? From 2003 to 2010, it took seven years to get to one million of revenue. It took a lot of time, including almost three years of myself not getting paid; I found some other ways.
And then any time we made a little bit of money through service, we would put it into the product. We would use it for hiring more people, growing the product and making it better. One of the great things with open source is that you can build on other people's software. That's magical, and you can really reuse a lot. That's actually what the proprietary software companies are doing: today 90% of proprietary software is actually open source software. They keep control of the last piece, trying to cash in or build some business model around your data, but 90% of it is open source. And that helps the open source companies too; we can build on that. The support of the community is huge. Service is a good way to start. Doing only service has problems over the long term, but it's a good way to begin, because you sell time and you make money, so you don't take risk with service. Another aspect that allowed us to go from zero to one million is European research money and French research tax credits. In France you have a lot of help for research. If you do something innovative, you can report it to the state and get some taxes back, so you have less cost as a company. This is more difficult if you are an association: you can get subsidies, but you won't get social charges back for doing research. And then you have European research projects; you can group with other companies. We had the chance in 2007 to join some other companies in projects and get funding through that. Over the 20 years of XWiki, I calculated that we received 10 million euros of European research grants and French project funding. And that, in the end, was our VC. Getting 10 million from a VC is quite difficult. It took 20 years to get that, but it allowed us to fund the software.
Another thing that happened in that time is that we went to Romania in 2006. It was initially through the Google Summer of Code. We had a student in Romania; we proposed some projects, he applied, and he was really great. At the time, I didn't have money to pay people a lot, and it was difficult to hire a full-time employee in France because of the cost. Romania was really an emerging country in the tech industry, with great scientific skills. And we hired some of the first people there. They all stayed: the first three that we hired are still working at XWiki today, and we now have 25 people in Romania. It was initially a cost-driven decision, with the opportunity to hire people with skills. Over time, it became a fully integrated team that also believes in open source. We hope that we had this little effect of bringing some open source to Romania, because we're one of the rare companies doing open source in our city, alongside Amazon, Microsoft and so on. Thanks to Romania, we also have a lot of women on the team. And it so happens that some couples were created at XWiki. As an entrepreneur, that makes you think about the impact you have. So that's just a graph of finances; I'm revealing ours. I'm not going to detail them, but it can show you the split. The most important data here is that when we started, we had 0% recurring revenue, and after six years, 20% recurring revenue from support. In the end, that's the goal: to increase the support revenue. It takes time. You need a great product; you need to reach maturity in the product. But it grows over time. So it's all about the strategy to make that recurring revenue grow with the users and customers of the software, whatever the goals are, whatever the type of recurring revenue. It took a lot of time, and it grew to 20%.
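The split described above is simple arithmetic; a minimal sketch (the euro figures are illustrative, not numbers from the talk):

```python
def recurring_share(recurring: float, total: float) -> float:
    """Percentage of total revenue that is recurring (e.g. support subscriptions)."""
    return 100.0 * recurring / total

# Illustrative: 200,000 EUR of support revenue out of 1,000,000 EUR total sales.
print(recurring_share(200_000, 1_000_000))  # 20.0
```

The point of tracking this ratio is that service revenue stops when you stop selling hours, while recurring revenue keeps funding the product team.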
One thing that I really want people to think about: there is no success in open source without a good product. A lot of people think the open source business model doesn't work, but in reality, the product is just not competitive. There is a huge number of products, including in the open source world. If you don't make a good product, it's not going to work. So you also need to think about the strategy to direct revenue towards the product. When you do service, that's part of the problem: you might diverge from the roadmap for making a great product, because you follow what some customers say instead of what all customers need. You need to think about that. One of the things we learned over the 20 years is that it can be a good idea to condition the service on taking support, which gives extra funding to the product and lets you dedicate people to work on the product. There are also companies, for example Nextcloud, that don't sell you service: they make you pay a bit more for the product and support, and they include the service. That's also an interesting strategy, one that raises the product revenue and really makes the company focus on the product, and the service is then used to make the product better. So you need to think about focusing the revenue on the roadmap. That's the case, for example, with the research projects. Another aspect: the community is super important. It's your marketing. It's also your insurance: customers find it reassuring that you are open source. And it's your recruitment tool: you'll find developers. We have hired so many people that came through the community. It's also very important, towards the community, to be a good open source citizen. That's also how you look, how you see whether companies are really sincere about open source.
Are they really working with the community? Is the community open? If a software project doesn't take patches, doesn't take pull requests, and is not discussing with people how the software should be, you could question their motivation to really do open source. At XWiki, for example, we don't have that many contributors, because we're moving fast on our end, and it's not fully natural that people come and give you code. It doesn't happen like that; it's a challenge to make people give you code, and it doesn't happen on all software. So the fact that the community around a product is not huge doesn't, from my point of view, necessarily mean that it's not a good open source community, because it also depends on whether people want to come. At XWiki, we have a fully open development model, but people don't automatically come running. We're using Apache Software Foundation-style rules for running the community. You can find our code, you can comment, discuss in the chat, and so on. Some companies bring their products to a foundation; that's also an approach. One thing is the relationship with the customer in open source. At the beginning, I realized that you talk about open source: look, it's great, you're going to be more free, no lock-in, etc. And when you talk to some large companies, the thing is, they don't give a shit. They don't care about this. They just want the best product at the lowest price possible: efficiency. Some people do care in the end, but you have to find them inside companies, find the people that can be sponsors of open source. Today you have OSPOs, open source program offices, in very large companies, even in the European Commission and in public service. These are the sponsors, but the majority of software buyers are looking for the best software at the lowest price. That's why you need to be competitive, to show them that you also have the best software. And there is a difficulty with the marketing of proprietary products.
There is so much marketing of proprietary products that it clouds the vision of the customers. They get stuff for free; they don't look at the long-term price evolution of software. We lived it in our competition with Confluence. Confluence recently changed its prices, but for years customers kept buying it. We knew it would happen. We knew that at some point they would cash in as much as they could, cash in on the proprietary nature of the software and the fact that they control people's data. A good thing with open source is that open source validates your product. You can go and show customers: look, we have these users; it shows that the software is good. And that works very well. We also have progress in Europe today because there is an issue of digital sovereignty. It was foreseeable, given the dominance of American companies, but politicians, European organizations and state organizations took time to take action on it. Now there is a bit of action in this area. One thing that I also learned through creating XWiki is looking at FLOSS, free/libre and open source software, as a goal. Initially it was: let's create good software, let's create a good company, let's have a good balance with employees. But what I discovered is that the goal of open source and free software is giving us freedom, giving us control over software. These are the values described by the FSFE, and they are really interesting. We discovered that, and it motivates us even more in building what we do. In the end, we had to find a balance between all these things, and these are the values that we promote internally in the company. This is what makes our company. We need to take care of our community, we need to take care of our customers, and we need to make a great product. And the great product is about the domain in which we are, knowledge and privacy, the goal that we have for software.
And we want people to be happy inside the company, and we want to do open source. These are the values that we promote internally, and the challenge for the CEO and for the group is to find the balance between these five items. For example, we can see that these are the highest-ranked reasons why people decided to join XWiki. This is recent data, not data from the past. We can see that being open source is a key reason why people want to be there, but they also want to be there because they like the product we're building. One of the key things was building on support revenue. I mentioned the recurring revenue, and it was really important to make the support revenue accelerate, to be able to gain sustainability. That's really the challenge for a company that wants to build open source over the long term. And so from 2010 to 2015, we moved from 1 million to 2 million in revenue. But most importantly, we grew from 250K to 800K of recurring revenue. In the end, that's what I look at more: how much recurring revenue we're making, because that is what funds the company. We failed at building partnerships. We hoped the product could be used to build some other products, but we found it very difficult to find the deals. In the end, we found we were better at creating a direct relationship with customers, explaining to them the open source model and what we were trying to do, and explaining the value of our product. The relationship with direct customers is key in order to build the value of your software. We also tried to build a first version of SaaS; we called it XWiki Cloud. In the end, we focused on the main product and the main product's value, which also allows some simplification. And so that's the graph, you can look at it. Recurring revenue grew to 35% over that time, and it's really great to have that.
It's not only about the percentage of recurring revenue, it's about the amount that sustains the team. Even if you stay at a percentage of 50, the extra money that you're getting from service and from research projects becomes a bonus once the recurring revenue is enough. If you reach a certain amount of recurring revenue, the rest becomes a bonus. At the beginning, it doesn't work like that: you have close to zero recurring revenue, so you don't even manage to fund a team to continuously develop the product. And one thing to keep in mind is that closed-source competition is tough. Even if you're doing something innovative, you launch something new, such as an enterprise wiki when we started, or an end-to-end encrypted tool, at some point, if there is a big market, closed-source competitors that have raised money are going to come. And they might grow faster than you for a while. They will educate the market, which is interesting for you, but they will also try to take the market and then cash in. But you can stick to your goals, stay true to them, and wait. I always tend to say: when you're number three and number one buys number two, you become number two. And when you're number two, you're the alternative to number one. And all companies want and need an alternative, for competition. I wish that open source were not just the alternative but the leader; that's not always happening and not always easy. But being the alternative is also something that helps you grow. After that, we had a challenging period. We were growing progressively, but at some point we flattened, and that was because of the competition. At the beginning, we were working mostly on innovation; we had customers interested in buying what we were doing through innovation. But at some point, we flattened. We didn't have that innovation thing anymore.
We had stronger competition, with SaaS coming in and speeding up deployment; people would just buy SaaS. Companies were less interested in the open source aspect. So we were basically flat, at 35%. What difficulties did we have? Competitiveness, SaaS, competition. Our custom work was less in demand, because there were more products doing things in a standard way. And we had to educate the market about the fact that open source is not completely free. You tend to put the priority on that when you're not making enough money; you tend to think it's just a business-model problem. It is partly the business model, but the main thing we changed in that period is that we stopped trying to think like an open source startup. We came back to thinking about the value that we had, and it was our product. What we did in the end is create a task force to transform the company, focus again on the product, make the product better, and try to convince people again that the product was good. And it worked. We looked at what was missing, what was not so good in the UI, and really put effort into it. And the thing is, when you're doing two million in revenue, you do have money to try to fix problems, and that's nice. So we relaunched a competitive offering, and we also changed a few things in the way we were selling to customers, to improve their understanding of open source so that they would give us more recurring revenue to fund more of the product. One of the things we did is reward the customers paying for the product. For example, we decided to build open source paying applications, which is quite unique. I don't believe in the open core model, where you do open source with proprietary on top of it, because it tends to push you towards doing more and more proprietary.
At XWiki, what we decided to do is paying extensions. It's similar to open core in the sense that you have to pay for them, but the code is open source; we just don't make the build available. So we have the XWiki core, completely free and of course completely open source, and we have extensions that you get as extensions and that say: pay for it. In reality, the code is fully, 100% available on GitHub. If people wanted to use the extensions for free, they could rebuild them. The thing is, people don't make the effort; it's a lot of effort for companies to do that. By adding a little bit of friction for companies adopting these extensions, it motivates them to pay; it gives people inside companies a reason to pay. The bad part is that we would like it to be completely free for individuals, but that means you would need to find some other way to make that happen. The most important thing in this strategy is that the code itself is open source. That means that over the long term it's owned by everybody, not just by us; we cannot be the sole owners of that code over the long term. So this is a part where open source is not free, and it needs to be explained. We tend to think that everything needs to be free, but you cannot pay people if everything is free. If you want to build something, you have this difficulty. In 2016, we launched CryptPad. And that was another experience, because we launched a second product inside the company. It had some useful aspects. It recreated innovation in the company, and it helped us win other research projects, because research projects are highly linked to innovation. So we had a second batch of innovation inside the company. It also helped the image of the company; it made us more known by individuals. And then: oh, you're doing XWiki too? Among people that know XWiki and CryptPad, how many didn't know it was the same company doing both? I don't know. Who knows XWiki?
Who knows CryptPad? Anyway, then 2020 happens: COVID. That's a crisis. One thing to think about as an entrepreneur: always be ready for a crisis. It will happen if you stay long enough. We had the subprime crisis in 2009; in 2020 we had COVID. The thing is, we were more ready than a lot of companies, because we were already remote-friendly. Everybody was allowed to do two days of remote work in the company; we just changed it to: do whatever you want, just work. And everybody worked from home. We had the tools; everything was already adapted to working with remote tools. That's one of the magic aspects of open source tools and the open source development model. We had the knowledge, we had the knowledge tools. It also gave a boost to CryptPad, because CryptPad was used in education. In Germany, for example, we had incredible usage of CryptPad over a couple of weeks during COVID. But as a company, of course, it creates a bit of a scare: what will happen, will customers go away, will there be a worldwide financial crisis for years? In the end, we went through it. One of the things that COVID showed is a challenge for European digital sovereignty. Politicians realized that supply chains were a problem and that there were risks there, and this turned attention towards digital sovereignty and software. For a few years now, we've seen that there is interest in this area. But the most important thing that happened for us is Atlassian changing their business model, saying that people should move to their cloud and stop running the software inside their own companies, and closing the smaller offers. They decided that in November 2020. I don't know how they found that COVID was a good time to add stress to their users, but they did, and their customers didn't like it, because we received a lot of mail asking: okay, what is the way to replace Atlassian Confluence with XWiki?
And so we spent time improving our migrators. We were not necessarily surprised; we were surprised by the extent of the change that they made and what they did to their customers, but from our point of view, it was something that would happen, at least progressively. When investor-backed companies want to cash in, that is the time for SMEs and open source companies to really propose an alternative that is more sustainable over the long term. Open source is more sustainable for other people than proprietary software is. So for us, it brought some maturity. It raised our revenue to 3.3 million in sales in 2023, 50% growth, as I said at the beginning of the presentation, and in the end 1.6 million of recurring revenue, with 30% growth on the recurring revenue. That has been huge for the company and for allowing us to build more software. This is a graph where you can see the last three years; pretty nice. When you are in 2020 and everything's flat, you feel a bit depressed, like it's not going well. But then there are three great years behind it. So nothing is ever a given; you can always turn things around. Not only because of Atlassian, or thanks to Atlassian, but also because we won a project around digital sovereignty in France and Germany, and the software was recognized. And what about the future? Everybody talks about AI. For a knowledge company, it's a real question, so we need to think about it. One of the things AI is doing right now is putting the question of open source on the table again. We saw a lot of big companies, as I said, not caring about open source; politicians not caring about open source. With AI, it's the first time the president of France said the words "open source", which we in the industry had wanted him to do for years, saying that it was important. And he said it for AI. Okay, what will it change? We'll see. But at least it raised again the question of transparency, of the control of code and data.
And that's something positive for the future. But you also need to get prepared, because it changes a lot of things. The architecture for running AI is complicated; it's much harder to run it on premise, so you need to find solutions for that. We're working on AI at XWiki. We have an extension, and we also won a research project to build a search engine using AI. And I would like to point out the approach of Nextcloud with ethical AI; we are completely aligned with that aspect. You cannot do AI today without thinking about whether it's ethical, whether it protects data or not. One big aspect that I think is really important for the future is software modularity and integrations. We believe at XWiki that the future of open source software is allowing software to be assembled together and making better reuse possible. I said at the beginning that when we started XWiki, we reused a lot of open source software. Well, if we want our software to survive in the open source world, we also need to make sure that it can be reused more. This is why we've launched a new product, which we call XWiki Cristal. It's going to be a new modular UI that will not only work with XWiki, but can work with other wikis and can be integrated into other tools. The other thing is that we're part of the openDesk project, a funded project in Germany to make an open source suite of collaborative products, and we're very happy to be part of it. And the other aspect is doing with CryptPad what we did with XWiki. I showed the financials of XWiki, the company, which included both XWiki and CryptPad. But what's really interesting, when you run a second product inside a company, is how the other product looks. And it looks a lot like the XWiki product at the beginning: only 20% recurring revenue. And it's difficult to build that recurring revenue. So if you love CryptPad, we are very happy.
We've been able to double the size of the team, as you can see in the funding over the last two years. In 2023 we doubled the size of the team, and for 2024 too. But it's only 20% recurring revenue, and that means we don't have the sustainability yet. If you look, the blue and red parts are our recurring revenue: subscriptions to CryptPad.fr and donations. You can actually help us build sustainable revenue by promoting CryptPad, helping us find more users and customers, and you can also help with donations. Any software needs to reach that sustainability through recurring revenue; that's really the challenge. Finally, giving back. First, we give our software, because it's open source; we give our code as a company. But we also think it's important to give back to the other open source projects we use. We wish large companies would do that. Large companies that use open source for free a lot, or proprietary software companies today that are building on open source, should give back something to all the projects they use. We decided to create a fund of 1% of our recurring revenue to give back to the projects that we use. We have a three-year backlog, so we're going to give almost 30K to the different projects we use. We're going to give, for example, to the Matrix Foundation, to Mastodon, and to lots of other tools that we're using. And we're going to continue to participate in industry organizations to help make this known. The conclusion is that none of this would have happened without the team itself. We have a team of 60 people, and more than 200 people have worked on XWiki over 20 years. Kudos to them, because you cannot do this without all the people that worked on it. At XWiki, for example, we have seven people that have worked 15 years at XWiki, and 15 that have worked 10 years.
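The 1% fund described above is easy to compute; a minimal sketch with illustrative yearly figures (only the 1% rate and the rough 30K total come from the talk, the per-year amounts are invented for the example):

```python
def fund_contribution(recurring_revenue_by_year, rate=0.01):
    """Total to give back: a fixed share of each year's recurring revenue."""
    return sum(year * rate for year in recurring_revenue_by_year)

# Illustrative: roughly 1M EUR of recurring revenue per year over a
# three-year backlog comes to about 30K EUR to redistribute.
print(fund_contribution([900_000, 1_000_000, 1_100_000]))  # 30000.0
```

Pegging the fund to recurring revenue rather than total sales means the giving scales with exactly the revenue stream the talk argues is sustainable.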
And this is not necessarily easy to achieve for a group of 60. We have funding difficulties all the time. If you want to join, we have jobs. And nothing of what we did would have been possible without the help of European projects, French projects, BPI France, Europe, NLnet. If you don't know the NGI program, the funding you can get for open source from NLnet, go look at it. It can help you fund your project. That's it. If you have any questions, I'm available. Any questions? No? I guess people are still settling in. I have a question if nobody has one. I was wondering how the ride was between building a company and having a community. Was there any conflict about what to put in the product and what not to put in the product: let's hide this away so that people pay for it, let's give this for free? How was this dynamic in the building of XWiki? Yeah, that's the difficult part: what do you do as a paying module, and what do you do for free? Well, first, really keep an open community. That is very important, and really keep everything open source even if you have paying stuff, so that people can look at the code and discuss. For the choice of features, we try to direct the paying ones as much as possible towards what the bigger companies need most, not necessarily the individuals or the smaller companies, because in the end, it's mostly the bigger companies that have the funding for enterprise software. And it sounds weird, but the bigger companies are often not paying for it. I think Matrix has a talk just after this, and I know they will talk about the fact that there are huge deployments of Matrix with zero money, and some smaller deployments that are giving significant money. The larger companies that are massively using open source need to participate in it.
And so we direct the specific features they need: for example, audit logs, which big companies need for compliance reasons. But LDAP authentication or SSO, for example, is tougher to hold back, because it's a security feature that's really important to make software more secure. That has been a difficulty for us. So we made the Active Directory integration a paying application, but plain LDAP configuration is still available in XWiki, documented in the open source documentation. If they want the simple configuration with Microsoft Active Directory, they pay for the application, and we have sold a few of them. Hello, first of all, very nice talk. Thank you. What would you have done differently on the XWiki journey? Oh, that's a good question. Well, the little strategies to make people understand open source better, for example making people pay more for service if they didn't take support, I would have done earlier. The paying applications maybe earlier too, though I'm not so sure, because initially you need to build a community first, and you need to build competitiveness, so it's kind of difficult. Also, maybe not doing the four products on top of XWiki with partners, though at the same time they gave us some money. So maybe doing less service sometimes, and more product. These are the things I would have done differently: basically applying the playbook of how you can fund the product, or trying to do it earlier. And once we learned it, I made another presentation about the different methods we found for funding open source software; I would also point to curl, which has great experience in how to fund the work on curl. Any other questions? Nope, okay. Thank you, Ludovic. Thank you.
You too could have made curl!
I'd better not touch anything anymore. Okay, nine minutes off. Okay, cool. Hi. Technical stuff, right. Let's start this. I am Daniel. I work on curl all day. I work for wolfSSL; I do curl stuff all day. I am going to talk a lot about curl. I always talk a lot about curl, and today as well. I don't think I am going to present a lot of new things here. You are going to hear me reproduce and repeat things you already know. But cliches are cliches for a reason; I am going to let you know that some of them are actually true, at least from my point of view. I have worked on curl for a long time. It runs in a few things these days. You probably cannot walk very far without using curl, knowingly or not. It is in a lot of different devices, things, services. And since a few years back, on more than one planet as well. Right? A favorite slide of mine; I need to squeeze it in, I am sorry. A few years ago I also got this gold medal from the Swedish king for my work on curl. But not a single gold medal since then; it is kind of a disappointment. Anyway, these days we estimate there are roughly 20 billion installations of curl. Quite a few. We don't actually know that it is 20 billion; it is open source, we don't know exactly. But there cannot be many other open source programs, or software anywhere, that run in many more instances, I am pretty sure. A pretty decent thing, I think. But you know, it really didn't start out like that. It has taken quite a while, because our project, curl, of course started somewhere. And it was a long, slow effort, a long journey from something that was really not very good to what it is today, which could possibly be good. So in November 1996, a long time ago, I turned 26. Fun. And I started with a little project. It was a very silly toy: 160 lines of code, just a few screenfuls. And what do you do with that?
You start playing with it, you make it into something. You start fiddling with it. And you know: start small, do what you want to do, give it a lot of time and have fun. That's how you start an open source project. You have an itch, you start scratching. And as long as it is fun, why not work on it? In my case, I worked on it for about two years, and I actually renamed it curl in 1998. It started with another name, but that's a long story. Anyway, two years later, December 1998: what an awesome success, 300 downloads of my software. I have this screenshot from the website that I had back then, because I think it's a cool reminder that actually getting 300 downloads of your software is pretty cool. That's way more than all your friends, all those who just downloaded it because they know you. It actually started to reach out, and that's cool. It is cool with 300 downloads, even compared to 20 billion today. And I also want to emphasize that this was two years later, right? Two years later, 300 downloads. Yay. At that rate, in 20 years we could have 3,000. So yeah, keep working at it. Finding your goal, or finding a project to work on, is of course a good thing, right? It's fun, so work on it. And maybe, sure, you want to make it easy for others to help. But you can be sure that the world is drowning in open source projects and good ideas. It's not a problem to find good ideas; it's not a problem to find open source projects. But how do you actually get anyone else interested in your little project? Because you think it's fun and interesting and it serves a purpose. Probably they won't. Probably you're just going to have to realize that it's you and your project alone for a while, until it's proven to be something. So as long as it's fun, why not keep at it, and spend the time, because it's not going to be an immediate success. Very few things are an immediate success. So yeah, spend time on it.
People often ask me what I've mostly done on curl. And I think what I have mostly done on curl is spend time, right? 1996, I started this. And also learn to use the time. I told you, I was 26 when I started this. I didn't have any kids. I've had kids since then, and they have grown up pretty big since then too. We all have lives, families, other things than just open source, right? But how do you actually get time to spend on your projects? In many cases, you maybe need to do a little less of something else, or sleep a little bit less, or whatever. If you really want to get somewhere — and as I say, you need to spend time on your project to get somewhere — maybe you have to do a little bit less of something else, right? And people sometimes don't believe me when I say that I never ever play computer games. That's just an easy thing to rip out of your life to save hours and spend them on your open source project instead. I mean, you can cut down on sleep as well, and I do that too, but that has its downsides as well. Just accept the fact that for long periods of time, you might just be the only person, right? Of course you make it easy for anyone to contribute — you lower the bars and accept pull requests and everything — but there are many open source projects out there and we're all competing for the same developers, right? And all those developers, they also play computer games. They watch TV, they have families, they have other priorities in life before your open source project. But I can spend time on my project. I can control, at least to some degree, what I spend time on. So sure, just accept the fact that, yeah, I make pull requests in my own project, right? I put them up there, someone can comment on them, someone might review them — but if they don't, I go ahead and merge and continue with the next one.
Because in the end, it doesn't really matter. Looking back at your project, no one cares whether I started my project 10 years ago, 15 years ago, or two years ago, as long as the project is good, it's there, it fulfills its purpose. So in a way, time doesn't really matter in the end. And of course, for reaching somewhere, for accomplishing something with your project, there is really no silver bullet here. There is just engineering. And there's just open source stuff that we all know how to do, that we've all been doing for a long time. There is just hard work and keeping at it. And of course, having fun — because if you're not having fun while doing it, you probably won't endure. So in the curl project, right, it started in 1996. The number of lines of code was basically zero. I actually started the project with someone else's code, so I didn't write those first 160 lines of code; I became the maintainer a few months later. And then the journey started. So now we're at 160k something. And yes, a fascinatingly linear growth too. Kind of unbelievable. So yeah, I'm just saying that if you keep at it, things might develop. And making sure that others can contribute is of course crucially important. That's why it's open source: we want to enable others to contribute, even if many times maybe they don't, but there's still that opportunity, right, that availability. And if you're doing things right, and you happen to be accepted by others, maybe someone will contribute. And now everyone is looking at that bump in 2005 and thinks: what happened? It's quite boring. I actually just went back and filled in names that I had missed from the list before. So the bump is actually not supposed to be there — it's just my script counting the number of names in the list. So over time you might get a lot of help, if you're successful enough. But success is obviously not a given, right? There are a lot of open source projects.
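A contributor graph like the one described — counting names in a list over time — takes only a few lines of script. This is a hedged Python sketch, not curl's actual script: the `(year, author)` records are invented for illustration, while a real run would feed in the project's commit or THANKS history.

```python
# Count cumulative unique contributors per year from (year, author) records.
# These records are invented for illustration; the real input would come
# from the project's commit history or contributor list.
records = [
    (1998, "daniel"), (1999, "alice"), (1999, "bob"),
    (2000, "alice"), (2001, "carol"), (2001, "dave"),
]

seen, cumulative = set(), {}
for year, name in sorted(records):
    seen.add(name)
    cumulative[year] = len(seen)  # total distinct names seen up to this year

for year in sorted(cumulative):
    print(year, cumulative[year])
```

Going back and adding previously missed names to older entries shifts every later cumulative count upward at once, which is exactly the kind of artificial "bump" described in the talk.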
I mean, more are added every day, right? So there are hundreds of thousands — just look at GitHub or wherever. We're drowning in open source projects. And it's certainly not a guarantee that whatever we do is going to be a success, going to be popular or anything. But maybe you didn't give it enough time, you didn't spend the effort. I get a lot of questions, or people say: yeah, I spent a lot of time on my project, I worked on it for several months and nobody used it. So sure, if you don't spend enough time, if you don't polish it enough, maybe it doesn't stick out among all the others, right? So maybe you actually have to spend more time to get somewhere. And it needs to be fun. But whatever you do, and whatever anyone does, there will be times when you run into something that wasn't supposed to happen, like security problems or whatever. It's bound to happen to anyone who's doing software, maybe more to some than to others. But still, everyone makes mistakes. It doesn't matter how long we've been doing this or how much we have done. As long as we keep developing, we keep changing things, there will be mistakes, and mistakes will lead to security problems every once in a while. In curl, it looks like this: the green bars are when we fix security problems, the red ones are when we introduced them, because I tracked them down. So of course, we introduced them before we fixed them. Anyway, what I mean is: yes, we work really hard to make sure that we don't introduce bugs, we don't introduce security problems, but you can be sure that they will creep in anyway, because it's tough. And you all know that, right? Nothing new here. But what do you do? You just own your mistakes, because they are going to happen, and try to learn from them — which I think is really, really hard, right?
Because every time you get a security problem, it feels like: this is a one-off, we should never have done this stupid thing. But try to learn from it, adapt, move on, add more tests, and make sure that we at least don't reproduce the exact same problem again in the future. And yeah, I've still done that several times actually; it's kind of stupid. And keep having fun, because if it's not fun, you're not going to spend all that time on it. And no one else is going to do it either. Everyone makes mistakes, and it's really a matter of how you handle the mistakes. It's not the amount of mistakes or how critical they are, but how you take care of them, how you take care of the people who actually made the mistakes. In my case, it's easy to take care of the people, because almost all of them were my mistakes. And there's no denying that it's soul-crushing when you have your software in 20 billion installations and you have one of these things that you know can end up really, really bad for the users. Yes, that can make it a little bit harder to go to sleep at night. But again, we all make mistakes. We try to learn from them and move on, right? And in our case — in pretty much everyone's case — we just have to do what we can do, right? Engineering. We write readable code. You should be able to understand the code, in any language. Whenever you read code, it should be understandable; if you can't understand the code, it's the wrong code, right? And you document everything, clearly and a lot. Another thing with working on stuff for a long time is that you have a long time to write the documentation as well, ideally. And a lot of tests too, because the more time, the more tests. And you analyze your code, of course: you throw every tool at it and make sure that the tools don't complain about your code.
And then, when you have fulfilled all these steps and you know it's pretty decent, you can throw fuzzing at it. In our case, I also like to offer a bug bounty, because I'm fortunate enough to have someone who pays for it. So we offer a lot of money to people who can point out security problems. And yes, then you get a lot of bogus crap as well — sort of, yeah, there's a security problem — but you also get a lot of quality people spending a lot of time and effort actually trying to find security flaws. So in my experience, this works really well. It's a pretty cheap way to get a lot of help finding your most stupid mistakes. But okay, there might be other people involved in open source sometimes. You're not alone all the time. And really, over time, you learn that code is easy, right? Code is easy: you can just debug it, try again, write a new algorithm. But the people, they are never easy. People are where the challenges are. And the longer you work in an open source project, the more you maintain, you know that the challenges — what you face on a day-to-day basis — are the problems with communicating, with talking to people from different areas of life, cultures, languages and everything. And you can be sure that they are going to be less than friendly at times. So over time, we do less and less coding and more and more interfacing with humans and other things, as maintainers, right? And negative feedback is sort of the default. It's a little bit depressing, but you know, as long as things work — sure, 20 billion installations, no one says a single word; it works, cool. And someone finds a little bug somewhere, and you can be sure that that is what you are told about, especially if it appears stupid or silly, because then someone is very upset that surely it should have worked, since you've been working on this for so long.
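Fuzzing, as mentioned, just means hammering your code with arbitrary inputs and checking that nothing crashes. As a minimal illustration — this is not curl's actual setup, and `parse_header` is an invented stand-in for whatever code is under test — a naive random fuzz loop looks like this:

```python
import random

def parse_header(data: bytes) -> bool:
    """Invented stand-in for the parser under test.
    It must never raise on arbitrary input, only accept or reject it."""
    return len(data) >= 4 and data[:4] == b"HDR:"

random.seed(0)  # a fixed seed makes failing inputs reproducible
for _ in range(10_000):
    blob = bytes(random.randrange(256) for _ in range(random.randrange(32)))
    parse_header(blob)  # an uncaught exception here would be a bug to fix
print("survived 10000 random inputs")
```

Real-world fuzzers such as libFuzzer or AFL are coverage-guided and far more effective than this blind loop, but the contract is the same: the target must tolerate any input without crashing.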
So that is, of course — and I know you all know this — the default. You basically never hear when things are good, because that's the default: everyone assumes everything is good all the time. When something is bad, you get told about it. So people often ask me what the difference is between curl back in the day, with 2,000 lines of code and 300 users, and today, with 20 billion installations. There's really no difference, because in the little development community, people raise their bugs, they complain, they have problems. All the ones for whom it works, they shut up, they are somewhere else. So it doesn't really look different today. And there are lessons in what you do when you realize, over time, that contributors rarely stick around. In curl, I have lowered the bars and the friction for new contributors a lot, I think. So we get a lot of contributors, even for fixing a spelling error or typos in a comment somewhere. People contribute that. And I think any contribution is a good contribution. Even if you just fix a typo that makes something hard to read — yes, it's an improvement, so I accept it. But do the contributors stick around? Today we have, I think, 1,240 authors who have written code commits in curl. That's an amazing number of people. Over 65% of them did it once and never again. And I don't think I'm unique in that, and I don't think it's special. I think it's more like that's how people work, right? They show up, they find a problem, they submit a fix, and they move on to something else. Because their primary interest is not helping my project; they just found a problem, fixed it and moved on. And sure, that's okay for them, and it's okay for me too — it's just the realization that most people who show up will show up a few times, maybe, if you're lucky, and then never again. And maybe every once in a while, of course, you get a new contributor who will actually stick around for a long time and contribute a lot.
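The arithmetic behind that one-time-contributor claim is quick to check. Using the figures quoted in the talk (1,240 authors, "over 65%" one-time, rounded here):

```python
authors = 1240          # total commit authors quoted in the talk
one_time_share = 0.65   # "over 65% committed once and never again"

one_timers = round(authors * one_time_share)
repeat_contributors = authors - one_timers
print(one_timers, repeat_contributors)  # → 806 434
```

So roughly 800 people contributed exactly one commit, leaving only about 400 who ever came back — which is why the handful who do stick around matter so much.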
And you will be happy about those. And of course, there actually is the reverse too, right? There are a lot of newcomers. Someone you've never heard of, never saw before in your life, shows up suddenly one day with an amazing patch, showing that they understand everything. And you can be amazed that someone just shows up on your doorstep one day with a perfect understanding of your architecture and design style and code style and everything. So open source is open and ready for surprises in every direction. And that's part of the fun, right? Less fun, perhaps, is that sometimes, when you're a little bit public about things, things can go in the other direction. This email from, well, soon three years ago, was actually the first one that hit me — hit me like this. So, my email address is in the curl license. And the curl license then appears in a lot of products. And this person quite clearly had been attacked in some way and saw some traces of curl in some leftovers somewhere. And that was obviously my fault. He had lost his family, his job and everything. A completely confused person, but it was all my fault. That was tough. But okay — a fun thing with open source: the term was coined in 1998, actually the month before I renamed my project to curl. So open source and curl have been going hand in hand for a while now — still just 25 years. We did open source before we called it open source, too, because we worked exactly the same way; we just didn't use the term then. Back then we mostly talked about free software, but there was a little bit more confusion about what that actually was. Anyway, today it is much easier to do open source, because everyone knows about open source. If you approach a developer today, working in any field, people actually know what open source is.
Back in 1998 or '96, no one knew about open source in general. It was just a niche clique of weirdos. And today everyone is using open source, right? There's not a single project, a single user, a single developer who doesn't use open source at least to some extent, willingly or not. It's just going to be there. And we all work with open source in ways that we simply did not 25 years ago. And there are so, so many more contributors to open source today than before, right? There are literally millions and millions of possible contributors today. Back in 1998, there were not millions and millions of contributors. In 1998, the total internet population was, I think, estimated at something like 40 million. That's basically the number of open source developers today, right? And of course, there are many, many more maintainers of open source today than there ever were before. So there are also a lot of equals among us, right? I can talk to you today as someone who maintains open source, and a lot of you are open source maintainers too — I don't even have to pretend. So there are a lot of good things. It is of course a much easier and much better place to do open source today than ever before. And I think it's going to be much easier and better going forward as well, because all of this is just going to improve. We're just going to do more open source, and it's way, way easier to do open source today too, thanks to infrastructure, tooling, funding, whatever. I think we're in for a bright future. But anyway, I've done this and worked on a single project for so long, and people ask me: don't you ever get bored? The same project for 27, 28 years? Yes, of course I get bored. Everyone gets bored every once in a while, right? Lack of motivation — how fun is it to work on the same thing all the time? Of course the motivation comes and goes. That goes for everyone. It's just a natural part of life, right?
Whatever you do, there will be periods in your life when you don't feel that same "yes, it's going to be great to work on this documentation today again". Sometimes you just have to do something else, spend more time with your family. In my case, I like to move around, do something silly in some less important part of the code, or do slightly less curl for a while. I've just come to realize that lack of motivation is a natural thing. It's an endless cycle: sometimes it goes away and then comes back, and it doesn't really matter as long as you let it play out. And maintainer overload — that is one of those things that is very commonly brought up, right? If you're that single person and you feel that a lot of users are depending on your work, maybe you sometimes work a little more than you should. And I think this is a real problem, and it can affect us for real. But it is important to separate yourself from your project, of course. I'm not sure I always manage that, but I do try. And there's a little bit of this: if your code runs in a lot of places, can you really ever be sure, when you release a version, that it's not going to bring down half the internet? I don't know. I think you just have to deal with it. In my case, I think I'm actually pretty good with this, because I feel that we have enough tests, enough eyeballs, enough people involved that — crossing my fingers — it might not happen too often at least. So I think it works really well, in my case at least. But I want to emphasize, and I think this is true for many people, that the thing about impostor syndrome is that it doesn't really ever go away. It doesn't matter if you have those 20 billion installations, you can still experience periods of: do I even know this? Who am I to tell them how things work? I mean, come on, this protocol doesn't actually work like this.
But one of my skills, I think, when it comes to doing open source, is to make sure to use the time slots you get. I have had that a lot: you have a family, you have a life, you have friends, but sometimes you have 20 minutes for yourself. Can you spend those 20 minutes on your open source project? I've become very good at that. If I get 20 minutes here and 20 minutes there, that's actually 40 minutes. And I'm not complaining that I need an hour to get prepared first, because then I would never do anything at all. And I don't split my attention between a lot of other tiny things. Sure, I do a lot of other projects as well, but I give them much less attention. And again, time might feel important sometimes, but it really isn't. In most cases, it doesn't matter if you're done today or tomorrow or next week or the week after that. Who cares, right? Sure, it's not in this release, but you're going to do another release soon again anyway. And down the line, it didn't matter whether you were done last week or next week. So, let it take some more time. And of course, I'm a true believer in release early, release often, so that everyone has a chance to get your latest code as soon as possible, because it just makes maintaining everything easier, and contributors have a much easier time actually working on your latest code. So reduce contributor friction to get people to help out, and have fun. Of course, we need to remember that we're all different. I can stand here and say how I work, but I'm sure that you all have objections and say: yeah, that doesn't work for me, it doesn't work in my case. Because spare time — I'm talking about working on open source in your spare time; in my case, I work on open source during work hours and spare time hours, so that sort of maximizes it. But working on anything in your spare time is of course a luxury, right?
If you're working on something in your spare time, maybe someone else in your family is doing the laundry or cooking or taking care of the kids or whatever. So of course, that's a luxury. If you are in that position, it's a luxury; I don't deny that. In many cases, you don't have that luxury, and then of course it's much harder. And there's an unequal privilege here, right? If you're rich enough to do this, you can do this. If you have to work two jobs and take care of the rest of your extended family, maybe you can't. You just have to be aware that of course it's a luxury. We're all different, we're all unique. And of course, what is success? I considered 300 downloads a success in 1998. We all have a different way to measure success, right? You don't have to have 20 billion installations. It's fine if all your friends are happy with your tool and you just have fun. That's also success. In my case, I have mentioned already that my email address is in the curl license. This gives me an excellent opportunity to learn about people's agonies in life — like when they don't know how to install the GPS in their car, they email me and ask. And you can imagine the amount of anger in this user: he couldn't install the GPS, he's been scrolling through that open source license screen in his car, found an email — I'm going to email this person. So I get a lot of car questions. And then you learn: so, my email is apparently in a lot of cars, and people have problems with cars. I have no idea. But not only cars, actually; I learn about other things too. This is usually the best way I have to actually learn about where people are using curl. Often I don't even understand the question and I have to Google it: what are you talking about? Oh, great, are they using curl too? It's confusing. I sort of stopped replying to them, because... you know, at first, when I started...
You want to help out, you want to be friendly — someone asks you questions, obviously completely lost. "No, sir, this is how it works. I just wrote a little component." "No, no, no, no, that's not how it works. Just come and help me fix this car now." So I have this example, a great one. It's a little bit convoluted, but I'll explain. I got an email from a woman. She said her Instagram account was hacked. So why are you asking me about that? Sad for you, okay. But she showed me the proof that I'm involved: Instagram, and my name. So I should just head over and talk to the guys there and tell them to help her with her account that had been hacked. And I told her: cool, they're using curl — that's my code in there, right? And I tried to explain the concept of open source: I never talked to these people, I didn't know they use curl. For me, it was like a revelation. Cool, Instagram, right? That's like a billion installs suddenly. She didn't really see it the same way. She emailed me back: she found my name again in her phone. Exhibit two. It cannot be a coincidence — your name cannot be twice in my phone for any good reason, right? So she threatened to contact them and tell them that I'm hacking Instagram and Spotify. I don't know if she actually did, so maybe they don't know this yet. Now I'm exposing myself. So what I'm trying to say here is: I'm not special. I didn't do anything genius. I've just been working on this for a long time. I just had an idea, I think it's fun, so this is what I do. And I think this is the best you can do. I wanted a tool to do internet transfers. It does a little bit more these days than it did in the beginning. And I endured. I kept going at it because I didn't know anything else and didn't know better. I think it's fun. And keep polishing: if you spend a lot of time on something, it can actually become pretty good.
And make it possible for others to contribute if they want to, and you can just hope and wish that they will contribute. In my case, they did, to a pretty high degree. And this is really the most fun I can imagine. Yes, I'm living the dream: I work on my spare time project full time and get paid for it. What else can you ask for? So, it's that easy — I think you can do it too. And pretty much that's what I wanted to tell you. I've written about these things a little bit before, in this book-like thing, if you want to read more about my thoughts on this topic. So, thank you. I'm done. APPLAUSE I think we have a few minutes for questions. If you have a question, raise your arm and someone will run over with the mic. There's a question. The mic will come flying. "Hi, thanks for the talk. You mentioned that you have lots of contributors nowadays. How do you deal with their PRs, basically? I was wondering two things. One is how nitpicky you are — basically, based on your experience, how nitpicky can you be without discouraging people from contributing? Like being overly pedantic in comments and stuff like that?" I'm having a little bit of a hard time hearing your question. "So, it's regarding how nitpicky you are in your PR reviews. How pedantic can you be without discouraging people from contributing to such an important piece of software? Do you tend to just let things through, or are you very strict — and do you still get lots of contributors even though you're strict in your reviews? Because I guess when you get a diverse set of contributors, it can happen that lots of people have different coding styles and different levels of detail that they go into, code comments and stuff." I don't think I have any general rule there. I try to... Sometimes there are contributors who are clearly newcomers, maybe struggling with the language or the culture or everything.
And of course, I try to be a little bit more welcoming then, maybe more forgiving, and to help out. But it also depends a little bit on load and everything. Usually people — no matter the culture, the language, anything — understand code, and following code styles, and making sure the test cases work, and everything like that. So usually I don't have to consider that to any greater extent. "Okay, that's interesting. Most people are developers, they understand this from the beginning. The other part of the question was similar, but regarding documentation — documentation in the code, comments. Have you seen code being over-documented, and has that helped you, or do you not do it?" Over-documented — that's a rare thing. Well, "over" already means it's too much. "But you know, you could go overboard and..." You can, but in my experience that is very rare. Sure, we can have a discussion: you mention this in a comment, but then the code below says exactly the same thing — assign A to 2. Maybe you don't have to say that in a comment, and then we just have a discussion. But I think that's very rare, actually. Usually it's in the other direction: maybe you could explain in a little comment here why this is happening, and not just have a huge blob of code. "Right, right. I guess what I'm referring to is when you have such a long history in your software, and you want to leave traces of some design choices and why some things were implemented one way rather than another, because other people — especially one-time contributors — are not going to have enough context. So I'm just asking about your style: do you try to leave traces of context, like, this was done this way because of this reason or that reason, please do not change it, and stuff like that?"
Sometimes, but it's hard to leave traces like that for history, because things change. Leaving traces like that also risks that you leave traces of your former design, of former decisions that maybe were not enlightened enough. So I don't make a concerted effort to do that, because everything is in git anyway. We can always go back and look at the history if we want to. Was there any question left, or should I shut up? "I have a couple of questions. One is: how much time were you spending on the project before being able to work on it full-time?" Sorry, can you repeat that a little louder? "Yeah, sorry. How much time were you spending on the project before you were working on it full-time?" I have a long-standing tradition in my family that I spend every night on curl. When the rest of the family goes to bed, I stay up for another two hours working on curl. I've done that since 1996, basically. That's two hours every day, every week, every month, every year, for 27 years. Now I've just added my full-time work as well, so instead of two hours per day, it's now 10 hours on work days. "Do you delegate maintenance?" Sorry, again? "Do you delegate maintenance of your project to someone else, or do you maintain everything yourself? Because there's a lot of maintenance overhead." Well, I'm the lead developer here, but I'm not the sole maintainer. We're a whole team. There are a lot of people, apart from me, who can merge code, and who do. I just do the bulk of it because I'm the only one who works on it full-time. I do much more than they do, but if I'm at a conference the whole weekend, someone else can still merge code while I'm away or absent. So there's a whole team, actually.
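A rough back-of-the-envelope for that nightly routine, using only the figures quoted in the answer (two hours a night, 27 years; this ignores the later full-time hours):

```python
hours_per_day = 2   # the nightly curl session quoted in the talk
years = 27          # "since 1996, basically"

total_hours = hours_per_day * 365 * years
print(total_hours)       # → 19710
print(total_hours / 24)  # the same time expressed as full 24-hour days
```

That is nearly 20,000 hours of evening work before counting any paid, full-time curl time at all.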
SCION, hitting the future Internet road: Next-generation Internet ecosystem and burgeoning opportunities
Thank you. My pleasure to introduce the next speakers, Jordy and Tillman, who are going to be speaking about SCION, the next-generation internet. Can we get a round of applause please? Thank you. Hello everyone. Thanks for attending the talk, and thanks to the hosts for having us here. I feel a little bit like a rockstar right now. We are Till and Jordy. We both come from ETH Zurich; we are part of the network security group at ETH Zurich, and we are also part of the SCION open source implementation team. First question: who has heard about SCION before? Okay, some people — you can skip the introduction and the overview. For the rest, I will start by introducing what SCION is. SCION is a clean-slate design of an inter-domain network architecture that considers security from the design stage, to achieve security properties: mainly availability, but also transparency and control, reliability and scalability. I want to make clear here that SCION is about the inter-domain network, so it doesn't have anything to do with intra-domain protocols or higher-level protocols in that sense. The other thing I want to highlight on this slide is that SCION is an open source project. Here you have the GitHub repo where you can find the reference implementation of SCION — Till will give more details afterwards — and there you can also find references to documentation and related material. The second question is: why does SCION even exist? SCION comes as an alternative to our old friend, the BGP/IP Internet. This was created even before I was born, so imagine how much things have changed since. SCION has the distinct aspect that it incorporates those security aspects I mentioned before from the very inception. Why do we need this? Because we need a network that provides availability even in the presence of malicious actors, because there are people interested in harming inter-domain routing.
We can find recent examples — for instance, an outage caused to a Spanish ISP by a BGP attack. And unfortunately we have several malicious actors on the Internet, from nation-state actors to cybercriminal groups, who are interested in causing harm, for different reasons: from political reasons to economic incentives, you name it. And the trust-based nature of the current routing architecture sometimes doesn't make it clear enough where the trust boundaries are. So, probably some of you are hungry — maybe not hungry enough to just run away and grab lunch. I cannot offer you actual food, but I can offer you some yummy desserts: towards the end of the presentation we will give a couple of demos. One is a browsing demo using SCION, the second one is a first-person shooter running over SCION, and finally we will walk you through steps and guidelines for developers who hopefully find this interesting and want to contribute, or just use what's there so far. But first, some overview of SCION. The whole SCION ecosystem includes entities from different domains, from research institutions to ISPs, to vendors and integrators and users of the system. This whole ecosystem is nurtured by the SCION Association, which is a non-profit organization responsible for the standardization of the SCION protocols. They have published three or four IETF drafts and are pushing them towards RFCs. They are also responsible for managing the open source implementation. Here I try to summarize SCION in five distinct aspects. The first one is that SCION is a path-aware internet architecture, meaning that end hosts are presented with path information about the network, and they can make a choice of which path they use to send their traffic. The second aspect is that SCION designs and implements a scalable trust infrastructure.
I will go into a little more detail on the next slide. It also designs and implements scalable path discovery, basically in the control plane, to achieve rapid global connectivity. Another aspect is its multi-path nature: as I said before, end hosts are presented with several paths that they can use, even simultaneously. And finally, another aspect I would like to highlight is that there is already real-world deployment of SCION, which I will show you towards the middle or end of the presentation. First, some terminology. My idea here is that you get the gist of it so all of this isn't completely abstract. The first term: SCION organizes trust into so-called isolation domains, or ISDs for short. These trust domains, as the name indicates, isolate trust; they are nothing more than groups of autonomous systems (ASes) that share a common trust root configuration (TRC). They basically agree on a set of routing policies that they want to use. The other term here is the core ASes, which are the ones in charge of managing, meaning updating, those TRCs and so on. They also provide peering with other ISDs. So basically ISDs isolate trust; this is an important point I want to emphasize. They isolate trust, not any other kind of isolation. The other part of the overview is the control plane. Here I will explain briefly what the control plane and path dissemination look like. Again, this is an overview, of course; this is full of details, and you will find them in the documentation and in references to the books and several papers we have about this. The routing information is disseminated through the network in so-called beacons, which are these colored squares here.
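The ISD and TRC terminology above can be sketched as a tiny data model: ASes grouped into an isolation domain that shares one trust root configuration, with core ASes flagged as the ones that manage it. All type and field names here are illustrative, not the real SCION data structures.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative model only: an ISD is a group of ASes sharing one trust root
// configuration (TRC); core ASes manage the TRC and peer with other ISDs.
record TrustRootConfig(int version, Set<String> rootCertificates) {}

record AutonomousSystem(String asId, boolean core) {}

record IsolationDomain(int isdNumber, TrustRootConfig trc, List<AutonomousSystem> members) {
    // The core ASes are the subset in charge of updating the shared TRC.
    List<AutonomousSystem> coreAses() {
        return members.stream().filter(AutonomousSystem::core).collect(Collectors.toList());
    }
}
```

The point of the model is only that trust is scoped: every member of the ISD validates against the same TRC, and only the core ASes may change it.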
Those beacons are initiated by the core ASes I mentioned before, and they are propagated farther down the network in the local ISD as well as between core ASes. These beacons are authenticated and extended at every hop, every hop meaning every AS on the path, and each AS decides how to extend them based only on its local policies. So on the slide you see that they are disseminated, and in the end you have path information. Once those beacons have been disseminated, at the very moment they reach an AS (and here we can focus on the green ones, for example) they are already usable. There is no need for convergence in that sense; this piece of information is directly usable. This helps to achieve rapid path exploration and scalability. As I said before, this is just a quick overview because I don't want to overload you with details, but there is exhaustive evaluation of this scalability aspect of the control plane. Another aspect I mentioned before is the multi-path nature of SCION. From the end-host perspective, end hosts retrieve path information from their local AS: they request this path information and receive several paths they can use simultaneously. The path servers in the local AS provide this information to the end host. This is different from source routing; here the end host directly retrieves the paths from its local AS. Having many or several paths allows applications to optimize for different metrics: they might find some of those paths better in terms of latency, others better in terms of bandwidth, and they can hopefully find the one that best suits the application's or end host's needs.
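The beaconing described above can be sketched as a toy simulation: a beacon originates at a core AS, and every on-path AS appends (and, in real SCION, cryptographically authenticates) its own hop entry, so a received beacon already describes a usable path segment with no convergence step. This is a minimal sketch with invented names, not the actual beacon format.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of SCION-style beaconing (illustrative only): a beacon starts at
// a core AS and each AS on the path appends its own hop entry according to
// local policy. In real SCION each extension is also authenticated (MAC/signature).
class Beacon {
    private final List<String> hops = new ArrayList<>();

    // Beacons are originated by core ASes.
    static Beacon originate(String coreAs) {
        Beacon b = new Beacon();
        b.hops.add(coreAs);
        return b;
    }

    // Each AS decides, based on its local policy, whether and how to extend.
    Beacon extendAt(String as) {
        Beacon next = new Beacon();
        next.hops.addAll(hops);
        next.hops.add(as);
        return next;
    }

    // The accumulated hop list is directly usable as a path segment.
    List<String> segment() { return List.copyOf(hops); }
}
```

A beacon received three hops away from the core already encodes a complete, immediately usable segment back to that core AS, which is why no routing convergence is needed.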
Just to put some numbers on it: in the current production network, the real fabric we are building, if you take two ASes you will find from dozens to even hundreds of paths that can be used to reach the other endpoint. The last slide of this overview covers the control plane and data plane. The control plane is what I explained on the previous slides, and the data plane is what I will try to explain now. As I said before, end hosts retrieve those path segments from the local servers, and they combine them to create a path. You can find here two examples of paths: segments are combined into one path for packet one and into a different path for packet two. Once the end host has encapsulated this information into the packet, it sends it out to the network. Routers forward packets based on the path information: they inspect this path information, which says which one is the next hop, and then routers can simply forward to that next hop. This allows for simple routers and stateless operation. As you can see here, those packets may belong to the same application. The end host in this case sends the packets using two different paths, which here are even disjoint. This can be useful if, for example, an application has a control channel: it can use a low-latency path for the control channel and a higher-bandwidth path for the actual application data. Okay, now I also want to convey the idea that there is already some tangible stuff, so it's not only a research project. Of course there is a lot of research in it, but there is real deployment and engineering in SCION already. For that I will describe these two networks. The first one is the global SCION internet, the real production network in that sense, the real fabric.
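The data-plane behavior described above, where the end host embeds the hop sequence in the packet and routers statelessly follow it, can be sketched like this. The types are illustrative, not the real packet format (which also carries per-hop authenticators).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Toy sketch of SCION-style forwarding: the end host puts the full hop
// sequence into the packet, and each router simply takes the next hop from
// the packet itself -- no routing-table lookups, no per-flow state.
class Packet {
    final Deque<String> remainingHops;
    final String payload;

    Packet(List<String> path, String payload) {
        this.remainingHops = new ArrayDeque<>(path);
        this.payload = payload;
    }
}

class Router {
    // A stateless forwarding step: inspect the packet's own path information
    // and return whatever hop it names next (null once the path is consumed).
    static String forward(Packet p) {
        return p.remainingHops.poll();
    }
}
```

Because the decision is read from the packet rather than computed, two packets of the same application can travel fully disjoint paths, as in the control-channel versus data-channel example above.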
I will briefly introduce some concrete ISDs, those colored bubbles I showed at the beginning. Then I will also talk briefly about the SCIONLab testbed network, which is a completely separate network; it is an overlay network that anyone can use, and I will give more details afterwards. In general, this production network, again, is not an overlay network: it's the real fabric, and it's BGP-free. It's currently deployed by several international ISPs; here you have some logos, you don't have to look at them. Currently there are over 100 ASes, distributed across Switzerland, where you find quite a few of them, and also in the EU, in North America and in Asia. The other thing about the production network is that SCION cloud-based access has also recently been enabled; this is a commercial offering. But just so you know, if anyone happens to have cloud deployments, they can also access the production network. Okay, this is one of the examples, one ISD, again one of those colored bubbles I showed you at the beginning. This is the education and research ISD, SCIERA being its fancy name. Here you find universities; you can see some of them listed. This is a growing ISD, so it's not closed: more universities may come, and some are interested in joining. There are also other research institutions and research and education networks that provide connectivity between those research entities. Here is a world map showing roughly how they are distributed around the world. Then, very briefly, there are also industry use cases right now; in Switzerland there are the two that I'm going to introduce. The first one is the Secure Swiss Finance Network: they are basically using SCION, and they are going to phase out the Finance IPNet that they are currently using.
That will happen by June this year, and by then the network will have around 120 participants. The other example, similar to the Secure Swiss Finance Network, is the Secure Swiss Healthcare Network, which provides a similar service for health professionals. So that was the real production network, and this is SCIONLab, the research network. SCIONLab is a globally distributed testbed for conducting experiments and test deployments. Anyone can join, so anyone in the audience can join this network just by downloading a virtual machine. With a Vagrantfile you run vagrant up, and then you have your node attached to one of these transit nodes. All the boxes shown are transit nodes; leaf nodes are not shown here. The names may be a little unreadable, but the different boxes are located in different parts of the world: for example Korea, North America, and also Europe. Tilmann will also give some pointers afterwards to where you can find the information for joining SCIONLab. We also have the awesome-scion list, a compilation of projects related to SCION. There are infrastructure projects: people implementing SCION on Tofino and other routers, firewalls supporting SCION, and other infrastructure-related projects. We also have application projects, such as the SCION-enabled browser extension for Chromium, and a SCION-aware Quake 3 video game client. And of course we also have the libraries: pointers to the reference implementation again, to network APIs, and to client and end-host stacks in different languages. We have Go, then the client libraries for Java that Tilmann will explain in more detail shortly, client and end-host libraries in Rust, and bindings to other languages like C++ and Python.
The list also includes useful tools: SCION is integrated into the SEED emulator, so if you are using the SEED emulator you can also bring up your own SCION network. There are also scapy libraries for packet generation and Wireshark plugins for packet capturing. Here is the first demo I want to show you. I will just switch to the video. Okay. This is the SCION browser demo; first of all, this uses the production network, so you will see browsing on the production network. As I just said, this extension is part of those awesome projects using SCION. You load a SCION-enabled website, in this case for example the ETH website, and the extension provides some information about the resources and where they were loaded from. Here you see that the resources from the ETH domain were loaded via SCION: green indicates SCION and red indicates a fallback to BGP/IP. Of course this is configurable, and you could choose not to fall back at all if resources are not available over SCION. Here you have some path information. You see that we stay within the Swiss ISD: my client in this case is in the Swiss ISD, and we stay there because the server happens to be located in the same ISD. Then an example of navigating to another ISD: in this case we navigate to the Magdeburg university server, and here we see that all resources are loaded via SCION, again with some path information. We go from the Swiss ISD, where my client is, through the SCIERA education network that I presented before. Here you find the exact AS numbers that the traffic traverses. Then I type yet another example, and again we get more path information. This resource also happens to be located in the same ISD; the path does traverse different ASes, but that's not the important part.
So basically that's it. This last part just shows again that we fall back to BGP/IP, which I already explained. The second demo I wanted to show is the Quake 3 demo. This demo uses the SCIONLab testbed network, not the production network. Our client is located in a node at ETH in Zurich, and we connect to the server, which is located in Magdeburg, Germany. We connect to the server, and okay, what you will see here is commands being typed. The piece of information I want to convey is this showpaths command, which is SCION-specific. The showpaths command shows all available paths from the client to the server, so you get a bunch of paths, and we will see a bit more of them. The other thing is that for demonstration purposes we bind a key to the next-path command, so we can iterate and see how different paths provide different latencies. We have this keyboard shortcut, and while we are playing you will see the effect in the top left of the screen, not right now but while we are playing. This path, for example, had 100 milliseconds of latency, and we start playing. Now, as I was saying, you can see in the top-left corner that we have switched paths: we just press the key shortcut and iterate over the set of available paths. This is for demonstration purposes, to show that we can find paths with different latencies. We keep iterating and see the latency change in the top-left part of the screen. Hopefully this will stop now on the last frame, and for this specific path we get this latency. This is interactive, but of course you can program it and adapt your application to have a path selection algorithm that does this automatically and always takes the path with the best latency. And yes, that was it.
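The automatic version of the demo's manual path switching can be sketched as follows: measure the latency of each candidate path and always pick the best one. Here the measurements are passed in as a map for the sake of a self-contained example; a real client would probe each path itself, for instance with SCMP echo. Names are illustrative.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of automatic lowest-latency path selection: given candidate paths
// and a latency measurement per path, pick the path with the smallest value.
// Unmeasured paths are treated as worst-case so they are never preferred.
class LatencyPathSelector {
    static String bestPath(List<String> paths, Map<String, Integer> measuredLatencyMs) {
        return paths.stream()
                .min(Comparator.comparingInt(
                        (String p) -> measuredLatencyMs.getOrDefault(p, Integer.MAX_VALUE)))
                .orElseThrow();
    }
}
```

Run periodically, this replaces the demo's keyboard shortcut: whenever a re-probe shows a lower-latency path, the application switches to it without user interaction.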
I will now hand the floor to Tilmann to explain the rest of the presentation. Okay, thank you Jordi. So now let's imagine you found all this SCION stuff very interesting and you want to implement your own project. How would you go about that? The first step would be to go through this awesome-scion list that Jordi presented earlier, where you can find existing projects but also language libraries to connect to the SCION network. These are probably the most important ones for a new project. The first one here is the Go API; that's the reference implementation, the original implementation of SCION. It's the most comprehensive one: it contains everything, including border routers, control servers, and everything you need to run SCION completely. It also comes with language bindings for C, C++ and Python. More recently we have a Rust API, 100% written in Rust, and just released a few days ago we have an alpha version of the Java API. That's actually what I'm going to talk about in the next few slides, because that's the project I'm involved in, the Java API. It's written in 100% pure Java. It is very similar to the DatagramChannel that people may know from Java, with a few exceptions. DatagramSocket is currently not implemented, but that's pretty much the next thing on our to-do list, especially since I realized that a lot of existing older projects rely on DatagramSocket instead of DatagramChannel. The library also has an API for path inspection. This is pretty much what SCION is all about: you get a lot of paths from your AS and you select the best path for your purpose. So path inspection and selection is essential. It also supports the SCMP protocol; SCMP is like ICMP for SCION, so you have echo (ping) and traceroute commands available. So let's look at a very basic Java client.
This is a DatagramChannel example, and basically there's nothing special to see here, because it looks exactly like using a normal DatagramChannel. The one thing to bear in mind is that, for example, the host name ethz.ch needs to be a SCION-enabled host, otherwise you can't run the example, and your local machine also needs to be somehow connected to the SCION network. Let's look at a slightly more interesting example. There's an additional method (there are several, but this is one that may be interesting) called setPathPolicy. What you can do, of course, is go through all the paths that you get from your local ISP, your local AS, and pick the one you want to use. But it's much easier if you can just define a path policy, in this case maximum bandwidth. You set that path policy on your channel, and the channel will always try to find a path that satisfies it. Now we're going to look at the server side. It's a little different from the native Java implementation in the sense that receive doesn't return an InetSocketAddress but a path object. The path object contains the InetSocketAddress of the client that connected to the server, but it also contains the whole path that the packet actually took through the internet, and you can use this path to send a response back to the client. The idea in SCION is that if you send a packet to a server, the server would usually send the reply back along exactly the same route. Technically you don't have to do that, but it makes things a lot easier for the server, because the server doesn't have to look up paths to reach the client; it's just much faster. I mentioned path policies before. The Java library comes with some predefined path policies. They are somewhat self-explanatory, but one path policy is called first: it just picks the first path that your AS gives you when you ask for a path to a certain destination.
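What a policy such as maximum bandwidth does conceptually can be sketched in a few lines: out of the candidate paths from the local AS, keep the one whose bottleneck link advertises the most bandwidth. This is an illustrative sketch of the idea, not the real Java API's internals, and the types here are invented.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative "maximum bandwidth" path policy: each candidate path carries
// metadata about its bottleneck (minimum) link bandwidth, and the policy
// simply keeps the candidate with the largest bottleneck.
record CandidatePath(String id, long minLinkBandwidthBps) {}

class MaxBandwidthPolicy {
    static CandidatePath choose(List<CandidatePath> candidates) {
        return candidates.stream()
                .max(Comparator.comparingLong(CandidatePath::minLinkBandwidthBps))
                .orElseThrow();
    }
}
```

Setting such a policy on a channel means this selection runs automatically whenever the channel needs a path, instead of the application iterating over paths by hand.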
That's kind of a cheap way for the AS to recommend you a path: they think this is the path you should use. The next one is min hop, which tries to find the shortest path, the one with the fewest hops, a hop being a border router of the ASes you have to go through. Then there are min latency and max bandwidth, which also pretty much do what you would expect them to do, except that these implementations are non-parameterized and static: they just rely on metadata. When you ask your AS for a path, you get metadata back that estimates the latency and gives you the allocated bandwidth for the links in the path. If you want the really best latency, you would need to implement a new filter (we may also provide that in the future) that looks at all the paths, pings over each of them, and then selects the one with the lowest latency. At the bottom of the list we have the ISD allow and ISD disallow filters, ISD being the isolation domain number that we saw previously, that is, a whole set of ASes. Isolation domains can map to countries, for example, or to something like the university network that we saw earlier. So ISD allow and disallow can be used to implement something like geofencing: since ISDs can represent countries, you can decide that you don't want your packets to go through a certain country. Imagine you're in the bottom-left ISD, number 110, in one of its ASes, and you want to send to ISD 130. There are a lot of paths: some direct, some going via 99, and some via 125 and 120. For some reason you don't like ISD 99, which could be a country, or just some other organization. Then you can define your path policy like that. The exact syntax is a little different (I simplified it here), but it's pretty much that.
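The geofencing idea from the ISD 99 example can be sketched as a filter over candidate paths: drop every path that traverses a disallowed isolation domain. Again, these are illustrative types, not the library's real filter API.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative ISD-disallow filter: a path records which ISDs it traverses,
// and the filter keeps only paths that avoid every disallowed ISD.
record GeoPath(String id, List<Integer> isdsOnPath) {}

class IsdDisallowFilter {
    static List<GeoPath> filter(List<GeoPath> paths, Set<Integer> disallowedIsds) {
        return paths.stream()
                .filter(p -> p.isdsOnPath().stream().noneMatch(disallowedIsds::contains))
                .collect(Collectors.toList());
    }
}
```

With ISD 99 disallowed, the direct paths and the ones via 125 and 120 survive, while every path through 99 is discarded before any further selection runs.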
So you can just exclude 99, and the filter will pick any path that doesn't go via 99. Once you've written your application, the next step is testing. The common way to test SCION is to run a local network on your own machine. You can do that using the reference implementation mentioned earlier, the scionproto reference implementation. First you define a topology in a topology file, then you run this command, which creates a lot of configuration files for all the border routers, control servers, daemons, and whatever else needs to be started on your machine. Then you can view the topology if you like. We have a very simple topology here with three ASes, the three ellipses, and one core AS, the one at the top; they all reside in the same ISD. That's just a simple example. There are also a number of topologies already in the repository, so you don't need to write your own if you don't want to. Then you just run the topology, which starts up all the different processes for the different border routers and control services. They're all connected via loopback devices, and then you can connect your local application to the network. In this case I just run a ping to the core AS, and that's the result. There are other methods for testing. There's, for example, the SEED emulator that Jordi already mentioned, which supports SCION. Then there's SCIONLab, the worldwide network of SCION nodes. If you want to use it, you can go to the website, register, and allocate your own ASes if you want. Then, as mentioned previously, you download an image for a virtual machine, and the virtual machine is like an AS that you run locally. You can even create several of those and build a network. Finally, you can test in the production network, but that requires you to actually have access to the production network.
If you're lucky, your ISP supports that, but there are currently not that many. AWS offers nodes with SCION access, so you can rent an AWS cloud server or something. Or maybe your university has access to the SCIERA network. Finally, for debugging there are a lot of command-line tools. I mentioned ping and traceroute before, and there are showpaths and several others. There's also a very neat Wireshark plugin, so you can look at SCION packets, inspect the header, and look at the path associated with the header. And if you want to contribute, there are tons of projects that could be done. You could start your own project: we are still missing native libraries for C and C++, for example, and there are no libraries at the moment for C# or Swift. You could think about embedded or mobile devices. Also network protocols: for example, the Java implementation currently only supports UDP. We aim to support QUIC and probably TCP very soon, but there are many other protocols that would need support. Or you can take one of the big existing projects, such as web proxies, HTTP servers, video conferencing clients like Jitsi, or gaming libraries, and try to make them SCION-aware, so you can select paths, or automatically select good paths, in these projects. Finally, if you want help or support, there's a SCION Slack channel and a Matrix channel, and since last week we also have a SCION tag on Stack Overflow, so you can tag your question with SCION; some developers are subscribed to the tag and will try to answer your questions. That's everything from my side. Thank you, and looking forward to some questions. Thank you for a great presentation. My question regards security and protection against DoS attacks. You allow everybody to select a path for packets; how do you protect against someone doing that maliciously, for example sending packets back and forth between ASes to overload the network?
Yes, so the question was, I think, how do you prevent DoS attacks, or how do you prevent people abusing paths, for example to create loops. These paths that you can see here are all signed, so that makes it essentially impossible to create your own path: the paths are all signed by all the ASes along the way. That also makes it a bit easier to prevent DoS attacks, because you know where a packet came from, and if you don't like that region you can quite simply block everything that comes from it. Thanks. Yes. The next question is somewhat in the same vein: how do you deal with the resilience of the network? The internet is very resilient because the routers can take independent decisions, but if you select a path as a user and a link goes down, then that information has to disseminate all the way back to the user so they can select a new route and have their link up again, instead of that just happening transparently to the user. Okay, the point here is that normally you send the packet out to the network and routers send it to the next hop based only on the destination information. So when, for example, some link fails in the middle, the network needs to converge to a stable state: after the failure, what is now the next router I have to send the packet to? This takes some time, and by that time your packet may have timed out already, and you still need to resend it. With SCION, you can already detect that the packet is taking long, or that you aren't getting feedback, and you can immediately take, for example, a completely disjoint path. In that sense, this failover mechanism is quite effective.
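The failover idea just described, switching to a path that shares no interior hop with the failed one, can be sketched as follows. This is an illustrative sketch assuming paths are hop lists with the same first and last element (source and destination); it is not the library's actual failover logic.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of end-host failover: when the current path times out, pick a
// candidate whose interior hops are fully disjoint from the current path's,
// so a single failed link cannot affect both paths.
class FailoverSelector {
    static List<String> disjointAlternative(List<String> currentPath,
                                            List<List<String>> candidates) {
        // Interior hops exclude the fixed source and destination endpoints.
        Set<String> used = new HashSet<>(currentPath.subList(1, currentPath.size() - 1));
        for (List<String> c : candidates) {
            Set<String> interior = new HashSet<>(c.subList(1, c.size() - 1));
            interior.retainAll(used);
            if (interior.isEmpty() && !c.equals(currentPath)) {
                return c;  // first fully disjoint alternative
            }
        }
        return null;  // no fully disjoint path available
    }
}
```

Because the end host already holds the alternative paths locally, this switch happens immediately on a timeout, without waiting for any network-wide convergence.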
Normally, when a link is busy, you also get network feedback: you see that packets are taking longer and latency is increasing, so as a user you are interested in taking a healthier path, if I can say that, and you will automatically switch to that path. I have a question. You showed in your API example that when you create a connection, you specify the route. But suppose a country changes and you don't want your packet to go through a certain country anymore; then you have to change your software. Currently, when I make a connection, I only specify the destination and I don't care how it gets there. Now the information of how it gets there is mixed with where it should go. So when the route changes, I need to change my client software, if I understand correctly. I hope I understood the question. The routers become very simple because they don't need to make any decisions anymore; the only thing they could do is verify that the path is a signed path. So yes, you do have to update your clients. The clients all need new libraries. I have a quite high-level library with Java, but it could also be in the operating system, just another driver that sits underneath UDP or TCP and adds the SCION path transparently. But yes, that's kind of the big work: we have to provide updates for the clients. Let me add my five cents. There are also transition mechanisms. We have, for example, the so-called SCION-IP Gateway: traditional IP applications in a certain subnet send their traffic to this SCION-IP Gateway, and the gateway encapsulates the traffic into the SCION network. So with this transition mechanism you don't need to change your application. Of course, the application is then not getting the best properties in that sense.
For example, you as the application would then not be choosing your path and optimizing for those metrics, or some of them. But you can still let this traffic go to this specific gateway, and the gateway will decide for you, maybe depending on local policies, where to send the traffic. I don't mean the one-time conversion; of course, if you switch to a new network protocol, you have to change your software. It's more the dynamic stuff: now my provider decides that some AS is out of the loop, but I have to deal with that as a client. Yes, but you can easily default to whatever paths your provider gives you. You can always pick the default and just not care about the rest. But on top of that, you have the choice as an end host to decide where you want to send your traffic. If for your use case it's not important whether your traffic goes through, I don't know, any country you name, then you can just fall back to the default paths, and then you don't have to make this decision. Is path selection always from the client side, or can the server decide which paths are acceptable? Usually the client connects to the path server in its AS to get a selection of paths and uses those. Those paths are carried to the server, and the server could in theory look up a different path back to the client, but that feels a bit ineffective, so it can just reverse the path; the path is automatically reversed in this API when you send the packet back. Just adding something else: there are some projects, for example, that try to add negotiation, so the server can signal or indicate to the client what it considers the best path to choose, but this is separate from vanilla SCION, I think. It is something you could put in place. I have a few questions, but they should be quick to answer.
I've read about this thing called the secure backbone autonomous system, SBAS, where you advertise better routes to the existing BGP infrastructure. Is this in wide deployment, is it popular, is it being used? Also, have there been any experiments with Wi-Fi; does SCION work well in a wireless context? And in the Quake example I didn't see an IP address in the demo, I saw some other sort of address. What is that address? It's like the ISD or something. I got the first and the last questions, so I will try to answer those first. The first was about SBAS, right? SBAS is also an incremental deployment model, an architecture which is basically a hybrid between BGP and SCION. The very basic idea, and I'm not the one involved in the project, is that BGP prefixes are announced close to this backbone, then you use this backbone as a secure backbone, and then you go out to the internet again, hopefully close to the destination. The specific question was about the current deployment: it is under deployment. Some members of our team are making efforts there, and I would say it should soonish be usable in production; it's not quite there yet, but it's getting there. The last question was about the address format, right? Yes, the address format that you saw is composed of the ISD, the colored bubble, and the AS, the individual AS of that ISD. So these two numbers, plus the end-host address. The end-host address has scope within the autonomous system, so you could use basically any address you want, but the scope of this end-host address is specific to the autonomous system, which is indicated by the ISD plus AS numbers. There was one more question about wireless: is there any experiments with wireless, has anyone done anything? We will now be starting projects to support SCION on Android, and there we are going to go deeper into that.
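The address structure described in the answer above (an ISD number, an AS number scoped to that ISD, and an AS-local host address) can be sketched with a small parser for the common textual form, such as "64-2:0:9,10.0.0.1". This is an illustrative sketch of the structure, not the library's own address class.

```java
// Illustrative parser for a SCION-style textual address "ISD-AS,host":
// the ISD number, the AS number (scoped to that ISD), and the end-host
// address, whose scope is local to that autonomous system.
record ScionAddress(int isd, String as, String host) {
    static ScionAddress parse(String s) {
        int comma = s.indexOf(',');
        String isdAs = s.substring(0, comma);
        int dash = isdAs.indexOf('-');
        return new ScionAddress(
                Integer.parseInt(isdAs.substring(0, dash)),  // isolation domain
                isdAs.substring(dash + 1),                   // AS within that ISD
                s.substring(comma + 1));                     // AS-local host address
    }
}
```

Because the host part is only meaningful inside its AS, two different ASes may both contain a host "10.0.0.1" without any conflict; the ISD-AS prefix is what disambiguates them globally.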
But of course, if anyone is interested in providing wireless support and optimizing SCION for that use case, you are more than welcome. Thank you for the talk. You mentioned earlier in the presentation that there is a way to update which ASes, which nodes, can be trusted with the further routing. My question is: how does that work? Who decides which nodes can act as ASes and which ones cannot, and what about those bigger bubbles, the ISDs or whatever they're called? How do you decide which nodes act as ASes, and can you dynamically update it? If I understood the question correctly, it's about who decides which ASes are trustable or not. What SCION brings is this possibility for the sender, in this case the end host within the AS. As I said before, your AS will provide you with a set of paths, and then, based on your local policies as an end host, you apply those policies to this set of paths, and you end up with a subset of paths that you consider good for your use case. This delegates some responsibility, but this is good. As I answered before, you can always fall back to any default path and be kind of agnostic about where your packet is going; the main benefit is simply having a choice. ISDs basically represent jurisdictions. You can think of them, for example, as the Swiss ISD; we will have other countries' ISDs, or regions, or, as in this case, this group of university institutions. So you may think: for my use case, I want my traffic to go only through research institutions, because I'm deploying this thing from, say, my home country. Then I can basically steer or implement that policy, and out of the full set of available paths I will use only those.
Of course this will depend on your application; for certain things you may be fine with other paths. So I don't know if I... The initial trust roots, they are agreed upon, so basically... Can we take this offline so we can get the next speaker, please? Yeah, I mean, I can answer you offline because they need to... The hallway track, it's a thing. Okay, thank you for the talk, thank you. Thank you.
Open Food Facts: Acting on the health and environmental impacts of the food system
Welcome everybody. We're going to start the next session now. It's my pleasure to introduce Pierre Slamich, who will be speaking on Open Food Facts: acting on the health and environmental impacts of the food system. Hello everyone. I just have a quick question. Have any of you in the room used NutriScore to choose food products, by raise of hand? Okay. So you'll see that Open Food Facts has played a little part in getting NutriScore out. So let's start and let's dive right in. There's a lot on the menu. For those who don't know Open Food Facts, I'll briefly introduce it. I'll have a section on what's new in the project this year, what's cooking for next year, and we'll also be able to do Q&A, probably outdoors. So, about Open Food Facts: it's a project that we started 10 years ago. It's an NGO, and it tries to answer: how do you choose the best product in the supermarket? There's a lot of information and it's not legible. I've never been able to understand the nutrition table; it's abstract to me. A long ingredient list as well. And yet food has a massive impact on public health. To give you an idea, obesity and overweight wipe out 3% of our GDP due to the cost of treating them. And the same goes for the planet: one quarter of carbon emissions comes from food. So the idea of Open Food Facts is to empower users and contributors to have an impact on their own health, on the environment, and on the health system at large. Our slogan, if you will, is: don't panic, but organize. Crowdsourcing, mobile crowdsourcing, is a way to do that. And if Wikipedia was able to build the largest encyclopedia on the planet, and OpenStreetMap the largest map, why not build the largest database of food products on the planet? Today, 10 years in, we have 3 million products from over 160 countries. The main sources are crowdsourcing, so you and me using mobile, but also the food industry, which has started to realize that transparency wins in the end.
So the mobile app of Open Food Facts allows you to choose products that are good for you and the planet. You scan barcodes, you get NutriScore and EcoScore. You also have a personal scan for those of you who have food allergies or want to go vegan; it will help you on that journey. It's of course privacy preserving, privacy by design: we don't require any login. And if you don't have NutriScore in your country yet, you can get it on any product in a couple of seconds: you answer a few questions in the app and you get the scores instantaneously. So you can take your health to the next level with NutriScore, which is about nutritional quality, and NOVA, which is about food ultra-processing. So avoid NOVA 4 products as much as you possibly can. We also do additives and labels, and we make all of that simple to understand. With NutriScore, we started computing it in 2015, when it was just a scientific paper called the five-colour score. And now we compute it in every country, including Mexico and the United States. Everyone can get it, even if a producer doesn't want you to get it, because we recompute it. We created an ecosystem around it. And the nice thing is that, as you have all experienced, it's now in supermarkets in Europe. It's still not compulsory, though. And producers are beginning to improve their products. We also show EcoScore, which is about the planet. Same principle: we use something called life cycle analysis, which is a very precise analysis of food products. So it's an average, and then on top of the average we make the computation more precise for the product using specific data. With EcoScore, the great news is that France will have an EcoScore despite all the trouble you are seeing right now in France; it's in law. So that's the cool news. It's beginning to be experimented with in Belgium, at Colruyt. And it's also available in many European countries and the US. So yeah, we are having a more global discussion around it.
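Since the whole project is open data, the scores the app shows can also be fetched programmatically. A hedged sketch against the public Open Food Facts read API: the URL pattern and field names below follow the documented v0 product endpoint, but check the current API docs before relying on them.

```python
# Minimal sketch of reading NutriScore / NOVA / EcoScore for a barcode
# from the public Open Food Facts API (v0 product endpoint).
import json
from urllib.request import urlopen

API = "https://world.openfoodfacts.org/api/v0/product/{}.json"

def product_url(barcode: str) -> str:
    return API.format(barcode)

def extract_scores(payload: dict) -> dict:
    """Pull the three scores out of an API response dictionary."""
    if payload.get("status") != 1:
        raise LookupError("product not found")
    p = payload["product"]
    return {
        "nutriscore": p.get("nutriscore_grade"),  # 'a'..'e'
        "nova": p.get("nova_group"),              # 1..4
        "ecoscore": p.get("ecoscore_grade"),      # 'a'..'e'
    }

# Live usage (needs network access):
#   data = json.load(urlopen(product_url("3017620422003")))
#   print(extract_scores(data))
```

Keeping the URL building and the response parsing separate makes the parsing part easy to test offline with a canned response.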
In terms of impact, Open Food Facts has quite a lot. Because we are open data, over 250 projects, applications and services reuse the data to inform users, from questions on pregnancy to allergies, etc. Even big corporations use it. In terms of impact, it's a simple circle. We collect data using our mobile phones. People are more and more reusing that data to do many things, including scientific research. People get more educated, more mindful about what they eat. They start changing their behaviors, their purchasing behavior. And the whole industry actually starts to follow: the producers take notice, they change their recipes as a result, and everyone benefits. And the circle goes on and on. So from those kind of Photoshop or GIMP mock-ups that we did a couple of years ago, we went straight to this, where the NutriScore is everywhere. So yeah, you go from Perl code to real-life impact, where basically all newly introduced products start to change for the better. What you can also see across Europe is, for instance, the differences in the food offer. We have taken photos across space and time for 10 years, and we found out that the Fanta recipes change across Europe. For instance: Italy, 12% fruit; Serbia, 3% fruit; Portugal, 8% fruit plus high-fructose corn syrup; and 0% fruit in the French island of Réunion. So that's the kind of thing you can do with the data. We also have a giant map of food factories in Europe; that's "Made near me". All the packaging codes you see on food products, we actually collect, and we can map them. You can do benchmarks if you're into data; if you want to choose the perfect yogurt, you can. It's highly customizable: in 20 seconds you can do your own charts. We also have a platform for the food industry, to help them reformulate. We say: okay, here's an opportunity to reduce the sugar a little bit, and then you will get a better NutriScore.
So we compute all of that. And brands have started playing the game: some of the brands you consume every day are actually doing open data and sending open data to Open Food Facts. Even the big ones, like Unilever, even Ferrero with Nutella, are doing that. So they're starting to realize that consumer pressure is important. In terms of milestones: as I said, we launched NutriScore in 2015; we launched EcoScore more recently; and ultra-processing in 2018. So the project is a bit over 10 years old. And this year we crossed the three-million-products threshold, which is a nice milestone. We are now at 3.1 million monthly visitors on the website and the app, and contributors together have made 28 million edits since 2018, and it's still growing. The permanent team is growing. The community is much more engaged this year than it used to be: we were doing European meetups; we had our second Open Food Facts Days this fall; and we are also getting more people into coding. This year we also scaled app marketing to 40 languages, so that new users discover open data, open source and Open Food Facts. And we started getting into European events and trying to get a European community off the ground, and not just be a French project. On the manufacturers side, we introduced a few new features as well, and manufacturers are getting on board. And even more important to us is scientific use and reuse: we had 30 scientific papers in nutrition and machine learning based on the data in 2023, and we have increased the reuse a little bit as well. So, what's cooking for 2024? It's going to be a big year, first and foremost because NutriScore is going to change: the formula is going to become more strict. You know that Italy is trying to block it at the European level, and the scientists are overwhelmingly supporting NutriScore. Seven countries have adopted it, and now the question is whether it will become the European score.
The new formula is going to be more stringent: something like seven out of 10 products are going to change grade, and most of those are going to lose a grade. It will be a two-year transition in real life, but as soon as we start deploying it on Open Food Facts, the new computation will be applied to all products directly, even before producers do the transition. On mobile, it's going to be a big year. I'm going to go very fast because there are only four minutes left. We did a lot of user interviews this fall, and so we are going to make the app more pedagogical and improve search. Here's a screenshot of all the ideas from the community. We are going to improve the onboarding so that people better understand the scores. We are going to make the personalization engine more intuitive. We are going to make all the information more legible, with guides to go even further, for French people to start with. We are going to try and tackle the mineral water scandal. And improve search: thanks to the support of NGI, NGI Search, we are going to have live search in Open Food Facts. And this year, we are going to go beyond food. The thing is, we have had an impact on food, but there are many objects, like this projector or this chair, which have a life cycle, and at some point the owner decides they're not worth keeping anymore. As a result, we are surrounded by objects, but some of them no longer serve us or please us, and they end up in the incinerator because we fail collectively to give them a second or third life, to repair them, to fix them. Open Products Facts is all about that: giving open data to power the circular economy. So this year we are going to merge Open Food Facts with Open Products Facts, Open Beauty Facts and Open Pet Food Facts, so that you can scan anything on the planet and get solutions for it. And yes, people have asked us for that for years. We are also getting into price collection this year.
We started Open Food Facts with the question: what's in my food? But people also want to know: at what price? So we are starting Open Prices. Currently it's only a web app, and it's only five weeks old, so it's still a very experimental project; even the logo is experimental. But basically, adding a price takes 20 seconds: you scan the barcode, you put in the price details, you put in the location, and it remembers the two or three locations you inputted previously. And then you start to notice weird stuff, like price variations within the same city for the same product at the same supermarket chain, and nobody is able to explain why. We are also thinking that we could kickstart a European price collection and build the first European Nutella price index. We already have a few prices in Europe, but you'd be very welcome to add the prices at your favorite shop nearby. We would also, and this is more experimental, like to help people free their data from receipts. So at this point you are asking: how can I get involved in my country? We have broad European coverage already, but there's still a lot of work to do. So, how can you contribute? Scan and add new products: that's the most basic, but the most vital, way to contribute to Open Food Facts. Translations, word spreading, taxonomies and design: a lot of knowledge about food is required there. And if you develop in any programming language, hacking and fixing is welcome; there are many programming languages you can contribute in. The mobile app is in Flutter; we have some machine learning, Robotoff, in Python; we're even experimenting with LLMs, and, 60 seconds on the clock, Perl, Python, you name it, there's really something for you in there. So that's the QR code: if you want to become a volunteer, you can scan it or go to the Open Food Facts website. Also, we are going to apply to Google Summer of Code, so whether you're a student or an adult, you can join.
So if you want to become a mentor, or refer a mentor, feel free to do so. It's nice to have a large impact on food. We are independent from the food industry, by the way; we're not a startup or anything. We'd like to thank all the sponsors that are supporting parts of Open Food Facts; thanks to them for funding the infrastructure and everything. So, let's get in touch. Eight seconds on the clock: you have the contact email, my personal email, and you can install the app right here. Thank you.
Observations on a DNSSEC incident: the Russian TLD
Welcome everybody. My name is David and I have the pleasure of introducing Stéphane Bortzmeyer, who will be speaking next on observations of a DNSSEC incident: the Russian TLD. Hello everyone. I work for AFNIC, which is the .fr domain name registry, so I know one or two things about the DNS. Time to look at the problem first. This lightning talk appeared quite recently in the schedule, because everything happened on Tuesday this week. Many users noticed a problem with a lot of sites and services under the .ru TLD. TLD means top-level domain; .ru is for Russia. And there were many problems. Many people reported it as "I cannot reach Yandex" or "I cannot reach VKontakte" or some other service, but actually it was a very general problem with .ru. Everything with a name under .ru was down, it seemed. But some people said: okay, it still works, or, it works for me. You know that on the internet, because, as a previous speaker said, the world is not coherent, it's perfectly possible that some users say it's down and others say: hey, it works for me. In that case there was no apparent reason why it worked for some people in Russia, for instance, and not for others; outside of Russia it was the same thing. The problem lasted a few hours, three to four hours, which is a very common duration for an internet incident. Someone told me once that every internet incident is two hours of panic and five minutes to fix it. So, a bit of analysis of the problem now. I have something terrible to tell you: don't believe what you read on the web. A lot of bullshit. Many people don't know what they're talking about; they don't rely on facts. In this case, for instance, a lot of things are observable on the internet: anyone can run a DNS client, can run traceroutes, can try with curl or other software. So it's possible to have data, actual hard data. And yet some people prefer to immediately start writing anything on the social networks rather than collecting data.
So if we collect data, we can see that the problem was not with one website or another. When people said Yandex was down, no, it was not specific to Yandex. But it was a problem specific to Russia, and many people immediately started to assume that it had something to do with the war, that it was an attack by the Ukrainians, or a problem caused by Russia. So the first problem is that many people talk on the social networks without first gathering data. But there is another problem: many people reacted to this event not based on facts, but based on whether they were pro-Russian or anti-Russian. So they said it's the fault of Ukraine, the CIA, etc., or the opposite, it's the fault of Putin or Kadyrov or I don't know who. For instance, you can find in many articles published about this problem that it was because of Russian censorship, some censorship test that failed. There is no evidence supporting this. There is censorship in Russia, but in the specific case of the incident on Tuesday, there is absolutely zero evidence that it was an attack, and zero evidence that it had anything to do with Russian censorship. It was just a technical problem. So, to debug this sort of problem, let me spoil it immediately: it was a DNSSEC issue. But it was in the title, so you already knew. The best tool to debug DNSSEC issues, if you don't know it, is DNSViz. DNSViz is one of those few programs that are loved both by hardcore hackers and by managers. Hackers love it because it's technically sound and produces correct diagnostics, and managers love it because there are pictures. Here you can see the chain of cryptographic keys that were used in .ru. At the top is what is called the key signing key, which is referenced from the DNS root. The key signing key signs two other keys, which are called the ZSKs, the zone signing keys. One was inactive at this time.
It was the old one, which was soon to be retired but still published, because again the world is not consistent, which means that different parts of the internet see different things, so you have to keep information around just in case. On the new one, the active one, as you can see, there is a problem. Red is not because of Russia; it's because of a problem, in this case invalid signatures for all this type of data. So this was the heart of the problem: the zone was cryptographically signed, but with invalid signatures. The issue was at the .ru domain name registry, which is the organization in charge of the top-level domain .ru, unlike what many people said without any facts. It had nothing to do with the resolvers used by the internet access providers in Russia; the problem appeared for everyone. I had the problem at home, for instance, because the root of the problem was at the .ru domain name registry. Also, this registry, the same organization, is in charge of two other top-level domains, which were unaffected, again unlike what you can read in many articles about the problem. So, DNSSEC is a security technology. The idea is to cryptographically sign the DNS data so the resolver at the other end can check that the data is pristine, is correct, and has not been modified. In a way, and it was actually even in the official statement by the domain name registry, DNSSEC worked, because the signatures were invalid, so the resolvers, rightly so, rejected them. But you cannot see immediately that the signatures are invalid. You can query the DNS with tools like dig, drill, etc., but of course, unless you can do RSA or ECDSA computations in your head, you will not see that a signature is invalid; you have to trust the software. So why did it work for some people? Because not all DNS resolvers on earth validate.
I didn't try the resolver used on the FOSDEM network, for instance; I assume it validates. But many big internet access providers don't bother to validate, which means that if the signatures are incorrect, it doesn't matter, because they don't check anyway. Big public DNS resolvers like Google Public DNS do validate, on the other hand. Also, at home I have my own resolver, which validates, so I was also unable to see anything under .ru. But this explains why some people said: hey, it works for me. Sure, because the DNS is decentralized, which means that every resolver on earth does its own validation: some decide that no, it's broken, so you cannot access it, and some don't validate, so it works, in a way. So, the lessons we can take from this incident. One is that the DNS is important, I can even say critical. Most activity on the internet starts with the DNS, so for most people, not having the DNS is like having no internet. There have been some reports that, for instance, Russia was disconnected from the internet. Bullshit. It was easy to see that if you knew the IP address of a server, you could still reach it. But of course it's not really convenient; you cannot spend the day using ping and traceroute with IP addresses. So for most users it was exactly as if the internet in Russia was down, while it was only a DNS problem. So the DNS is critical; that's why the people who work to maintain the DNS should be paid much more, but that's another issue. Another important thing about the DNS is that domain names are organized in a tree with a root. So you can create top-level domains like .fr, .be, .ru, and then second-level domains, yandex.ru, etc. And because of this tree organization, if you break one node, everything under it is down as well. If you break something.com, every name under something.com disappears, and if you break a TLD, a top-level domain, big problem, because you break everything underneath.
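The "it's down" versus "it works for me" split above comes down to one branch in the resolver's logic. A toy model (not real DNS code, just the decision being described): the same broken zone gives SERVFAIL on a validating resolver and a normal answer on a non-validating one.

```python
# Toy model of why the outage looked inconsistent: the record is the
# same everywhere, only the resolver's validation behaviour differs.
def resolve(record, validating: bool):
    """record is (name, address, signature_valid)."""
    name, address, signature_valid = record
    if validating and not signature_valid:
        return "SERVFAIL"   # reject data whose RRSIG does not verify
    return address          # non-validating resolvers never check

# A .ru name during the incident: published, but with a bad signature.
broken_ru = ("yandex.ru", "77.88.55.88", False)

resolve(broken_ru, validating=True)    # "SERVFAIL"    -> "it's down"
resolve(broken_ru, validating=False)   # the address   -> "works for me"
```

Same data, two different outcomes, with no attack and no censorship involved, which is exactly the speaker's point.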
That's why domain name registries are extremely important. Also, cryptography is hard. We know it. It's hard to do properly and it's hard to debug. Software has bugs; I'm sorry again to have to inform you that software has bugs. So, would the internet be more robust if we could get rid of security measures? It's true that every security technique can turn into a denial of service. In the case of .ru, many people said: oh, okay, because DNSSEC was broken and access was denied, we should get rid of DNSSEC. That's exactly as if, when you find an expired certificate on an HTTPS website, you decide that checking certificates is a bad idea. It's the same for every security technique. If you lock your door when you leave and you then lose your keys, you cannot get back into your home; you have a denial of service, and yet people lock their doors for good reasons. So it's the same here. It's true that in this case a problem in a security technique caused a denial of service, but that doesn't mean we should get rid of security. Again, it's a very general trade-off with every security technique. Also, one important lesson, but you already know it: free software is great, because in this case, without DNSViz, debugging such problems would be much harder. Of course we could use tools like dig, drill, etc., but typically they don't make nice reports. It's not just the pleasure of a nice picture; it's also a good summary, and it allows you to see very quickly what was wrong. Some tools, like drill, which I use a lot, also reported the bad signature, but they report many other things as well, so it can be hard to pinpoint the problem. So DNSViz is really great. It can be used online, but it's also free software, so you can run it on your own machine if you want. Also, during the incident I used the RIPE Atlas probes a lot. These are small probes running free software.
Volunteers install them all around the world, so you can make distributed measurements. Again, the world is not consistent: you can have things that work in one place and fail in another, so you also need distributed monitoring of the internet, distributed debugging. And this is exactly what the RIPE Atlas probes provide. The software on the probes is free software, but typically you don't mess with it; the server side is not, so it's not really free software everywhere, but it's quite open, because not only can anyone install RIPE Atlas probes, anyone can also request measurements from the probes. And they can do everything that is needed to debug DNS and DNSSEC issues. Thank you. I'll be around if you have questions, or you can ask them in the Matrix room as well. Thank you.
A simple caching service for your CI
So, hello everyone. As I said, my name is Rémi Duraffort, I'm a principal tech lead at Linaro. I've been working on open source projects for a long time now, and I've been at FOSDEM for many years; it's not my first FOSDEM presentation. I've worked on the VLC media player and on V8, the JavaScript engine, and I joined Linaro some years ago, working on LAVA and on automation and CI in general. So today I wanted to talk a bit about a really tiny project that I created some years ago, which is called KissCache. And in order to present it, I have to explain why we are using KissCache at Linaro. At Linaro we contribute a lot to the Linux kernel, not only by developing new stuff, drivers and a lot of different things, but also by testing the Linux kernel. We have a project called LKFT, the Linux Kernel Functional Testing project. If you go to the website, it says that the goal is to improve the Linux kernel quality on the ARM architecture, because we are mainly about ARM, but not only, by performing regression testing and reporting on selected Linux kernel branches and on the Android common kernel in real time. Okay, that's what is written on the website. More or less, it's a project led by Linaro: an automated system that builds and tests a set of Linux kernel trees. We mainly care about LTS, obviously, mainline and next. And by contract we have an SLA: between an RC on the LTS trees and our report, we have 48 hours, which is quite tight. So, if you look back at 2023, we built and tested 396 different RCs on the LTS kernels alone. As we also care about mainline and next, we built 2,443 different kernel commits. That's 1.1 million builds: 1.1 million kernels were built by LKFT. And we ran 297 million tests in just one year. And if you look at the Android part, the Android common kernel, that's 580 million tests.
The tests run both on virtual machines, QEMU and FVP, where we have a specific system that can instantiate many machines in the cloud for running QEMU and FVP (that's the TuxSuite service that we created; we won't talk about it today), and in a physical lab, with physical devices in Cambridge, managed by a tool called LAVA, which I maintain inside Linaro. So, if you look at the LKFT architecture, really simplified, because obviously it's way more complex than that: as I said, we care about the LTS trees, mainline and next. We have GitLab repositories that simply mirror the different trees we care about. When there are changes, GitLab pulls them and creates a GitLab pipeline. The pipeline sends a set of instructions to our cloud building service, called TuxBuild, which runs the builds. It scales from zero machines to 5,000 machines in a few seconds, does the builds, shuts the machines down, and then sends the artifacts to an S3-like storage. The artifacts are the kernel, the DTB, the root file system, the modules, etc. Then these artifacts are pulled by our lab in Cambridge to be tested on real devices. In the lab in Cambridge we have some hundreds of boards: Raspberry Pis, DragonBoards, HiKeys, X15s, etc., a lot of different boards. And at the same time, they all pull the artifacts, deploy them on the hardware, depending on what kind of hardware it is, run the tests and then report back. And obviously everything runs in parallel and downloads from the same storage. So our CI system, as I said, builds and tests artifacts: kernel, DTB, root file systems, modules, etc. And each kernel, DTB and root file system is used multiple times, because when we have one commit from the kernel, we build it for multiple architectures: x86, ARMv7, ARMv8, ARMv9, PPC, SH4, MIPS, etc. Then for each architecture, we have multiple configurations.
I want to build with some virtio-specific configuration, I want to build in debug, in release, etc. And then for each commit, architecture and configuration, I run a set of tests: kselftest, KUnit, libgpiod, LTP, etc. Considering that LTP, for example, is broken into 20 different test suites, that will be 20 different test jobs, because it takes a lot of time to run. So the CI system runs a lot of different test jobs that all pull the same artifacts all the time, which means that on the network in the lab in Cambridge we have a lot of network usage and a lot of duplication: we are always re-downloading the same artifacts. So that should be a really simple thing to solve: you just add caching. But, and I really stress this because it's really important, our CI system, the LAVA workers, download the same artifacts multiple times, at the same time, in parallel. So, if you look for a caching proxy in the open source community, you will obviously find that Squid is the main caching proxy, and it's a perfectly good one, it really works well. So you would just install it on your network, point all the workers at it, and it should work. The short answer is: no, it doesn't, for the two reasons above, and also for another reason, this one: all artifacts, as I said, are published in an S3-like bucket, somewhere in the cloud. So obviously, if you want to download them, you download over HTTPS. You will not download a random binary from the internet and run it in your local lab for testing; that's not something you would do. So we have to validate; we use HTTPS to be sure that what we're downloading is what we're expecting. At least we are trusting the software. But when you add a Squid proxy to the connection, it does not work well with HTTPS. It's written in the Squid documentation: you can make it work, but it's not easy.
The main problem is that, as an HTTP client, when you connect to a website over HTTPS, you expect to get a certificate; the connection will be encrypted with that certificate, and you have to trust it. When you add Squid in the middle, Squid connects on your behalf to the server. The connection between Squid and the website is encrypted correctly; the certificate returned by the website is a legitimate one, so that part works. But when Squid has to decrypt the content to cache it, and then re-encrypt it to send it back to you, it does not have the private key of the website's certificate, obviously. You don't have the private key of google.com on your machine, so you cannot re-encrypt the traffic. So Squid needs its own certificate, and it encrypts the traffic with its own self-signed certificate. And you will obviously not trust it: you will not trust your local Squid proxy to sign something from google.com or AWS or any website or the Linux Foundation. So when the HTTP client receives the custom self-signed certificate, it just says: no, I don't trust you. There is a workaround, and it's written in the Squid documentation, obviously, which is to create a wildcard certificate, a certificate that is valid for absolutely every website on the planet, every DNS name, so it's a rather dangerous certificate, and install it on every one of your HTTP clients. It's possible, but it's really crappy, honestly. That's the first problem. The second problem, and there is no way to work around it, is that when you try to download the same artifact multiple times through Squid, for example two connections downloading the same rootfs, Squid will download it twice and stream it back to both clients at the same time. Only when a download is finished will a third connection get the cached version; as long as it's not cached locally, Squid re-downloads it from the start.
And as I said before, our system by design runs everything in parallel, so it's often the case that we have multiple downloads of the same artifact at the exact same time. So when using Squid, it was just not caching anything, sorry. So that's why we created KissCache. KISS stands for "keep it simple, stupid": it's a pretty simple and stupid caching service. But the main features it has are exactly what we need for a CI system: it can cache HTTPS resources without any hacks, and it downloads only once, even if you have multiple clients, who all get the stream of data back. The reason why it works for both cases is that it's not a transparent proxy. The clients don't learn from an environment variable that they have to go through a proxy; instead, you have to prefix your URLs. If you want to access, say, example.com/rootfs.ext4, you have to prefix that URL with your KissCache instance. So even if you're downloading over HTTPS, your client knows it's talking to KissCache and not to example.com, so it expects a certificate from KissCache, not from the original website. That's the first reason. And we also made KissCache able to stream the same content back to multiple clients. Fun thing: we also added a lot of automatic retries inside the KissCache backend. So if for any reason, and it happens a lot, the connection between your network and the S3-like bucket breaks, and it often breaks, honestly, the KissCache backend will automatically retry. There is a list of HTTP codes that we retry automatically. And when it retries, it retries up to 50 times over a period of two hours, because we have exponential backoff. So sometimes a download will actually take two hours and 50 retries, just because the S3-like bucket is sometimes a bit buggy to answer.
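The URL-prefixing idea above can be sketched in a couple of lines. The exact path scheme (`/api/v1/fetch/?url=`) and the instance hostname below are illustrative assumptions, not necessarily the real KissCache routes; check your instance's documentation for the actual prefix format.

```python
# Sketch of the non-transparent proxy idea: the client explicitly
# targets the cache and passes the real URL as a query parameter.
# The "/api/v1/fetch/" path and hostname are illustrative only.
from urllib.parse import quote

CACHE = "https://kisscache.example.invalid"

def cached_url(url: str) -> str:
    # The client connects to CACHE, so it sees (and can validate) the
    # cache's own TLS certificate. No man-in-the-middle interception
    # and no wildcard CA certificate are needed, unlike with Squid.
    return f"{CACHE}/api/v1/fetch/?url={quote(url, safe='')}"

u = cached_url("https://example.com/rootfs.ext4")
```

Because the original URL travels as an opaque parameter, the cache can fetch it over HTTPS itself and verify the origin's certificate on the server side.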
We also added partial downloads: when we do a retry, if the HTTP server supports it, we only download the remaining content, not from the start. And the good thing is that, thanks to the automatic retries, the client never sees that the connection broke, because the connection from the client to KissCache is kept alive; only the backend sees the network issues. It has been in production for 3.5 years. It downloaded 32 terabytes of data from the internet and served 1.6 petabytes of data locally, for a really small, tiny piece of software, which is an expansion ratio of 51. So we divided the network usage by 51 just by having a small caching proxy. It also improved stability a lot thanks to the automatic retries, up to 50 retries, which is insane. And it lowered the S3 egress cost a lot, because you have to pay for egress in the cloud, and for 1.6 petabytes of data that's a lot of money. So yeah, we saved around 150K euros just by having a local proxy. Since I have just two minutes left, a quick look at the global architecture of the service: it's a Django application with a Celery backend. You have a reverse proxy, nginx; it can in fact be any reverse proxy. It receives an HTTP connection and passes it to Gunicorn, which runs the Django application. Django looks at the database, Postgres, to know whether the artifact has already been downloaded or not. If it has, it looks at the file system and just tells nginx: please send that to the client, and I'm done with it. If it's not already downloaded, it sends a message through Redis that spawns a Celery task, which actually does the download and the retries in the backend, and it's done only once, saving to the file system, appending to a file byte by byte.
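The download-once, append-to-a-file idea can be sketched with plain Python threads. This is a toy illustration of the pattern, not the actual Celery/Django code; all names here are made up:

```python
# One writer appends chunks to a file (the "download" task); a reader streams
# bytes as they appear and drains the rest once the download is flagged done.
import os
import tempfile
import threading
import time

done = threading.Event()

def writer(path, chunks):
    """Simulate the background download appending to the file byte by byte."""
    with open(path, "ab") as f:
        for chunk in chunks:
            f.write(chunk)
            f.flush()
            time.sleep(0.01)  # simulate network pacing
    done.set()

def reader(path):
    """Stream bytes as they become available, then drain after completion."""
    out = b""
    with open(path, "rb") as f:
        while not done.is_set():
            chunk = f.read(4096)
            if chunk:
                out += chunk
            else:
                time.sleep(0.005)  # wait for more bytes to be appended
        out += f.read()  # download finished: read whatever is left
    return out

path = os.path.join(tempfile.mkdtemp(), "artifact")
open(path, "wb").close()  # create the empty file before readers attach
t = threading.Thread(target=writer, args=(path, [b"ab", b"cd", b"ef"]))
t.start()
data = reader(path)
t.join()
```

A second or third reader attaching mid-download would simply start reading the same growing file, which is the whole trick.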
And at the same time, the Django process just reads the file on the file system and sends the bytes as they become available, waiting for the file to be finished. And if a second, third, or many different users arrive for the same file, they just reuse what is already available on the file system and wait for the download to finish. And that's all. It's pretty simple and efficient, it has been really useful for us, and it might be useful for your CI system. So if you have any questions, I will be here after the talk. Thanks a lot. Thank you.
Reinventing database exploration with Azimutt
Welcome everybody, let's get started on the next session. My name is David and it is my pleasure to introduce Loïc Knuchel, who will be speaking on reinventing database exploration with Azimutt. Thanks a lot. Hi everyone, thanks a lot for coming to my talk. Indeed, I will talk about Azimutt and how we can explore a database with it. My name is Loïc Knuchel and I am a principal engineer at Doctolib. Basically, the whole talk is a story about how I started at Doctolib and ended up here talking to you about Azimutt. Three years ago I joined Doctolib; if you don't know it, it is a French company in healthcare, allowing patients to book appointments with doctors. It is built as a big monolith, on Ruby on Rails, backed by a PostgreSQL database. Basically, it is a huge monorepo and also a huge database, with 800 tables inside and several petabytes of data. As an architect, I joined Doctolib to work with the teams and help with architecture, improve the code, and things like that. But for that, I have to understand what is inside the database: what the models are and what the relations are. The thing with Ruby on Rails is that you don't define the properties inside the models; you just define the relations, but often the models are quite long. They can be 1,000 lines long, and sometimes a relation sits as far down as line 100 or so. That is not really convenient, and I had to look inside the database a lot to understand what things are and how they work. Basically, that was me working at Doctolib for the first month, and obviously this is not very friendly. I had to find a tool. I looked at a lot of tools. They are called ERDs, entity-relationship diagrams, and they show tables with relations as nodes. As you can see, this is not very friendly. Here there are ten tables; imagine 800 and you will have some trouble. I tried quite a bunch, just for you to have a look at what they look like. Basically, my search failed, for a few reasons.
The first one, and most of them fail on this, is that all the tools I could find show everything. When you show 800 tables, you don't understand anything. The second one is that most of them don't have an SQL or database import. The last one is that they are not private: basically, I had to upload the schema to a service, and I didn't want that for Doctolib. Basically, when we are developers and we are in this situation, we build another tool. That's what I did, with the big goal of making it easy for large databases, like 800 tables again. You may see tables with a lot of columns, like 100 or so sometimes. Also, this is not just for us: it is designed to stay local, just in your browser, not sending any data to a service, and of course open source. The first part was schema exploration. When you load your schema into Azimutt, you don't see anything: just a search bar and an empty screen with some suggestions. The goal is to look for tables with the search and load only the tables you are interested in. Mostly, if you are working with a big database, you don't want to see everything; you just want to see one to ten tables around your scope, your feature, or something like that. You can make some nice diagrams like this, choosing the tables and the columns you want to show. Also, you can navigate from one table to another following the relations: obviously the foreign keys, the outgoing relations, but also the incoming relations coming from the primary key on the other side. That's pretty nice to expand your diagram and explore what's around. Of course, since you don't see everything, you will want many layouts: one per scope, discovery, team, or anything you want, several layouts of your database to understand it. The last thing is that sometimes in the database you don't have foreign keys for all the relations, sometimes for performance reasons, sometimes for reliability.
There are a lot of ideas around that, but sometimes you don't have the relations as foreign keys. So Azimutt can infer and suggest them directly inside the diagram. The last feature of the schema exploration is "find path". If you want to join data from one table to another and you don't really know all the tables in between, it can be a good help. Basically, when developing this feature, I was very surprised at how many paths there are; you would be surprised too. So that's also a good reason to have a look at it. The second thing is that when people started using Azimutt, at first read-only on the database schema, they wanted to draft new features in it, basically doing some more design for the database. So I made a DSL with the explicit goal of being very simple. Here is a bigger version if you want to read it: you just write the table name, and then the column names with two spaces before. Then you can add some attributes, like the types, and primary key, unique index, nullable, and things like that. The goal of this DSL is to be very simple and very quick to write, to go as fast as your thoughts flow and your fingers can type. When you do this kind of exploration, sometimes you make discoveries, and you want to write them down somewhere, maybe for your colleagues but also for your future self, for the next exploration. So there is a lot of documentation support. Of course, the SQL comment on the table is loaded and accessible in Azimutt. There are also notes on the table; this is the same idea as the SQL comment, but inside Azimutt you can view and edit them easily. There are also tags to find things easily, and of course the same exists for columns: the SQL comment, and notes you can add. The notes are in Markdown, so you can do formatting, with images if you want, links, lists, and things like that. For the layouts, you have one layout for whatever you want, and you can document them with memos inside.
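The "find path" feature described above amounts to a shortest-path search over the relation graph. A minimal sketch, using a made-up schema rather than Doctolib's:

```python
# Treat tables as nodes and foreign keys (navigable in both directions) as
# edges, then BFS for the shortest join path between two tables.
from collections import deque

# Hypothetical foreign-key relations: (child_table, parent_table).
relations = [("appointments", "patients"), ("appointments", "doctors"),
             ("doctors", "clinics")]

graph = {}
for a, b in relations:  # relations can be followed both ways
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def find_path(src, dst):
    """Return the shortest table-to-table path, or None if unreachable."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_path("patients", "clinics"))
```

In a real 800-table schema the surprise the speaker mentions comes from how many such paths exist; a tool would typically enumerate and rank several, not just the shortest.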
Same thing, with Markdown: you can put images, links, whatever you want, to explain the whole schema or some part of it, and you can set a color behind. You can also have table groups to show that tables belong together in the same context. That's how you can do documentation in Azimutt. The last part, which I built not long ago, is data exploration. Before, we were only working on the structure, the database model, but sometimes you want to go a bit deeper and understand the data inside the database, how it works and what you can do with it. I think this is quite interesting. When you open the details sidebar for a table, you have all the details, but also all the columns with a sample of the data inside. This is randomly picked data, not just one row with everything; I avoid nulls and things like that, so you have interesting data shown here. The same for a column: when you open the sidebar for a column, you see the most used values, the count of rows, the cardinality, the number of nulls, and things like that, to get an idea of what is inside this specific column. That's the quick access, but you can also run full queries. We have a visual editor for very simple queries, like one table with some filters, but you can also write any query you want and get the result. Basically, you get all the results on the right in a list, so you can see different result sets, with some nice features to filter, to sort, and things like that. The most interesting one is this small arrow here: you can click on it and see the related row in a side panel. Here I selected all the events; in this CFP database the events are linked to a group, and you can see in one click that it's the HumanTalks Paris group which is the linked row for this event. This also works in a nested way, so if you scroll down and see other relations in this sidebar, you can have multiple sidebars stacking up to navigate from one row to another.
Basically, this was quite interesting, but the very nice thing here is that you can add this specific row, one row of data from a table, to the diagram. You can add it to the layout and see this row specifically: it's not a table anymore, it's a row of data, with of course the table name and the columns, but with the values of that specific row. You can refresh the query to get fresh data. Same as in the layout, you can navigate through the rows inside the data. If you click on a primary key, you will see all the linked tables, and for each table all the linked rows, with a maximum of 20, because sometimes it can be very expensive; for example with events, or if you have some tracking things, you can have thousands of them. Basically, you can easily see the linked rows, the incoming links to this specific row, and not just the table in the schema. And then if you click on a specific one, you can show it. And the same goes for foreign keys: if you click on an outgoing relation, you can just show the related row. This allows you to make some nice diagrams with not only the tables of the schema but also actual data from your database; sometimes it's interesting to show that you have several rows from the same table, like here. And of course you can mix both on your layout: the schema, so the tables above, and the rows below. This is very small, it's not intended to be read, but on the right you can see several different rows of the same table, in light blue. So I think that's a very interesting way to navigate the data. If you want to try it out, it's available on azimutt.app, but there is also a nice CLI to load almost any database, so you can just run npx azimutt explore and then your database URL. It can of course be a remote URL, but also a local one: it will start a gateway on your machine, which is just a Node server proxying the queries to your database.
So it also works with local databases, which makes it, I think, one of the only tools that can do that. So thanks a lot; you can try it on azimutt.app. It currently works with several databases: the major relational databases but also some document databases. And for relational databases, when you have a JSON field, a JSON column, it inspects the column by selecting 100 non-empty rows and infers the schema from them, so you see the schema of your JSON column directly inside Azimutt. This project is fully open source; I have been working on it for a bit more than two years, and I intend to develop it a lot more in 2024. So if you are interested, there is a survey with a QR code, and I would be happy to have your feedback on what you thought about what I presented, but also on your current problems with databases and what you expect from a tool helping you interact with your database, and so on. Thank you all; there are still two minutes, so maybe I can take one or two questions. Question: is there any limitation when you explore a really large database? Yeah, there are several things. It's made for big databases; the typical schema is around 100 tables, and the biggest I have seen is maybe 1,000 or 1,500 tables, so there is no issue extracting the schema. There is more of an issue, and that's something I will address soon, when you explore the data: if you have a lot of data inside your database, the quick preview of values for tables and columns can be a bit slow to fetch, but beyond that you just run your own queries. You will have performance issues if you run queries that scan a lot of data, but those queries run on your database, which is not related to Azimutt.
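The JSON-column schema inference mentioned in the talk above can be sketched like this; the merging rule (collect the set of value types seen per key across the sampled rows) is an assumption about the general idea, not Azimutt's exact algorithm:

```python
# Infer a rough schema for a JSON column from a sample of non-empty rows:
# for each key, record which value types were observed across the sample.
import json

def infer_schema(samples):
    """Merge key -> {type names} over all sampled JSON documents."""
    schema = {}
    for raw in samples:
        doc = json.loads(raw)
        for key, value in doc.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return {key: sorted(types) for key, types in schema.items()}

rows = ['{"name": "Ada", "age": 36}', '{"name": "Grace", "admin": true}']
print(infer_schema(rows))
```

Sampling only non-empty rows, as the speaker describes, avoids the degenerate case where most rows hold null or `{}` and the inferred schema comes out empty.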
Documenting and Fixing Non-Reproducible Builds due to Configuration Options
Good afternoon, everyone. Next we have Aaron, speaking about documenting and fixing non-reproducible builds due to configuration options. Thanks. So hello, everybody. My name is Aaron. I'm a PhD student at the University of Rennes, doing research in the DiverSE software engineering research team of Inria and IRISA in Rennes, France. Today I'm going to talk about reproducible builds and software configurations. So what are reproducible builds? I took this definition from the paper "Reproducible Builds: Increasing the Integrity of Software Supply Chains". It says that the build process of a software product is reproducible when, given a specific version of the source code and all its dependencies, every build produces bit-by-bit identical artifacts, no matter the environment, and I think that's a really important point. To achieve reproducible builds, there is a set of guidelines on the Reproducible Builds website, such as how to have deterministic build systems, what not to ship in the binary, or even how to distribute an environment, set certain environment variables, and so on. Let's take an example with Linux. I go into the source tree I've downloaded and generate a configuration of the kernel; here, in this case, a tiny configuration. Then I just build it. Once the build is done, I have a binary called vmlinux that I keep in /tmp; then I clean everything up and reproduce the process. tinyconfig run twice produces the same configuration. Now I compare the products of these two builds using diffoscope, a tool provided by the Reproducible Builds initiative. So what happened? Just because I built the two binaries a few seconds apart, I have two binaries that are different, not bit-by-bit identical. So, following the guidelines, I can set values for environment variables of the build system, in this case Kbuild.
So I can give a fixed date, for instance the 1st of January of this year, and now I get a bit-by-bit identical binary. The question is: in Linux we have many different sets of configurations. We have the default configurations per architecture, allyesconfig, allmodconfig, and so on, and especially randconfig, which sets configuration options randomly. So, do I just need to fix all the reproducibility issues in Linux with this one trick? We can look at the documentation. The Kbuild trick is of course written in the documentation, but the documentation also emphasizes configuration options; here we have six of them. Just as a reminder, in the kernel you can set an option to yes, no, or module, to ship it or not. So here we have a list of six configuration options. But is that all? As of the latest version of the kernel, I think there are more than 19,000 configuration options. So are there really only six options that have an impact on the reproducibility of the kernel among all of them? To answer this question, we took a kind of brute-force approach. We generate a set of random configurations, as you can see here on the left, then we build them in the same environment: we have a fixed Dockerfile, and each build runs in a freshly created container. Then we compare the binaries. We don't compare all the intermediate files of the build, just the final binary: we simply do a diff on the binaries and collect the results, as you can see here. There is a way to encode the configurations in a tabular representation: one row per configuration with all the configuration options, where zero means no, one means yes, enabled, and there is a value for module where it exists. We feed all this data to a classification algorithm and get out the outlier options that are responsible for the non-reproducibility.
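The tabular encoding and outlier-hunting step can be illustrated with a toy example. The real study feeds 2,000 configurations to a decision-tree classifier; this sketch, with made-up option names and data, just scores each option by how strongly enabling it associates with non-reproducible builds:

```python
# Each build: (configuration as option -> 0/1, reproducible?).
# Toy data, invented for illustration only.
builds = [
    ({"DEBUG_INFO": 1, "MODULE_SIG": 0}, False),  # non-reproducible
    ({"DEBUG_INFO": 1, "MODULE_SIG": 1}, False),  # non-reproducible
    ({"DEBUG_INFO": 0, "MODULE_SIG": 1}, True),   # reproducible
    ({"DEBUG_INFO": 0, "MODULE_SIG": 0}, True),   # reproducible
]

def suspicion(option):
    """Non-reproducibility rate when the option is on, minus when it is off.
    A score near 1.0 flags the option as a likely culprit."""
    on = [repro for cfg, repro in builds if cfg[option] == 1]
    off = [repro for cfg, repro in builds if cfg[option] == 0]
    def rate(xs):
        return sum(not r for r in xs) / len(xs) if xs else 0.0
    return rate(on) - rate(off)

scores = {opt: suspicion(opt) for opt in ["DEBUG_INFO", "MODULE_SIG"]}
print(scores)
```

Here DEBUG_INFO scores 1.0 (it perfectly separates the two classes) while MODULE_SIG scores 0.0; a decision tree trained on the same table would likewise split on DEBUG_INFO first.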
Then, from that list, we have an exploration phase, which I will explain a little later, where we enrich the list we got from the classification algorithm. Then we have a fix phase, and the idea is, if the options are indeed responsible for the non-reproducibility, to add them to the documentation. So this is the setup: we have 2,000 configurations for each system we study, so the Linux kernel, but also BusyBox and ToyBox. We generate random configurations, with a preset for x86_64 for the kernel. For the environment, we derive from the TuxMake image, and we fix all the environment variables so they don't vary during the build, like the timestamp and so on. Here is one of our first results: for Linux, 47% of the builds were non-reproducible. For BusyBox we have two cases: a first case where we didn't vary the environment, that is, the build path, and a case where we did vary the build path. We wanted to showcase that there is an interaction between two layers, the configuration and the build path. To solve it, you can choose either to fix the build path or to disable the debug configuration option; it's up to you. But if we enable the debug configuration option and vary the build path between two builds, we get 49% non-reproducible builds. And ToyBox was 100% reproducible in our study. So now, who is to blame? For the Linux case, here is an example of the decision tree we got from the process, with five configuration options. What we do is not consider the tree structure as such: the tree says that if I disable MODULE_SIG_SHA1, the next responsible option is GCOV_PROFILE_FTRACE, and so on. Here we just flatten everything and consider each configuration option as independent.
And so we have this list of five configuration options. MODULE_SIG_ALL is a similar option that is already in the documentation, but the rest of them are not in the Linux documentation. Then comes the exploration phase, where the main idea is to identify all the options of the same kind. In the documentation we saw some configuration options about module signing: MODULE_SIG, MODULE_SIG_ALL, MODULE_SIG_KEY, and so on. Here the idea is to identify the siblings of an option: if I disable one option, there is another alternative of the same kind, and we explore all the alternatives. A good example is MODULE_SIG_SHA1: if I disable it, I have to enable SHA224 or SHA256 and so on. Once we have all the siblings, we use the naming convention in Kconfig to get the parent, so we know that if I want to disable a specific option, I may have to disable its parent. And now, on to fixing each configuration option. The idea is to remove all the detected configuration options from the initial configuration, which is sometimes a hard task in the Linux kernel, because we have to get all the dependencies of the configuration options. To detect the dependencies of each configuration option we want to change, we use a tool called ConfigFix, a SAT-based solver presented in detail in this paper here. It gives a list of conditions to satisfy, each being a configuration option and the value it must take, in order to apply a change. Once the change is applied, the change being simply setting the option to no in the configuration, we build again and check for reproducibility. With the list we got, we were able to make 96% of the non-reproducible builds reproducible.
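The sibling-identification step based on naming conventions can be sketched as follows. Stripping a trailing variant number is a simplification of what the Kconfig naming analysis actually does, and the option list below is illustrative:

```python
# Group Kconfig options that differ only in a trailing variant number
# (e.g. MODULE_SIG_SHA1 / SHA224 / SHA256), since disabling one of them
# usually means enabling a sibling of the same kind.
import re
from collections import defaultdict

options = ["MODULE_SIG_SHA1", "MODULE_SIG_SHA224", "MODULE_SIG_SHA256",
           "DEBUG_INFO"]

def siblings(opts):
    """Return stem -> options for every group with more than one member."""
    groups = defaultdict(list)
    for opt in opts:
        stem = re.sub(r"\d+$", "", opt)  # strip the trailing variant number
        groups[stem].append(opt)
    return {stem: members for stem, members in groups.items() if len(members) > 1}

print(siblings(options))
```

The shared stem (here `MODULE_SIG_SHA`) also hints at the parent option to disable, which is the second half of what the exploration phase derives from the Kconfig naming convention.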
We had 31 configurations, so 3.5%, that were still not reproducible due to some dependencies we couldn't identify; that's one of the limits of the approach. And for less than 0.5%, the tool we used couldn't find a diagnosis. But compared to the first result I showed, we went from 47% non-reproducibility to 1%. So now, the summary. One of the takeaways is that options matter: we should explore more the impact of configuration options on the reproducibility of builds. The second takeaway is that there can be interactions across variability layers, as I showed for BusyBox, so we also need to detect them, pinpoint them, and describe them precisely in the documentation. We have identified more configuration options that could be added to the documentation, so we will send a patch soon, and maybe some of them could even be removed. In the end, 96% of non-reproducible builds were made reproducible. If you want more detail on the whole approach and the rest, it will be presented at the Mining Software Repositories conference, an academic conference that will happen in Portugal in April. Thank you for your attention.
Platform engineering for dummies
Great. So good afternoon everyone. Next we have Donnie Berkholz introducing platform engineering for dummies. Thank you. Super excited to be here today. It's been a number of years for many of us since we've been at a FOSDEM in person, so welcome back; I was very happy to be here. I got myself a very nice Belgian beer as soon as I arrived, so I'm feeling great right now, all ready for my talk. Only one for now, just one; the rest will come later. And I'm assuming none of you are actually dummies, so thank you for coming to this talk. This is just for people who have heard the term platform engineering. It's getting increasingly popular; it's the only thing people talk about besides AI these days, and we're going to mostly skip that one. We're going to talk about what it is, how vendors are completely destroying the term, just like they do with everything, and then how to get started with it yourself, making it as easy as possible. You don't have to buy vendor solutions; you can use open source off-the-shelf software. It doesn't even have to be custom and brand new. So by the end of this talk, you'll have a really good sense of platform engineering, at least as good and as deep as you can get over the course of the next 12 or 13 minutes. You'll have a lot of good resources; I've got links on a couple of the slides as well, so you can go check those out afterwards. Because it's not just about the technology: it's also about the people, and it's also about the process. There are a lot of different pieces you have to get right; in fact, the technology in many cases is the easy part. But first, a very short story. A few years ago I worked as a technology leader leading a DevOps transformation, that's what we called it at the time, we would now probably call it platform engineering, at a travel tech company called Carlson Wagonlit Travel, CWT. It actually had an office here in Brussels; I visited a few years back, great place.
Lots of interesting development happening there. Since then, I have led product management and products at Docker and at Percona, around open source, containers, and databases. I've spent a long time in the platform space. Long story short, I know what I'm talking about; I've been doing platforms for 20-plus years at this point, as have many of you. I'm just sharing my own story and my own perspective here; I'm sure many of you have your own. When we think about platform engineering, or at least the way I look at it, there are really three key pillars: platform operations, platform as product, and self-service for developers. We're going to jump into each of those pillars and talk a little more about what it means. If you want to check this out afterwards, I have a blog post about it on my little independent-analyst blog; feel free to read that at your leisure. What does platform operations mean? There are a lot of companies today... in fact, how many of you come from a large enterprise? Do you have something called a platform team? Does it maintain maybe the Linux OS, maybe some other OSes that we won't talk about, some things like that? It just got called the platform team at some point; it might have been the OS team, and before that maybe it was merged in with the network team or something like that. When we talk about platform operations, we really mean operating it as a holistic platform, regardless of how many servers, VMs, or containers might be underneath it.
It's the same thing we talked about ten years ago with cloud, and five years ago with DevOps: moving away from the pets mindset into the cattle mindset, away from the single server or single container named after our favorite characters or favorite TV shows, into the mindset that these things are fungible and disposable; we operate them as applications and fleets, automatically created and deleted on demand. We're in the world of SRE now, moving more and more into things like SLOs: how do you monitor the user impact of the applications you're serving? In this case we're talking about platform engineering, meaning building for developers, but even if you're serving a platform to internal developers, you still have to care about the quality of service you're giving them. You still have to care about your latency, your error rate, and how much of your capacity you're using at any given moment. You have to treat those internal applications just as importantly as the ones you serve to your external customers and users. A lot of companies don't do that. They'll have their tier-one, business-facing applications, which get major-incident treatment, spinning up war rooms and all that kind of thing when there's an outage; but if their CI pipeline goes down, they say, oh well, it'll be back eventually, it'll be fine, we can just have our developers doing nothing for most of a day, no big deal. A lot of companies are still like that, but we have to apply this platform operations concept not just to our external customer-facing applications; we have to treat developer productivity as business-critical in its own right, because developers are expensive. Having them sit there for a day, unable to ship software, is expensive. And we went through exactly this journey at CWT.
One good example of this: we started out monitoring tens of thousands of different infrastructure metrics, the classic old-school world of monitoring, and we shifted to just a handful of user-facing impact metrics. But along the way we had to educate our developers and operations teams on how to debug things in a much more complicated way than they were used to, because with an infrastructure metric you could have a simple runbook: you see this thing, you push this button, done. Whereas if you have a metric saying "my application is slow", there are a lot more potential causes and a lot more you have to learn to dig into it. So at the same time as we made this transition in technology, we also had to upskill a lot of our level-two operations people and have them become SREs in their own right, learning how to automate things and how to debug things much more deeply. Now, the second pillar is platform as product. What I mean is that for things like your internal CI pipelines, your container services, and whatever other internal developer tools and services you might have, you have to apply the methods of product management.
You don't have to have a full-time product manager. If you do, fantastic, you're lucky and fortunate, and congratulations. But if you don't, there are a lot of different people who can pick up some of that load and learn how to do modern digital product management. Depending on how traditional your company is, you might even have people called service managers, who might use a framework called ITIL to talk about things; those people still have the potential to modernize, get with the times, and apply modern product management approaches. That means talking to your internal stakeholders and understanding the problems they're trying to solve. In many cases they might be providing a service; source code management, for example, is a service you provide to your developers, and there's probably a team running it inside your company if you're at a big company. Do those people talk to their own developers about the problems they're trying to solve and what their workflows look like? Chances are they don't; they just shove stuff at them and say good luck. We're fortunate to have better tools than we used to, but there's a lot of opportunity for people on these central platform teams or central developer-productivity teams to go talk to their own developers about the problems they're trying to solve during their day, understand their pain points, and bring that back in. At the bottom I've shared a handful of links, at varying levels of depth, that are super good resources if you want to learn this or share it with other teams. There's an entire specialization on Coursera that will probably take somebody six months at an hour or a few hours a week; there's a great book by the same person who put together that series of courses; and there's a website you can just go read for free to start checking it out right now.
In every one of those cases they aren't written for platform-as-product people, and they aren't written only for internal product management; they're written for anybody doing modern product management, so you have to do a little extra work to think about what it means for you specifically, but all of you are smart people, you can figure that out. Applying this platform-as-product approach is absolutely critical to doing platform engineering right, and nothing about it requires a specific piece of technology; nothing about it says proprietary versus open source. This is the people and process side of it, but you have to get it right, because if all you do is say, hey, we gave you a platform, now we've got platform engineering, you're wrong. What probably happened, especially if you're at a big enterprise, is you still have a ticketing system somewhere, and you're still requiring developers to file a ticket every time they want access to some new resource. Whereas if you're getting platform engineering right, you're moving away from that, because you've talked to your developers, you've understood their needs, and you've probably moved to something much more policy-driven. There might be an initial ticket, but the only thing it does is assign the developer a role, "I'm working as a developer", or "I'm working as a developer in a certain application area", and then they're granted policy-driven access and can move on and get on with their lives. Instead of filing a ticket every single time they need access to a new server, every time they need a VM created, every time they need additional memory provisioned to the VM; all these things are crazy. In many cloud environments they have been partially solved, but a lot of us are still working on premises, with servers in data centers or in colos, or working in clouds that feel like that. In every one of those cases this is an
opportunity to make dramatic improvements in our own productivity as developers. One example of this from my own experience at CWT was when we applied this approach to a really novel area. One of the teams that reported to me was the major incident command team: every time stuff got really, really bad, it was like the fire department, you'd call them in, they'd run the issue and run it through to conclusion. Now, that team had to send out a lot of different communications to a lot of different audiences. They had to send things out to our internal executives, to all the employees who were being affected, and some things out to our customers as well. All those communications were things that hadn't really changed for a long time, and we had to get a lot better at them. There were all kinds of complaints coming in from these different audiences because it was a one-size-fits-all approach; things had gradually evolved very organically, and there wasn't a clear way to understand who should get what. So we applied these Platform as Product style approaches to the communications going out from the incident command team and made dramatic improvements by doing things as simple as going out and regularly talking to the people who need to consume this stuff, to understand when they need it, what they need, and what they need to understand so they can turn around and make the right decisions, or do their jobs more effectively, or tell their own customers, the people who actually pay us as a company, what we need to do, what they need to do, how long they might need to wait, when to try back, and what their alternatives might be. What was interesting, too, is that we did this in a very lightweight, prototype sort of fashion. Of course we had a technology solution for sending all this stuff out, but instead of using
that and using our developer time to sit there and iterate and work through backlogs, we literally just wrote a heavily formatted email by hand and started sending it out, and used that as a tool to iterate on what the product should look like. So we put together this email and we'd send it to somebody and say, hey, what is this, what do you think of this, walk me through how you're interpreting it and what you're doing. By applying that really lightweight technique of doing things by hand, doing things the rough way before we had to put in the effort on software development, we dramatically sped up our ability to figure out the right thing, and then spent our development effort building the right thing instead of getting it wrong very slowly multiple times along the way. And third, self-service for developers. This one is pretty self-explanatory, so I'm not going to spend a lot of time on it, but really this is the continuation of that consumerization-of-IT trend. The expectations for user experience on the enterprise side are very different now than they were five or ten years ago, and the same is true for developers: developers should not have to put up with really clunky, terrible interfaces on their internal tools anymore. It's been bad for a long, long time, but things are finally starting to get better; things have gone through very ticket-driven approaches. My own experience at CWT was that we came in and did something called value stream mapping, which is a great technique for anybody who's interested in solving a lot of problems like this. We worked through a very specific workflow, and the one we picked was deploying a new application for the first time. We worked through every single team a request went to, every single team that had to touch it, and it ended up being something like 15 different teams involved, because there was a single silo team for everything you could imagine:
there was a network team, a security team, and a firewall team that wasn't the same as the security team, and the list just goes on and on in large companies like this. Every single one of them required a ticket. In some cases it was a ticket you had to file; in some cases one team filed a ticket to another team, and that team filed to a third team, and then somebody else would audit it and somebody else would review it, and finally it would work its way through. But imagine getting all of those to a place where you can clearly define the policy once, get agreement on it from all these teams, and then use that policy to automate all of your governance going forward. That's what we're talking about. Out of a 45-day timeline to deploy a new app, we took 30 days out along the way by making some simple process improvements and applying some automation. Now let's look at some solutions over the course of the next minute. What do you need from a solution? You need a job runner, pretty simple, because you've got to do stuff. You need a web GUI so you can click some buttons. You might want it to have an API or CLI, but those aren't necessities. You need access controls so that only the right developers can do the things you want them to do. And of course, it needs to be FLOSS. Now, there are a few different classes of these job runners: you might look at internal developer platforms, CI servers, workflow and data orchestration tools, or task schedulers. They're all good options when you're thinking about how to do platform engineering, and really the answer here is: use whatever you've got. Don't make this huge; start where you are. You can use GitOps, you can use Backstage, you can even use Jenkins, you can use workflow and data orchestration tools or task schedulers. So hopefully that's given you a sense, and I'd encourage you to refer back to
the slides later to see that list, because I went through it pretty quickly: what platform engineering is all about, what some of the different solutions are, and that you should start exactly where you are today, using the tools you have. Don't make this overcomplicated. Thank you.
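The policy-driven access model described in this talk can be sketched very simply. This is a hypothetical illustration, not anything from the talk's slides: a developer is assigned a role once (the "initial ticket"), and every later access check is evaluated against a policy table instead of filing a new ticket per resource.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical action and policy types; real systems would model far more.
#[derive(Hash, PartialEq, Eq, Clone, Debug)]
enum Action {
    CreateVm,
    ProvisionMemory,
    AccessServer,
}

struct Policy {
    // role -> set of actions that role may perform without a ticket
    grants: HashMap<String, HashSet<Action>>,
}

impl Policy {
    fn new() -> Self {
        Policy { grants: HashMap::new() }
    }
    // One-time role assignment replaces a ticket per request.
    fn grant(&mut self, role: &str, action: Action) {
        self.grants.entry(role.to_string()).or_default().insert(action);
    }
    fn is_allowed(&self, role: &str, action: &Action) -> bool {
        self.grants.get(role).map_or(false, |set| set.contains(action))
    }
}

fn main() {
    let mut policy = Policy::new();
    policy.grant("payments-developer", Action::CreateVm);
    policy.grant("payments-developer", Action::AccessServer);
    assert!(policy.is_allowed("payments-developer", &Action::CreateVm));
    assert!(!policy.is_allowed("payments-developer", &Action::ProvisionMemory));
    assert!(!policy.is_allowed("intern", &Action::CreateVm));
    println!("policy checks passed");
}
```

The point of the sketch is the shape: the expensive human step happens once per role, and everything after that is a cheap lookup.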
Taming the Beast: Managing High-Growth Postgres Databases at CircleCI
Hold on. Hello everyone. Sorry? No, I think people are just using the arrow keys. Sorry. Less high tech. Hello everyone. So our next speaker is Bryce Kenta, introducing Taming the Beast: managing high-growth Postgres databases at CircleCI. Thank you. Hi everyone. My name is Bryce Kenta, and welcome to my talk on Taming the Beast: the CircleCI journey to managing high-growth Postgres databases. First, who am I? I'm a staff engineer at CircleCI, where I've been working for the last three years. I have over eight years of engineering experience spanning the full stack, backend and frontend. At CircleCI, I've been focusing on backend architecture and reliability. Over a period of hypergrowth, reliability became a big problem at CircleCI, to the point where our CTO started posting a monthly blog post to keep our customers updated about the improvements. A key part of those improvements was dealing with large databases, which is what I'll be talking about today. I'm very enthusiastic about the developer experience and making it better, which is why I love my work at CircleCI. When I'm not in front of a computer, you can find me on the driving range, because Canada is very cold, and occasionally traveling the world with my wife. All right, let's get started. To give you a little bit of background about CircleCI: it's a global CI/CD platform with a wide range of customers. A bunch of open source projects build on CircleCI, such as React Native and Angular. Any time you see a .circleci folder in a repo, it's typically building on CircleCI, and the screenshot on the right is an example of a React Native workflow, which is currently just running some tests. This should be familiar to any of you who maintain CI/CD pipelines. Our platform runs about 4 million of these workflows per week and over 20 million jobs per week.
Each workflow that runs on our platform generates net new data to be stored: the workflow itself, the dependencies between the workflows, the workflow graph, the job states, test outputs, and things like that. To handle all of this traffic, our infrastructure runs over 150 services and 70-plus Postgres databases. However, some of these databases were growing very rapidly, particularly the ones that support the platform's engine. The growth of such databases was directly correlated with the number of workflows and jobs created per second. One example of a high-growth database that my team was responsible for had grown to 5 terabytes in size and was growing by 500 gigabytes per quarter. The write amplification on that database was a recurring cause of incidents. The nail in the coffin, though, was when we tried to upgrade that database from an end-of-life Postgres 9.5 RDS instance to a 12.5 instance. This took months to complete and incurred significant downtime because of incidents. The first attempt at migrating the RDS instance took a couple of hours and resulted in poorer query performance, because the large tables required lengthy vacuum operations post-upgrade, which led to massively degraded performance. We considered using AWS Database Migration Service, DMS, but it would have taken too long to complete given the database size, because DMS uses logical replication, whose cost depends on the number of rows and the amount of bytes you're transferring. We were finally able to do the version upgrade using a form of home-brewed logical replication, taking advantage of application-level knowledge of the database, but this required significant engineering effort, with engineers working weekends. So that wasn't great. At the end of all this, it was clear to the business that operating these large databases is very risky and could cause a company-ending event. So we needed to tame this growth.
So now I'll take you on the journey we took to taming this beast. First, I'll talk about the storage reduction, the immediate savings we gained by deleting some of the low-hanging fruit. Next, I'll talk about the growth restrictions we put in place to make sure that data growth remained at manageable levels. And lastly, I'll talk about some of the optimizations we made to ensure long-term success. The first thing we did to reduce storage was to drop unused columns, tables, and indexes. Indexes in particular can grow large over time, so dropping them was a quick win. We leveraged a tool called pganalyze to identify indexes with zero scans, meaning they were not used. Dropping those indexes not only benefits storage size, it also reduces write amplification, so writes to the database are actually faster. Next, we switched a bunch of B-tree indexes to BRIN indexes instead. BRIN indexes are designed for handling very large tables in which certain columns have a natural correlation with their physical location in the table. For example, if you have an orders table with a created_at column, earlier records physically show up earlier in the table; BRIN indexes are optimized for that kind of data. From the screenshot, you can see we had a bunch of created_at indexes across multiple tables, but the thing to note is the size of those indexes: they took over 400 gigabytes of storage in a single database. So dropping the ones that were unused, or switching to BRIN, saved space immediately. The next step to reduce storage further was to offload any static blob data to S3. S3 is much cheaper, and you can define object lifecycles to automatically delete the data. But migrating to S3 came with some drawbacks, such as additional latency, because we had to put a Redis cache in front of it.
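The offload pattern just described, blob data in a cheap object store with a cache in front to hide the extra latency, can be sketched roughly as a read-through cache. HashMaps stand in for Redis and S3 here; all names are hypothetical, not CircleCI's code.

```rust
use std::collections::HashMap;

struct BlobStore {
    cache: HashMap<String, Vec<u8>>,        // stand-in for Redis
    object_store: HashMap<String, Vec<u8>>, // stand-in for S3
    cache_hits: usize,
}

impl BlobStore {
    fn new() -> Self {
        BlobStore { cache: HashMap::new(), object_store: HashMap::new(), cache_hits: 0 }
    }

    fn put(&mut self, key: &str, blob: Vec<u8>) {
        // Writes go straight to the (cheap) object store.
        self.object_store.insert(key.to_string(), blob);
    }

    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(v) = self.cache.get(key) {
            self.cache_hits += 1; // hot path: served from cache
            return Some(v.clone());
        }
        // Cache miss: fetch from the object store and populate the cache.
        let v = self.object_store.get(key)?.clone();
        self.cache.insert(key.to_string(), v.clone());
        Some(v)
    }
}

fn main() {
    let mut store = BlobStore::new();
    store.put("job-42/test-output", b"...logs...".to_vec());
    assert_eq!(store.get("job-42/test-output"), Some(b"...logs...".to_vec()));
    assert_eq!(store.get("job-42/test-output"), Some(b"...logs...".to_vec()));
    assert_eq!(store.cache_hits, 1); // first read missed, second hit
    assert_eq!(store.get("missing"), None);
}
```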
And the other drawback was that it added more dependencies to our service, and the queries were no longer transactional, so we had to add code to stitch together the response from Postgres and S3, which added a bit of complexity. At this point, we had freed up some storage to give us some runway, but we hadn't addressed the growth, so let's talk about that next. The first thing we did to slow down the growth of our databases was to put data retention policies in place. Our product management team collaborated with other parts of the business to identify data retention periods. The retention period differs based on the customer plan: for example, a free customer gets three months of data, and higher-plan customers get up to two years. We communicated these policies to all of our customers ahead of time and gave them a quarter, so three months of leeway, before actually enforcing any restrictions. The next step after that was to implement data access restrictions, but at the API layer, before actually deleting any data. This meant customers no longer had access to data beyond their retention period, which enabled us to go to step three: safely deleting the data using background jobs, because customers no longer had access to it. I should point out that at this point we still have growth, mainly due to new customers or existing customers building more on the platform, but the growth is contained because we don't retain data older than two years. But we ran into some issues. The first issue was that deleting data from the primary database caused degraded performance on the replicas as the deletions were replicated. We experienced spikes in IOPS and CPU usage, and we needed to upsize the replicas. Another issue that we faced was index bloat.
Frequent background deletions without periodic maintenance of the indexes reduce the efficiency of those indexes over time, so a solution for regularly re-indexing the database was necessary to make deletions sustainable. This is something we're still figuring out; we haven't found a proper solution yet. Lastly, Postgres databases do not automatically reclaim disk space when a record is deleted. This is something we found out. There is a built-in vacuum operation to reclaim space, but this process only frees up space within the table for reuse. Once disk space is allocated for a table, it may never be released until that table is dropped. The vacuum operation has a FULL option, which builds a new table and swaps the old table for the new one, but it requires an exclusive lock, so this was not a viable solution for us because, again, it requires downtime. We were able to use pg_repack, an open-source Postgres extension that allowed us to reclaim space with minimal locking of the table. So that was great. And then the last step on our journey was to establish a long-term strategy. We needed a data archival process that could be applied to all of our high-growth databases. So we established a data reliability team with the mandate to own a single historical data store. The data store would support functional requirements such as high availability, being horizontally scalable, and supporting the multiple query patterns needed by the API or the UI to filter data. But this historical database is used to serve customer data only, nothing else: no ETL, nothing like that. Each service team would then implement a data archival process, similar to the diagram at the top: the service sends requests to the historical service to archive data. What data is archivable, and when? That depends on the particular service domain. There's a sweeper job that makes sure that any missed archivable data is archived.
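The background deletion jobs mentioned throughout this section typically run in bounded batches, so each batch replicates quickly and the replicas get a chance to keep up between batches. A sketch of that loop over an in-memory stand-in (hypothetical; the real jobs would issue bounded DELETE statements against Postgres):

```rust
/// Delete rows older than `cutoff` in batches of at most `batch_size`,
/// returning the number of batches issued. In production each batch would
/// be a bounded DELETE (e.g. "DELETE ... WHERE id IN (SELECT ... LIMIT n)"),
/// with a pause between batches so replicas can keep up.
fn retention_sweep(rows: &mut Vec<i64>, cutoff: i64, batch_size: usize) -> usize {
    let mut batches = 0;
    loop {
        let victims: Vec<usize> = rows
            .iter()
            .enumerate()
            .filter(|&(_, &t)| t < cutoff)
            .map(|(i, _)| i)
            .take(batch_size)
            .collect();
        if victims.is_empty() {
            return batches;
        }
        // Remove from the back so earlier indices stay valid.
        for &i in victims.iter().rev() {
            rows.remove(i);
        }
        batches += 1;
    }
}

fn main() {
    // Ten rows with creation "timestamps" 0..9; retain t >= 5.
    let mut rows: Vec<i64> = (0..10).collect();
    let batches = retention_sweep(&mut rows, 5, 2);
    assert_eq!(batches, 3); // five old rows deleted in batches of 2, 2, 1
    assert_eq!(rows, vec![5, 6, 7, 8, 9]);
}
```

Small batches trade total deletion time for bounded replication lag, which is exactly the IOPS/CPU spike trade-off described above.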
And then there's a deletion job that is continuously deleting archived data. Also, as product teams build new features that require net new tables to be created, we aim to partition them from the beginning. We use pg_partman, an open-source partition manager, to create time-based partitions. pg_partman lets us configure retention periods and will automatically delete any old partition: as soon as a partition falls out of the retention period, in our case 24 months, it is automatically deleted by pg_partman, so we don't have to worry about it. And finally, now that I've taken you on the full journey from reducing our storage size to establishing long-term data archival processes, I'd like to take a moment to acknowledge some of the key learnings, because an initiative of this magnitude spanned almost two years and was non-trivial for us. The first learning was to implement a data retention policy as early as possible, ideally one that allows you to serve more data at your discretion, because this means you don't have to implement the code to delete the data until you really need to. That would have saved us hours of engineering effort and downtime dealing with massive databases. The second learning: rehearse any major database maintenance, things like major version upgrades, space reclamation, re-indexing, anything like that. Make a copy of your production database, validate your changes there, and compare query performance against the production database before actually running that maintenance in production. And finally, write down your learnings. This creates a knowledge base for everyone to learn from and helps other teams move faster. The extensive documentation that my team put together over the last two years is what helped me a lot in putting together this presentation. And that is it from me. Thank you for listening. I hope this was helpful to you.
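As a footnote to the partitioning step described in this talk: pg_partman-style retention reduces to a simple rule, any partition wholly outside the retention window is dropped as a unit, which is O(1) per partition compared to deleting its rows one by one. A hypothetical sketch of that rule (not pg_partman's actual API):

```rust
/// Partitions are keyed by month (months since some epoch); any partition
/// at least `retention_months` old is dropped as a whole.
fn expired_partitions(partition_months: &[u32], now_month: u32, retention_months: u32) -> Vec<u32> {
    partition_months
        .iter()
        .copied()
        .filter(|&m| now_month.saturating_sub(m) >= retention_months)
        .collect()
}

fn main() {
    // Monthly partitions created over two and a half years.
    let partitions: Vec<u32> = (0..=30).collect();
    let expired = expired_partitions(&partitions, 30, 24);
    // With a 24-month window, months 0..=6 fall out and can each be
    // dropped in one DDL statement instead of millions of row deletes.
    assert_eq!(expired, (0..=6).collect::<Vec<u32>>());
}
```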
ε-serde / mem_dbg / sux / dsi-bitstream / webgraph: a Rust ecosystem for large graph processing
Hi everybody, we're just about to have our next talk, from Sebastiano Vigna, who will be talking about a Rust ecosystem for large graph processing. Sebastiano? Thank you. Okay. Okay. How many Rust programmers here? Well, some. How many Rust programmers who handle large data structures, like tens of gigabytes? A few. Okay. The first group is reasonably interested, the second group is more interested, and the rest of the people can sleep. I'm not offended. You can use the computer. It will be very, very boring. So, okay, let me introduce why. What I'm doing is just announcing a few crates we are distributing that do very specific things related to large-scale data analytics. The origin of this is a framework for graph compression that has been around for around 20 years, and that has been used by the community around the WWW, the WebConf, the largest academic conference on the web in general. For the last 20 years there have been many data sets distributed in this format that are used, in a lot of journals and so on. And in 2011 it was used to measure the degrees of separation on Facebook, if you remember it; maybe you're too young. It was quite a feat at that time because, I mean, it was almost 15 years ago and Facebook was already rather large. But we were able at that time to represent the entire Facebook graph in just 211 gigabytes, which made it possible to run some pretty nice algorithms to compute the distance distribution. Maybe in this community I should mention that I started to do free software in the late 80s, on the Amiga. Okay, so nobody remembers what that is, but I have some history with the free software movement as well. So at some point we decided to move to Rust, for the obvious reasons: it's a high-performance, safe language. Because, okay, all I described is in Java. It was written in Java, started at the end of the 90s, and at that time it seemed a very good idea. Okay.
Then things happened, like: Java arrays can have at most two billion elements, and if you have graphs with 50 billion elements, you cannot even index the nodes, which gets very, very annoying. And today, anything this size is done using memory mapping. I mean, if you go to Facebook, Google, whatever, all the large structures are in memory, but usually they're just memory-mapped, because you don't want the startup time. If you load into memory a graph that is half a terabyte, you wait minutes, whatever platform you are on. But if you can memory-map it, this time is amortized along the visit of the graph, for instance. And we actually need to represent very large graphs. If you've ever used Java's memory-mapping facilities, well, I will not say words, because they would not be proper in this particular situation. And there are really lazy iterators; if you've ever written an iterator in Java, you know what I mean. So, to do this, we needed to port a number of ideas from the Java library and to develop a few new things. The first thing is ε-serde; weird name. It's a framework for ε-copy serialization and deserialization. You might know what zero-copy serialization and deserialization is: it means that you serialize something and then you use the memory, exactly in the state it is, to represent the object internally. So there is no deserialization; you don't build a new object. The piece of memory is directly used as it is. And this is how things work, as I said, in all these organizations that have large indices: Facebook, Amazon, whatever you want. The index is on disk, it's memory-mapped as it is; it's not deserialized in any proper sense. There are a few frameworks, like Abomonation, that do this kind of thing in Rust, but they all have problems for us. The first one, the oldest one, by Frank McSherry, writes into the serialized object, so if you want to memory-map a file, that's out of the question.
The second one, you might know, is from the people that do the internationalization library. Nice idea, but it has a huge impact on performance: it does some kind of runtime resolution of the accesses to vectors. And then there is rkyv, which you might be familiar with, which also does some relative addressing of memory; there too, the structure you deserialize is completely different from the one you serialize, so you have to delegate all the methods, and each time you change one you have to change the other. Not very practical. So what we did was develop this framework, which requires a little bit of collaboration from the underlying struct. The basic idea is that you serialize something and then you ε-copy deserialize it: when you access it, you allocate a very small amount of memory, and the rest comes directly from the disk without any intervention. And the way we do it is that we remap vectors, essentially. You build a structure with a vector, but when you deserialize it, it has a reference to a slice. In this way we just have to allocate the actual struct you want to deserialize, but anything that is a pointer inside just points to the original memory. Hence ε-copy: it's not zero-copy, because we do a little bit of copying, an ε, a very small amount. But the advantage is that now you have exactly the structure that you serialized, with all its methods. The only thing you have to do, if you have vectors, is that there must be a type parameter, and you must write the access methods through AsRef to a slice. Of course, when writing, you write a vector, but when you read, you read it as a slice. This is the collaboration you need. But then it's completely transparent; you can do it with basic types. You store a vector, then you memory-map it, and that's it. And what you get back is a reference to a slice; more precisely, something that derefs to a slice.
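A minimal sketch of this storage-parameterized design, with hypothetical names (the real crate derives this machinery and handles the actual serialization and memory mapping): the struct is generic over its backing storage, so the full form owns a `Vec` and the ε-copy form borrows a slice that would point straight into a memory-mapped region.

```rust
// The "full" form owns Vec<usize>; the ε-copy form borrows &[usize].
struct BitIndex<B> {
    backing: B,
}

impl<B: AsRef<[usize]>> BitIndex<B> {
    // Methods are written once against AsRef<[usize]> and work for both.
    fn get(&self, i: usize) -> usize {
        self.backing.as_ref()[i]
    }
    fn len(&self) -> usize {
        self.backing.as_ref().len()
    }
}

fn main() {
    // Serialized/owned form.
    let owned: BitIndex<Vec<usize>> = BitIndex { backing: vec![3, 1, 4, 1, 5] };
    // ε-copy deserialized form: only the outer struct is allocated; the
    // slice stands in for bytes coming straight from a memory-mapped file.
    static MAPPED: [usize; 5] = [3, 1, 4, 1, 5];
    let borrowed: BitIndex<&[usize]> = BitIndex { backing: &MAPPED[..] };
    assert_eq!(owned.get(2), 4);
    assert_eq!(borrowed.get(2), 4);
    assert_eq!(owned.len(), borrowed.len());
}
```

The ε of copying is the outer struct itself; everything behind the slice reference stays where the mapping put it.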
And again, you work essentially transparently with respect to the framework. Unlike the other cases, since there is nothing intervening, no resolving of pointers, no dynamic resolution, everything is done at deserialization time: zero impact on performance. The performance is exactly that of the original structure. We use this to map massive immutable data structures, like representations of sequences of sets and so on, that are tens of gigabytes, 100 gigabytes on disk, directly into memory, without any load time. So if you handle large immutable data structures, that could be for you. mem_dbg is a very small crate, but it solves a problem we had. It's a high-performance memory occupancy detector, which sounds ridiculous when you say it because, well, all it does is measure the memory occupied. But it's not so easy. If you use the ones that are around: so here is a large vector and a few other things, and this is the amount of allocated memory; these are the three most common crates that do that, this is the amount of time they take, and this is the amount of time we take. The reason is that without some infrastructure similar to that of ε-serde, you have to iterate through collections to measure the space occupied, and if you iterate through a billion-element collection, it will take a lot of time. We routinely measure the space occupancy of things that are like 50 gigabytes; iterating would take eight minutes. So we developed this: if you need to measure the actual occupancy in memory of something large, not stack occupancy, the actual occupancy, try mem_dbg. Also, as a nice touch, it prints out the structure with the whole memory occupancy. That's important for us, because we build all the time these succinct data structures that have various components, and we need to know their relative sizes. But this is only if you have very large data structures; if they are small, you can iterate, no problem.
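The idea can be sketched as a size-reporting trait that is pure arithmetic for flat collections; hypothetical names, not mem_dbg's actual API:

```rust
trait MemSize {
    /// Total size in bytes: the value itself plus any heap it owns.
    fn mem_size(&self) -> usize;
}

impl MemSize for u64 {
    fn mem_size(&self) -> usize {
        std::mem::size_of::<u64>()
    }
}

// For flat (Copy) element types the answer is constant-time arithmetic:
// header + capacity * element size. No walk over a billion elements.
impl<T: Copy> MemSize for Vec<T> {
    fn mem_size(&self) -> usize {
        std::mem::size_of::<Self>() + self.capacity() * std::mem::size_of::<T>()
    }
}

fn main() {
    let v: Vec<u64> = vec![0; 1_000];
    // 8_000 bytes of elements plus the Vec header itself.
    assert_eq!(v.mem_size(), std::mem::size_of::<Vec<u64>>() + 8_000);
    assert_eq!(0u64.mem_size(), 8);
}
```

Nested, non-flat types are where iteration creeps back in, which is why the real crate needs ε-serde-like infrastructure to stay fast.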
sux is an ongoing project; yeah, it's an ongoing project, I won't say a finished crate, it's really kind of ongoing. It's a port of an existing C++ project and Java project about succinct data structures. You might know what they are; if you don't, no problem, you don't need this crate, but they're very fashionable now. There is at least one crate that does this, but we wanted to have something more sophisticated. So if you're interested in Elias-Fano representation of monotone sequences, ranking, selection, and so on, please have a look. This is really just coming into existence, but we'd like to have feedback. Then dsi-bitstream: very, very high-performance bit streams, with read and write by word, support for big- and little-endian files, and a lot of instantaneous codes: gamma, delta, Golomb, and so on. These are the kinds of codes you'd find in MPEG and so on, but we use them to do graph compression, and we spent a lot of time optimizing every single shift and move. We also give you scripts to just run, which massively test all the parameters you can configure on your architecture, so you can tune the speed of encoding and decoding specifically for your architecture: which word size to use to pick up stuff from memory, whether to use decoding tables or not, and so on. This comes from quite a long experience in doing this with WebGraph. So if you're interested in writing instantaneous codes for compression, you should have a look at dsi-bitstream. Just to tell you: a gamma code is read in less than two nanoseconds, so I think this is pretty nice. Okay, the last piece, which is probably the most specific, so you might be less interested in it: WebGraph. WebGraph is a framework to represent very large graphs in compressed form. Typically, snapshots of the web are represented in about one to two bits per link.
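For readers unfamiliar with the instantaneous codes mentioned above, here is a hand-rolled Elias gamma coder over a vector of booleans: just the coding rule, with none of dsi-bitstream's word-level reads or decoding tables. A value n ≥ 1 is written as floor(log2 n) zeros followed by the binary digits of n.

```rust
/// Elias gamma: n >= 1 is encoded as floor(log2 n) zeros, then n in binary.
fn gamma_encode(n: u64, bits: &mut Vec<bool>) {
    assert!(n >= 1);
    let len = 64 - n.leading_zeros(); // number of significant bits of n
    for _ in 0..len - 1 {
        bits.push(false); // unary length prefix
    }
    for i in (0..len).rev() {
        bits.push((n >> i) & 1 == 1); // binary payload, MSB first
    }
}

fn gamma_decode(bits: &[bool], pos: &mut usize) -> u64 {
    let mut zeros = 0;
    while !bits[*pos] {
        zeros += 1; // count the unary length prefix
        *pos += 1;
    }
    let mut n = 0u64;
    for _ in 0..=zeros {
        n = (n << 1) | bits[*pos] as u64; // read zeros + 1 binary digits
        *pos += 1;
    }
    n
}

fn main() {
    let mut bits = Vec::new();
    for n in 1..=100u64 {
        gamma_encode(n, &mut bits);
    }
    let mut pos = 0;
    for n in 1..=100u64 {
        assert_eq!(gamma_decode(&bits, &mut pos), n);
    }
    assert_eq!(pos, bits.len()); // every bit consumed exactly once
}
```

Small integers get short codewords, which is why these codes suit the gap sequences that graph compression produces; the engineering in dsi-bitstream is in making this decode in nanoseconds rather than via a bool at a time.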
The Software Heritage graph, which is a graph with about half a trillion edges, is three bits per link; Wikipedia costs 10 bits per link; it depends on the structure of the graph. But usually, in particular when the graph is redundant, you can represent the data in 10, 20, even 50 times less space than you would with a redundant version. It's a rough port of the Java version, and of course we use dsi-bitstream for the instantaneous codes and sux for the pointers into the bit stream. Just to give you a very simple example: the Software Heritage graph is 34 billion nodes and a little more than half a trillion arcs, and you can do a BFS visit, single-threaded, in three hours. It's very nice; you have to notice, half a trillion edges. The ergonomics of the whole thing is incredibly better than Java. Just having real iterators changes the game completely, because it's much more natural than what we had. All the others are crates that you can download and use and that are pretty stable; this one is still on GitHub, because it's a lot of code, a lot of optimization. We just merged into main the last big chunk of modifications, and the API should be stable by now. But this is very specialized. I mean, unless you have graphs with hundreds of billions, half a trillion arcs; for instance, some biologists built a huge data set with a trillion protein-protein similarity edges, and they did it with WebGraph, because if you have a trillion edges and you need to distribute and analyze it on standard hardware, not a massive supercomputer, you do it using compression. There is also support for labels on the edges that you can enumerate, and it's much better in the new version than in the old one. And one thing we had to fight a lot against is lenders. If you're not familiar with the lender idea: it's a general idea, and a number of crates for Rust. Lenders are iterators whose returned object depends on the iterator itself.
So iterators in Rust are things that give you values, and you can take the values and use them. But in all this kind of batch processing for graphs, you iterate on the graph and you cannot look at two nodes at the same time: there is a sequential iteration which goes through a file or a sorting of labels. So you need to be able to say, okay, this is the next batch of successors, use it, but I won't give you the next one until you're finished with this one. To do this, you essentially need to use generic associated types; well, not only that, we also use higher-rank trait bounds. But you need to impose that each call to next can be made only when the result of the previous one has gone out of scope, so you cannot do two calls to next in a row. And this is called a lender. There are a few crates that implement lenders now, which have, say, almost feature parity with Iterator, but the fact is that presently they work because of a bug in the borrow checker: the borrow checker doesn't check certain things that, if fixed, would make all these lender crates not work. And at that point, we would be in really deep shit, because we have no idea how to do this other than the way we're doing it. In fact, we're even in a situation where we have a chain of an iterator returning iterators, and the final value depends on state in the initial thing, so there is a propagation of lifetime bounds that goes through two different types. That gives me a headache each time I look at it. And in fact, I didn't even invent it: I asked on the Rust forum and said, I have this completely crazy situation, what can I do? And a very nice guy wrote a type like this with 25 different implied type bounds, and now it works. But let's hope it continues to work. This is just to say that we need a little bit more support for borrowing in Rust than there is now to make this work properly, because it has been a bit of a pain to get something like an iterator in which the return value depends on the iterating object.
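A minimal lending-iterator sketch using generic associated types (hypothetical names; real crates such as `lender` add much more): `next` returns an item that borrows from the lender itself, so the borrow checker forbids holding two successive batches at once, exactly the "finish this batch of successors before I give you the next one" contract.

```rust
trait Lender {
    type Item<'a> where Self: 'a;
    fn next(&mut self) -> Option<Self::Item<'_>>;
}

/// Yields (node, successors) for a toy 3-node graph where node i points
/// to i+1 and i+2, reusing one internal buffer between calls.
struct SuccessorLender {
    node: usize,
    buf: Vec<usize>,
}

impl Lender for SuccessorLender {
    type Item<'a> = (usize, &'a [usize]) where Self: 'a;

    fn next(&mut self) -> Option<Self::Item<'_>> {
        if self.node >= 3 {
            return None;
        }
        let node = self.node;
        self.node += 1;
        // Overwrite the shared buffer with this node's successors.
        self.buf.clear();
        self.buf.extend([node + 1, node + 2]);
        Some((node, &self.buf[..]))
    }
}

fn main() {
    let mut lender = SuccessorLender { node: 0, buf: Vec::new() };
    let mut total = 0;
    while let Some((node, succ)) = lender.next() {
        // `succ` must be consumed before the next call to `next`;
        // keeping it alive across calls is a borrow-check error.
        total += node + succ.iter().sum::<usize>();
    }
    assert_eq!(total, 18); // (0+3) + (1+5) + (2+7)
}
```

Because each item borrows `&mut self`, the compiler statically enforces what a Java-style iterator could only document.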
One last thing, if anybody knows how to get one thing done: IndexGet. Since 2015, it's been sitting in the Rust issues to have an index trait that gives you a value, not a reference, because Index gives you a reference. Now, Index giving you a reference is fine. But if you do compressed, succinct, any kind of implicit data structures, an index that gives you a reference is a pain in the ass, because you don't have the data: it's implicitly represented. You need a trait that, given two nice square brackets, will give you a value, not a reference. Then you can enter the world of modern implicit data structures. So if you know anybody who can implement this, or convince someone on the compiler team to get this done, please do it. I'm over. Thank you. Thank you.
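The by-value indexing the speaker asks for can be approximated in library code today, just without the square-bracket syntax (which would need compiler support). A hypothetical sketch:

```rust
trait IndexGet {
    type Output;
    /// Like std::ops::Index, but returns a value: there is nothing
    /// in memory to take a reference to.
    fn index_get(&self, i: usize) -> Self::Output;
}

/// A toy implicit structure: an arithmetic progression stored in two words.
/// Element i is computed on the fly, so Index (which must return &u64)
/// could never be implemented for it.
struct Arithmetic {
    start: u64,
    step: u64,
}

impl IndexGet for Arithmetic {
    type Output = u64;
    fn index_get(&self, i: usize) -> u64 {
        self.start + self.step * i as u64
    }
}

fn main() {
    let seq = Arithmetic { start: 10, step: 3 };
    assert_eq!(seq.index_get(0), 10);
    assert_eq!(seq.index_get(4), 22);
}
```

Compressed and succinct structures are in the same position as this toy: their elements exist only at decode time, so a reference-returning `Index` can never apply.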
Using elliptic curve cryptography for the purposes of identity
Hi everybody, the next talk is about to start. We'll have Yarmo Mackenbach talking about using elliptic curve cryptography for the purposes of online identity. Thank you. Shall I start the buzzer? Shall I? And we're off, apparently. Yeah. Alright, welcome. So I'm Yarmo. I work on this project called Keyoxide, which is about online identity, and we're going to talk about it in a minute. First, because of the previous talks, I wanted to set the scale: there will be no five-terabyte databases here or serialization of billions of nodes — it's just going to be a little script. It's a bit of a Bob Ross talk, I guess, where we go on a journey together, have fun, and discover things. And before I really start, we're going to try something experimental: a little interactive demo at the end. We're going to write the script, but you're going to verify that the script we write actually works. For this, whoever wants to participate should consider downloading the Keyoxide mobile app. It's available in these locations; you can just get the APK from the Codeberg repo. Alright, let's get started. So if someone makes a claim, how do we verify that? Well, quite simply, with a proof. What do I mean by that? For example, if Alice lost her luggage and Bob very conveniently found it, and Alice says it's hers, then Bob asks for proof, of course. Alice fiddles with the little dials and unlocks the luggage, and with that she has proven that the claim was indeed true — that it is indeed her luggage. So now we want to know: is this also possible over the internet? Can we do this over the internet? Well, yes, we can. We can claim things over the internet, but humans travel rather poorly through ethernet cables, so we need to find a way to connect Alice and Bob differently, so that Alice can make her claim and Bob can verify that claim, each in their own space and time.
And so for this, we're going to use cryptographic signatures. We could talk for a long time about cryptographic signatures; for the purpose of this talk, the important part is that they're basically just like a real signature, but digital. The big difference, I guess, is that a digital signature is really difficult to forge, so that's good. In short, we have a secret key, which we use to sign text documents, and a public key, which we use to verify those signatures. Combine those two keys and you have a key pair, and each key pair is identified by a unique fingerprint. All right. So let's work out this process. Let's say that I write this text document, which just says that this is my account on the Fediverse, on Mastodon. Now I sign it with a key which conveniently has a fingerprint that starts with very familiar letters. The signature itself is just zeros and ones; we're not going to worry about that. So now I give this text document, my claim, together with the signature to my friend, and my friend uses those two pieces of data. They first verify that the signature indeed corresponds to this text document, and once that is done, they go to my actual Fediverse account and read in the bio: oh, this person indeed wrote in their bio that they have this key. That is the proof with which I verify my claim that it is indeed my account. So now we're going to do that whole process. We're going to try to create an online identity with just 100 lines of Rust. I did need five dependencies. I tried to minimize it, but without these it would be a lot more than 100 lines of code. So yeah, these will be it. We're going to generate a key — this is where the elliptic curve part comes in.
Elliptic curves are a technique for creating cryptographic keys, and in this case we're using specifically the P-256 curve, but all this just to say: we're using these two lines of code to create an entire cryptographic key. This includes a public key and a secret key. Now, I said every key pair has a fingerprint, so that's what this code does. It looks a bit complicated — this is the most complicated part of the script. Basically, we get some parameters from the key and then we hash them, and that is how we get the actual fingerprint. Now we're going to collect the identity data. We're going to create what we call a profile. A profile is just a name, some other metadata about the person, and claims — multiple claims. I'm just going to continue with the same example as before and claim that that is my account on the internet. Now we need a way to encode all this data, because we need the text document and we need a signature. For this, we're going to use a JSON web token, which for the purposes of this talk is just a convenient way of combining a document and a signature. We'll need three parts: a header, a payload, and a signature. So let's make each of those. Oh yeah, some quick notes. Whenever you see ariadne.id, that is just the namespace that we use for the creation of the tokens. And sometimes you will see JWS instead of JWT; those are different, but for the purposes of this talk we'll just consider them the same. So let's create the header. The header is just a little bit of metadata about the key that is creating this profile. We'll set the fingerprint and we'll set the actual key — the public key, of course, not the secret key, because that one should stay secret. Then we'll create the payload. The payload is the actual profile itself.
So we're going to say the type of this token is a profile. In line 10 we say what the name is, and then the identity claims. Don't mind all the "set claims" calls on the payload — that's just confusing because JWT also uses the term claim in a different way. Now that we have the header and the payload, we're going to sign the two. That's what we do here. In line three, we get the key that we generated earlier, and in lines four and five we use it to sign the payload and the header. And with that, we are done: we have our profile. Now, if you would like to copy this, write it over by hand... yeah, that's not convenient. So we need a second step. I need to get this from my computer to your phone, to your device, whatever, so that you can verify for yourself that I do indeed have that account. So we need a way to transport documents, and preferably signed. You can guess where this is going: we're going to use another JSON web token. We're actually going to reuse the same header, because we're going to use the same key, so we'll just use the same metadata about the key. We're going to create a second payload, which will be very similar. This time, instead of being a profile, it will just be a request: we're asking the server to create this profile. Then in lines 14 and 15, we take the document that we created earlier and give it to the server. And this second, outer JSON web token — we're going to sign it first, so we have a similar string, a piece of data that we can actually send to the server. We're going to send it to what we call an ASPE server that we're working on, which is basically just a way of storing and exchanging these kinds of profiles.
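The overall token shape the talk describes — header, payload, and a signature computed over the first two, glued together with dots — can be sketched with standard-library stand-ins. A real JWS uses ECDSA over P-256 and base64url encoding; the toy keyed hash and hex encoding below exist only to illustrate the structure, not to provide any security:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for a real signature (which would be ECDSA/P-256):
// a keyed hash over the data, rendered as hex. Illustration only.
fn toy_sign(secret: &str, data: &str) -> String {
    let mut h = DefaultHasher::new();
    secret.hash(&mut h);
    data.hash(&mut h);
    format!("{:016x}", h.finish())
}

// Build a JWS-shaped token: "header.payload.signature", where the
// signature covers the "header.payload" signing input.
fn make_token(secret: &str, header: &str, payload: &str) -> String {
    let signing_input = format!("{}.{}", header, payload);
    let sig = toy_sign(secret, &signing_input);
    format!("{}.{}", signing_input, sig)
}

// Verification recomputes the signature over everything before the
// last dot and compares.
fn verify_token(secret: &str, token: &str) -> bool {
    match token.rsplit_once('.') {
        Some((signing_input, sig)) => toy_sign(secret, signing_input) == sig,
        None => false,
    }
}
```

The "second, outer token" from the talk is then just this same construction applied again, with the first token carried inside the new payload.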
And yeah, that is basically it. Those were the lines of code you need to make an entire profile, make a claim, and make it so that people can verify it for themselves with their own devices, with their own methods. It is a fun script — you can try it at home. Or, as I said, we could try it live on stage. That is what we're going to do right now. So I did prepare it somewhere. You'll see that apart from some cosmetic changes, if it loads... yeah, that's the big risk of doing this on stage. We'll give it a second. Apart from some cosmetic changes, it is largely the same script, and you'll see that it fits neatly within 100 lines. And it might not load. We'll give it another second. And if it... alright, well, maybe it won't do it. It would have been phenomenal, I can promise you. Alright, I'll reload it once and then... I do have a sort of backup. Alright, it's not playing ball. Alright, so let's go back to the presentation. I think it's this one. I don't... wait. I have lost the presentation. That's a different presentation. What? That was not supposed to happen. Yeah, I don't know what's happening. But basically, we would have run the script and created a profile, and then it would have presented you with a QR code that you could have scanned with your phone. And it would actually have worked, and you could have seen that the script created a profile that we built here on stage. Yeah, so just with a couple of lines of code, we can work with cryptography, we can work with identity. And, yeah, thank you very much. Thank you.
Timestamping with opentimestamps
Alright folks, we're just ready to start our last talk, which will be timestamping with OpenTimestamps by Timothy Riddia-Eli. Okay, thank you. So I'm a Red Hat employee who works as a software engineer, but not on this stuff. So what is timestamping? Timestamping is needed to be sure a document or a file was made on a specific date. For example, in Italy — because I'm Italian — the law requires that the date be attested by a public officer, so you can't do it by yourself. So what about digital documents? Usually digital timestamping is done in a third-party data center, so you must trust some other authority, usually a certification authority. So how can we do it without relying on a third-party authority? We could use the blockchain: you create the hash of a file or some information, and you put this hash inside the blockchain, so you can demonstrate this hash was present at a specific time. Why the blockchain? It's safe, because it's backed by millions and millions of dollars. It's open, in the case of Bitcoin, which we use. It's not cheap to create a new bitcoin, because mining is an expensive process, but it's quite cheap to use the chain. So why OpenTimestamps? The blockchain is open, anybody can write on it, and anybody could do the same thing directly without using OpenTimestamps or another framework. OpenTimestamps is a standard way of doing the same thing in a trustless way — trusting no one. It was proposed by Peter Todd, a Bitcoin Core developer. It's used by dozens of different companies, and — because in information technology we can't have infinite storage — it's almost infinitely scalable, because it uses a Merkle tree. So what is a Merkle tree?
A Merkle tree is a tree where you put only the top hash, the root of the tree, inside the blockchain, but you can demonstrate that your file or your information existed without needing to push every hash into the blockchain — only the root. So OpenTimestamps provides users multiple easy ways to create and independently verify timestamps. The OpenTimestamps project on GitHub includes several different implementations. The first one was written in Python. Then somebody wrote one in Java, then in JavaScript, because that's easier to use in a browser or in Node.js. They also started to write a Rust implementation, because Rust, as mentioned in a previous talk, is a good language: it's safe, it's fast, it has low memory usage, etc. There is also the opentimestamps.org website, which uses the JavaScript implementation. Now, in this slide I show an example of usage with the Python client, because it was the first one. If you want to use it, you just need the OTS stamp command. The stamp command creates the Merkle tree of the file and submits it to some remote servers — the calendar servers — which write the information on the Bitcoin blockchain every so often. So when you do stamp, the operation creates the hash of the original file, concatenates it with a random nonce for privacy — just to avoid having your file's hash on the Merkle tree directly — and recalculates the hash. So you have a double SHA hash, and it sends the value to the calendar server. The calendar server adds the hash to its Merkle tree and returns the response to the client in order to generate the OTS file, which is the file you will need to verify the timestamp later. Of course, this file is initially incomplete, because it doesn't yet contain the record in the blockchain: you need to wait for the calendar server to send the record to the blockchain, for the Bitcoin network to mine the block with the Merkle root, etc.
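The aggregation the calendar servers perform can be sketched as follows. A toy hash stands in for the SHA-256 that OpenTimestamps actually uses, and the odd-leaf handling here is one arbitrary convention; the point is only that any number of file hashes collapse into a single root, which is the one value that needs a blockchain transaction:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy hash combining two child hashes into a parent; a real Merkle
// tree would use SHA-256 here.
fn h(left: u64, right: u64) -> u64 {
    let mut s = DefaultHasher::new();
    left.hash(&mut s);
    right.hash(&mut s);
    s.finish()
}

// Fold a list of leaf hashes into a single Merkle root. However many
// files there are, only this one value goes into the blockchain.
fn merkle_root(mut level: Vec<u64>) -> u64 {
    assert!(!level.is_empty());
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| match pair {
                [a, b] => h(*a, *b),
                [a] => *a, // odd leaf carried up unchanged (one convention)
                _ => unreachable!(),
            })
            .collect();
    }
    level[0]
}
```

The proof for one leaf is just its sibling hashes up the tree — O(log n) values — which is why a single file's presence can be demonstrated without publishing any of the other hashes.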
So after some time has elapsed — some hours — the user reruns the OTS tool with the upgrade operation, and this updates the file with which block of the Bitcoin blockchain includes the hash. It's also possible to create a timestamp for several different files simultaneously. In fact, we did a test where we got the hashes of all the files included in archive.org — not web.archive.org, but archive.org, which holds petabytes of files. Of course we didn't download all the files, but the archive.org API lets you ask for the hashes directly. So we took all the hashes from archive.org, and we were able to put all these millions of files inside only one Merkle root. So it's absolutely scalable, because you can put tons of files under only one Bitcoin transaction — and you don't even need to make that transaction yourself, the calendar server does it. So it's absolutely free. The verification requires both the OTS file and the original file, or the original hash. And if you want to do the verification by yourself, without trusting anybody — which is what you want — you need an up-to-date Bitcoin node. You don't need a full node, since the attestation is in the block header, so a pruned node is enough: you need only a few gigabytes of data instead of the almost one terabyte of a full node. If you do that, you are sure nobody can fake your check, because OTS asks the blockchain directly, so you don't need to trust anybody, including the calendar servers that put your attestation on the blockchain. The OTS file includes three main sections: the hash with the nonce; the Merkle tree construction, because you need to know which other hashes are in the tree in order to be sure your file is in it, up to the root; and which Bitcoin block includes your hash. The timestamp is saved in a binary file to save space and to avoid interpretation problems, especially on Windows. The file has the .ots extension, and it starts with this line.
So if you use the OTS info command on the file, it prints lots of information — I can't show it all here, because it shows every single Merkle hash. But you can try it at home and see how the Merkle tree is constructed. Here are some examples of OpenTimestamps usage: the website I presented at the start; ProofMode, an Android app by the Guardian Project that uses it to certify that a photo is valid, with GPS data etc.; an example of how you can timestamp a newspaper article; and a website that lets you put a stamp on a tweet. The end.
Compiler Options Hardening for C and C++
Okay, hello, good morning here at the lightning talks at FOSDEM in Brussels. I want to introduce Thomas Nyman, senior security technology specialist from Ericsson, who will give us an introduction to the compiler options hardening guide for C and C++. Give him a warm welcome. Thank you very much. Start. Thank you very much. I work for the network platform and telecommunications company Ericsson, but today I'm here to talk about the compiler options hardening guide for C and C++. I'm also in the Open Source Security Foundation, as the sub-initiative lead for the compiler hardening best practices initiative that has produced this guide, and we had an initial release in November last year. I hope many of you have heard about the Open Source Security Foundation, but for those who might not have: this is a community of software developers and security specialists who are working towards improving the security of the open-source ecosystem. This means both individual open-source software projects as well as efforts to develop best practices and collaboration around security in open-source software. The background for the work I'm talking about today is the C and C++ hardening challenge. We all know that C and C++ are consistently the preferred languages for systems programming, embedded systems, and performance-critical applications. But C and C++ are also memory-unsafe, and that means they are susceptible to certain classes of programming defects that affect the memory integrity of software written in them. In unfortunate cases, these defects can lead to software vulnerabilities that can be used by malicious actors to exploit the software in different ways. Addressing these types of vulnerabilities in C and C++ at a large scale presents several significant challenges. There are memory-safe alternatives to these languages, but there is also a lot of C and C++ code in the world today.
Rewriting all of this existing code in memory-safe languages is prohibitively expensive, both in monetary terms and from an opportunity-cost point of view. The alternatives often have unsafe dependencies, and these unsafe dependencies slow down the migration to memory-safe alternatives. One example of this is Rust, which is a very promising language and provides memory-safety guarantees. But if you look at the dependencies — there are references here to recent surveys — the conclusion was that over 70% of Rust crates, the Rust packages in the official package repository, have some form of dependency on either C or C++. This is not just a technological problem; it is also something that is gathering regulatory attention. In the US, something that has been very influential in shifting attitudes towards software security was the presidential executive order on improving US cybersecurity in May 2021. In the EU, we've had the Cyber Resilience Act, which has been heavily discussed among the open-source communities as well. And specifically related to memory safety, we've seen a lot of guidance in the past two years from cybersecurity authorities, including the US NSA and CISA, who have issued joint publications with other national cybersecurity authorities, the most recent being the December 2023 document on memory-safe roadmaps, where they urge organizations to make explicit plans for how to shift away from memory-unsafe code. So what we are doing in this initiative is providing a guide for compiler options hardening, and currently this is geared specifically towards C and C++ code. The idea is that we have a guide that helps developers and packagers of software configure programming tools during development to reduce the attack surface of the produced software. You can think of this as something quite close to what is sometimes called product hardening.
A hardening document usually provides guidance to the parties deploying software on configuring the operational parameters that help deploy it securely in its operating environment; we are focusing on the corresponding parameters during development, which help everyone who later deploys the software. Modern C and C++ compilers provide many optional features that help improve the security of the produced software, but these must be explicitly enabled when compiling for the software to actually benefit from them. If you are consuming software from the major Linux distributions, then these features are usually enabled by default. But if you are consuming open-source software from source, then you are responsible for making sure that these protections are enabled correctly when the software is built. And of course, these features also come with various challenges. I will not go into detail here, but these challenges can sometimes make it difficult to deploy the options correctly, and we hope that this guide will help practitioners with some of them. The current situation, according to some academic surveys, is that things are actually much better on the desktop side, but embedded systems especially often ship without these protections enabled. Here is a publication from the Network and Distributed System Security Symposium from 2022, which shows that there is a radical difference in the deployment of specific hardening mechanisms between desktop and embedded systems. And of course, compiler options hardening is not a silver bullet, right?
It is something that is necessary in combination with the adoption of memory-safe languages, secure coding standards, and security testing, but we hope it is one way of addressing the C and C++ hardening challenge. If you look at the guide, you will find that it is divided into four main parts. We have a large section on recommended compiler options, currently covering a wide range of features in GCC and Clang/LLVM. This includes both flags that warn developers about software defects related to security vulnerabilities, and flags that add instrumentation to the binaries that helps them be resilient at runtime against attacks trying to exploit possibly residual defects in the software. We also have a section on discouraged compiler options. These are compiler options that have some specific purpose, but if used inappropriately, they may impact the security posture of the produced software in one way or another. We also have a section on sanitizers. These are compiler-based tools designed to be used during development and testing to pinpoint memory safety issues and other defects, and they provide a lot of valuable information for debugging and testing, but they often have more runtime or memory overhead, which makes their deployment in production software more difficult. They are very valuable during the development and testing phases, though.
Then we have some information on managing debug data. This is something that can help make produced binaries more resistant to reverse engineering, where threat actors analyze binaries specifically for ways to exploit them. Of course, in practice the decompiler tools used for this purpose can work without debugging information, so the security of the system should not depend on omitting this information, but there is some guidance in this respect. As I mentioned, we had the initial release of the guide in November 2023, and we have a lot of activity on the OpenSSF Best Practices Working Group GitHub pages, where the development happens. For this year, we are planning on documenting new features in upcoming versions of GCC and Clang — this is actually an area where the compiler communities are very active in providing new, security-relevant features, and we hope that this guide will eventually cover all of them as well. We also have plans with partners to introduce information on more compilers, so we hope that this will also be possible during this year. Another effort is a separate guide on using compiler annotations in GCC and Clang; there is some work-in-progress material up on GitHub if you are interested in that. And of course, everything is open source, and we welcome contributions also from people who are not necessarily security experts — we've had very valuable contributions improving the readability and presentation of the guide. So if you think something could be improved, I urge you to open an issue or a pull request against this material. Other development happens on the OpenSSF Best Practices Working Group GitHub repository, and we have calls every other week on Zoom to discuss any open PRs and ongoing developments around the guide; these are also public.
And this slide has some more links on how you can participate in the work that OpenSSF does, if you're interested; the slides are available on the talk page on the FOSDEM site if you want to access the links. And lastly, I'll leave this slide open, so if you want to access the guide itself, you can do so at the URL here or by scanning the QR code. That's it from my side, and I want to thank the FOSDEM organizers for giving me the opportunity to present this work here. Thank you very much. Thank you.
A Lazy Developer’s Approach to Building Real-Time Web Applications
Markus Röntschler is next, and his talk is about a lazy developer's approach to building real-time web applications. Give him a warm welcome and applause. Thank you, the stage is yours. So good morning. Today I want to tell you about my two hobbies. First, I'm a musician — I play the bass guitar — and my other hobby is being a cloud solution architect. That's my money-making hobby. The project I want to talk about today gave me the opportunity to combine these two hobbies into one little project, and I want to share the learnings from it with you. Okay, so the challenge. In the picture above you see Ralph. He is a friend of mine, and he plays songs and people sing along. But we had one problem when we played somewhere at venues from 100 to 1,000 people: songbooks don't scale. We had songbooks, but they got damaged, the venues were dark, people couldn't read, songbooks even got stolen. One fact that was beneficial for the project: we have music-stand software on our tablets. They are networked to each other, and this music-stand software has an API. Terrible software, proprietary stuff — I don't want to talk about that software today, but about what we made from it, so that you are able to use it in your own projects. How to get lyrics to the people with minimum effort: that was the task we had to solve. And so that it doesn't get boring, I want to start with a demo so that you see the result, and later on I will show you how we accomplished it. So please use your mobile phones — or your computers, both will work. If you call this web page, you will see — let me show it here, this side, the other side, there it is — you will see exactly the page which I have loaded here. It's waiting for lyrics. So the communication is established, and when I now start sending the lyrics, imagine somebody on the stage changing to the next song in the music-stand software. Okay, let's do it. I had an AI write a few songs about open source and the like.
So if my talk is boring, just look at the songs — and I'm also open to collaborations on setting them to music. Okay, let's go on. What do we see here? The songs get updated, like I just said, and we even have confetti when there's a new song. So the title of the talk was the lazy developer. Why does being lazy matter? If you are too eager, it can happen that you think up a structure for how to implement something and then stick to it, without the gut feeling that it's too complicated and has to be easier — you create code duplicates and the like. If you're too lazy, on the other hand, you get nothing done. So you have to find the sweet spot: being lazy enough and being eager enough to get things done. And because I did this in my spare time, that was the only approach that could work. I had to have something easy which allowed me to get the job done, but which also scales to a venue of a thousand people — a thousand people requesting this resource at the same time. And here's my technical approach. We have the people who want to sing along. We have the musician with the music stand with its — let me call it REST-ish — API; what I saw wasn't so good. A small VM at a cloud provider; everything should start with something like this, with a host name which I set on it. After installing, I used Caddy as a static web server for the static pages. Great project — it makes it easy to host things with sane default TLS configuration without any effort. It was just spinning up the container, and it immediately served the web pages like I wanted to have them. So now we need a component which does the heavy lifting, which transports the data to the devices of the people. There are many solutions around, and since I'm working as a cloud solution architect with Kubernetes, you always look at the CNCF landscape. And both in my company and for this project, I chose NATS as the solution.
We use it for microservice interaction in our projects, but NATS also has a WebSockets interface, which makes it possible that the people who get the web page through the static web server in the browser — the JavaScript part — connect through WebSockets to the NATS server. The musician then needs a computer which polls the API, and as soon as there is a new song, the lyrics get sent to the message broker. When we talk about message brokers, there are a few patterns for how messages are distributed; we have a classical fan-out pattern here. The message comes in, and the message broker distributes it among all of the subscribed devices. And what's really nice about the approach is that it's just a few lines of code in the end. So let me show you. We have the project here; you will get access to the GitLab repository at the end of the talk, and it's also linked online. Okay, so we have the NATS server. For the ports: 8443 is the WebSockets port over TLS, and 4222 is the NATS native port, mapped to the outside. Then we tell NATS the host name for the TLS mapping, and since Caddy takes care of the certificates, we map the certificates from the Caddy directory as a read-only file mount, which we can access in this Docker Compose setup I have here. And Caddy — the easiest thing, just the regular web server ports mapped to the outside. Caddy took care of getting the Let's Encrypt certificates automatically. I only had to set the HSTS headers, and I had an A+ on Qualys' SSL check — something I always wanted to try. Okay, and look at the application itself. This div does everything, so there's more meta text than real payload on the page. There's the div with the id "lyrics", and as soon as something new comes in, its content is replaced. The JavaScript part is also something very simple — you see, that's everything I did as a lazy developer. The communications magic is here.
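The fan-out pattern described above, stripped of NATS and WebSockets, can be sketched with standard-library channels. This is only an illustration of the message-flow shape, not the actual setup from the talk:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Minimal sketch of the fan-out pattern the broker implements: one
// publisher, and every subscriber gets its own copy of each message.
// NATS does this over (web)sockets; std channels stand in here.
struct Broker {
    subscribers: Vec<Sender<String>>,
}

impl Broker {
    fn new() -> Self {
        Broker { subscribers: Vec::new() }
    }

    // A new device opening the page corresponds to one subscribe call.
    fn subscribe(&mut self) -> Receiver<String> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    // The musician's computer publishing new lyrics: the broker clones
    // the message out to every subscribed device.
    fn publish(&self, lyrics: &str) {
        for sub in &self.subscribers {
            // Ignore devices that have disconnected (dropped receivers).
            let _ = sub.send(lyrics.to_string());
        }
    }
}
```

The point of delegating this to a broker is exactly the "lazy" lesson of the talk: the broker owns the subscriber list and the delivery loop, so the application only needs a publish call on one side and a subscription callback on the other.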
We include the NATS WebSockets library, and then we connect to the NATS server. We subscribe to the subject "lyrics", and as soon as something drops in — as we receive new lyrics — we hand them over to the handle-lyrics function, which formats the first line in bold and shows the rest just like we received it from the NATS server. If you want to have a look at the NATS configuration, it's also not much. I have defined two permission sets. One default permission set, so any user of the system has that set of permissions, which is just subscribing to "lyrics"; and the lyrics-publish security profile for the authenticated publisher. I defined the user with the hashed password, assigned the permission, and down below you can see the WebSocket definition, where I also use the TLS certificate Caddy gets me from Let's Encrypt. In the next line I assign anonymous access to the user, so that it works without login. Okay, that's it in a nutshell. If you are interested in the topic of message brokers, I can highly recommend the book Enterprise Integration Patterns. It's a book from 2003 — yes, I'm showing an IT book from 2003, but the principles are still the same. Of course, there are a few new ones; if you go to the website, they also have new patterns listed, but I wish I had known the book 20 years earlier. Now I have it on my desk. Check out my repo, check out NATS, check out Caddy server. And you don't have to use NATS for this: you can use an MQTT server — I did the same example with EMQX — and RabbitMQ should work too. Redis also has a WebSockets integration, so you could also use that. My example was with NATS. Speaking of NATS, I asked the folks of the project if they could send me some stickers. On the corner of the desk, as you leave the room, you can grab a sticker. After all, we go to conferences for the stickers, don't we? That way. Okay. That's about it. What did we learn? Let others do the heavy lifting, though.
Just be lazy enough to find the right ways to get things done, and concentrate on the things that really matter. Reach out to me if you have questions; I will be around. Don't forget the stickers. Have a great FOSDEM Sunday and a safe trip home. I'm Markus Röntschler. Thank you. Okay. Thank you for your talk.
Introduction to BlissLabs and Bliss OS
Okay, let's go to the next talk right now: it's John West, and the talk is an introduction to Bliss Labs and Bliss OS. Thank you John, the stage is yours. Thank you. I represent Bliss Labs. Thank you very much. I represent Bliss Labs and Bliss OS. So what Bliss Labs is, is we're a volunteer-based 501(c)(3) non-profit organization that helps open source projects thrive — mostly Android-based, but we also do Linux projects as well. Our goal is to create and maintain various open source operating systems and other software that helps extend the life of devices in order to help with the world's e-waste problems. Bliss Labs also maintains Bliss OS and other open source software. We're not just a bunch of projects; we're a very mature open source org with a proper organizational structure. We work to mentor and teach future open source developers in all aspects. We form alliances with other projects that share in our visions. We develop tools to help minimize the complex development process of Android and Linux and aid in learning. And we also act as a global fiscal host for open source projects that are not able to monetize in their current region. This allows for a much larger opportunity to work with others globally and increase their user base. Our region is global, so we have members of Bliss Labs all over the world. The age group ranges from 13 to 60: students, professionals, retirees, the whole nine yards. Women, men — we're LGBT-friendly — and differently abled people in key positions like CFO, CPO, CTO. For education, we don't have any requirements, really. You can be a middle school student or you could have a PhD and be a professor. Our estimated download count from the year we started in 2014 is now up to over 6 million. That's across the entire suite of Bliss Labs projects. And speaking of the projects, how many of you recognize any of these logos? Anybody? Okay. To go over some of these, we have BlissOS, which is a unified Android experience irrespective of hardware.
We use Android-x86 as the base OS; it works on Intel and AMD x86-64-v2 devices and greater. We also run Waydroid, Android integration with Linux — it's a lightweight containerized base project. We run the Android-Linux hybrid, which is a cross of Linux and Android running on bare-metal hardware. We maintain SmartDock, which is a desktop UX for Android. We maintain XtMapper, which is an on-screen keymapper for Android as well. We maintain a community called Supreme Gamers, which is centered on Android-x86 development. We also maintain BoringDroid, which is a complete open source desktop UI solution for AOSP. We're adding Bliss Base to the mix now, which is a production-ready example of Android on x86 based on Bliss, but geared towards, we'll say, commercial applications. Then we maintain Android Generic Project, which is an easy button for Android-x86 and BlissOS builds. We also maintain BlissROM, which is Android for Android devices. Our matured process model includes building, testing, releasing, and documentation for both devs and users, and then post-release support. Popular open source projects have links with our projects: BlendOS is one — they use our Waydroid project. Ubuntu Web also uses our Waydroid project, and Ubuntu Touch as well. PrimeOS uses Android Generic Project to generate their images. Then Android-x86 — we fully support that project by supplying team resources, build servers, development, et cetera. What we do to support sub-projects is we attend conferences like this. We supply hardware for testing, hardware for development, hardware for build and web servers, and communication servers like Slack, Discord, GC, Telegram, Matrix, et cetera. Then we provide software services like storage (SourceForge, GDrive) or development services (GitLab, GitHub, access to Jira, Confluence, Trello), and then servers for updates, OTA, GitHub, and CI/CD.
Which brings me to the next part of what we're doing here: we're introducing BlissOS, which is Android on x86 hardware — it's Android for your PC. The funny thing is, my Linux PC took a crap traveling out here. I actually had to put this whole thing together on Android-x86, so on BlissOS. That's what we're running this whole presentation from today. It's based on mainline Linux, using Android-x86 patches on top of the kernel in order to provide support for the Android subsystem. We have features like a desktop UI, changes for x86 hardware, custom HALs, et cetera. Generic builds, so one ISO runs everything. We have tons of customization options included. It's a low-resource base subsystem, so it's a very low-overhead system to run Android on x86 hardware — a lot like an edge Linux would be. Then we have added Linux tools integrated with Android, like Termux and networking tools. We bring in drm-hwcomposer, gralloc, and Mesa, all from mainline Linux. Some of the diverse use cases of BlissOS and the like are kiosks, mobile devices, PCs, gaming devices like the Steam Deck, automotive displays, POS and customer-facing displays, menu displays, ad displays, TV and large-screen applications, IoT and industrial IoT, industrial displays, gaming and companion displays. BlissOS is open source. It's based on Apache 2.0, GPLv2 and GPLv3. If you're using BlissOS as-is, it's free and open source to use anywhere. If you're using BlissOS with a modified source, it still counts as open source as long as you release the source code. It's open for anybody to use in their project — coming to Bliss Labs is not a requirement as long as you release that source code. Then there's a small per-device or perpetual licensing cost for those that are using a modified source and do not wish to release that source code. Some of our milestone achievements as of late: EIDU — we've gotten into a pilot program for Kenya using their low-end hardware with our operating system.
Companies have been using it in products lately. We've been shortlisted by SourceForge multiple times as a featured project. We've been adding more and more devices every year. You can find demo videos on our website as of this week. You can also download an ISO for Android 11, 12, 13, and, as of today, Android 14, which is our announcement today. We will be making BlissOS version 17 available within the next couple of days for everybody to test and download. The source is already on GitHub. Initial features are ported from Android 13 to 14, including the desktop UI, dual ethernet, multi-display, et cetera. What you can expect from us in the near future: more device groups — we'll be supporting Raspberry Pi, RISC-V, et cetera — and more leaning towards the Linux hybrid side of things, where we'll have a larger Linux subsystem running Android on top. We'll be more independent and a complete FOSS ecosystem, so we'll be providing tools for companies and individuals to create their own custom app stores, pretty much. Then we're going to be supplying some Edge IoT and IIoT examples as well in the code. Our process — we're going to keep documenting even for newbies, so we're going to continue on that, dumbing it down for everybody. And automated installation support: we're working on a couple of new installers, graphical installers as well as text-mode installers, for Linux. And then we're going to fine-tune our AI-enabled support bots that we use to help answer users' questions so we don't have to. Community engagement includes contests, Easter eggs we put into the operating system, goodies — we often do giveaways: stickers, t-shirts, points, et cetera — and then Blissify videos in the link on our website. Opportunities available through Bliss Labs and BlissOS are internships, mentorships, and open roles, paid soon, in web development, server maintenance, project management, HR, finance, and development.
Contributors can get mileage for the next move in their career, or can be absorbed by BlissOS and Bliss Labs for commercial opportunities. We have a very easy, streamlined way of joining — we cut out most of the crap, no egos — and we are healthy and drama-free. We are a full democracy as a team, so we have a flat structure: nobody's head of anything, nobody's an overseer, nobody's the god of Bliss. And that brings us to where you can find us online. If you scan the QR code, that'll take you to our Linktree, which has all the links available for you to contact us and move forward in the future. If you have any questions, I will be available outside the room, and I have my device and a couple of other screens so I can demo if you're interested. Thank you very much.
Introducing the Open Podcast API
So, let's start the next talk: introducing the Open Podcast API, with Keunes and Ciarán Ainsworth — sorry for the pronunciation. So the stage is yours. Thank you. All right. Thank you. Thank you, everyone. So, yeah, we are here to present the — thank you — we're here to present the Open Podcast API, which is a new specification that allows users to synchronize their podcast listening data: your subscriptions, the episodes you started, where you want to continue, et cetera. So why are we doing this new API? It's because we have a problem. The problem is that there is a de facto standard for synchronizing podcast data between devices, but it has a couple of challenges, let's say. One of them being that it's no longer actively maintained at the moment. There is a draft for a new version of the API, which is good, but it has been still for a while. So that's one issue. But maybe a bigger problem is that there are some fundamental technical issues in the API and the way it's designed. One of them is about episode identification, which is basically based on the media URL in the RSS feed. And that thing is not always unique — RSS is a standard, but it's a Wild West at the same time. So we can't really rely too much on that. And also, the software behind this standard has some issues with feed duplication, which occurs if a podcast changes their RSS feed — they change their URL — and then you get the same podcast twice in your subscriptions list. We didn't say it yet, but I'm with AntennaPod and Ciarán is with Funkwhale. And in AntennaPod, what we see is that the service, the software behind the de facto standard API, is actually used as a centralized service. So there's a lot of users, which is great, but it's also a strain on the servers.
And so that overload is actually causing end users in AntennaPod to see errors, and then they come and complain to us, and we're like, well, yeah, we don't have too much influence over that. So the solution to this set of problems is to build a new API standard, which builds on the existing standard, but is more extensible, more standards-compliant, and easier to implement across different projects, so that we avoid the centralization aspect. For users, that means they can synchronize their subscriptions, listening progress, favorites, queues, et cetera. The idea is that they can connect all their different devices — so whether you're on your desktop or mobile, or if you have a work mobile and a private mobile, all your listening progress moves from one to the other. And this integration with the different apps would also allow you as a user to actually switch from AntennaPod to, say, Kasts, if you don't like AntennaPod for some weird reason. So that's the end-user side, but we need developers to implement that API, of course. To make that as easy as possible, we want to have clear and comprehensive documentation about the features, but also about the behavior: if I send this API call, then what is expected to happen? We want the specs to be reliable and easy to implement. And we also want them to be feature-complete, because different podcasting apps and servers and services all have different features. Some might have multiple queues that you can create; some, like AntennaPod, only have one queue. So we need to make sure that the API covers all these different use cases. So the approach is to build a new API spec based on the existing standard, which, as many of you might have guessed, is gpodder.net. Note that it's a great thing to start from, and there are some issues that we are trying to solve. So we're building on it, in a way. Well, we're not building on it —
We're taking inspiration from it, I should say. But actually, compatibility with it is not our main focus. We also try to follow the OpenAPI standard specification, because that allows for easier integration into software. By respecting this standard, we can have CI create libraries which are always up to date with the latest specification — and that's our plan, to do that for different languages. An important aspect is also that RSS is our single source of truth, meaning that we don't want to synchronize, for example, episode titles, because that's already in the RSS feed. So why would we synchronize data that's already in the RSS feed? But at the same time, we also already have the GUID of an episode — the unique identifier of an episode in the RSS feed. That's unique, but not really, because of the RSS Wild West. So we do actually expect to create and synchronize a truly unique identifier for episodes. And we're also trying to be Podcasting 2.0-ready, which refers specifically to the GUID at the podcast level. And there are some technical challenges. One is about episode identification. Like I said, there's a GUID in the feed to identify an episode, but it's not always globally unique — so why is it called a GUID anyway? You also have links and you have enclosure URLs. And we thought, okay, to identify an episode in order to sync data between devices, we could do a hash of these three. But they're all changing, and they're all optional in the RSS standard. So you might have none of these and then end up with no hash, I guess. So that doesn't really work. So the solution we're going with is that the first discoverer of an episode — whether that's a server, if it polls the RSS feed, or a client — creates the GUID. And then there are some things that we need to consider, like first pulling the new information from the server and then sending it back, to avoid race conditions, et cetera.
And we also expect the client to do the deduplication of episodes. But if you're interested in more of the technical aspects, there's a link to the notes. Okay, thank you. So, building on that quite specific example, there's the more general question of feature compatibility. Clients and servers need to agree, in a way, on what is compatible, and we need a way of communicating that. We can't expect all apps and services to support every single endpoint, every single call, because different apps implement different things in different ways. So to get around this, what we've decided is that we should have essentially a core feature set, where we say that specific endpoints are considered core and you must support them as a client or as a server in order to be considered Open Podcast compatible. There is of course then scope to optionally extend this and to add additional endpoints which give more functionality but are not considered core. You can then negotiate that between your server and your client. These would be documented in the specification — what is necessary for compliance and what is an optional extension — and then we can work with clients and servers to map that and say: what works for you, and what do you need to implement? So what sort of endpoints are we looking to add? Well, we've got a few that we've already been working on. As Keunes has already mentioned, subscriptions is a big one: fetching, storing and syncing all of your feeds, all of your subscriptions, between devices, with the option to update them — change their URL or change their position or whatever it may be — and delete them, and to manage them across all devices. Versioning is an important one. If the specification changes and we decide to deprecate something or change an endpoint, we need to express which major versions are supported, so that clients are aware of what they are able to get from the server.
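The core-versus-optional negotiation they describe can be sketched as a simple capability check. The endpoint names below are invented for illustration — the real core set is whatever the specification ends up defining:

```javascript
// Hypothetical core feature set; the actual list is defined by the spec.
const CORE_ENDPOINTS = ["subscriptions", "versions"];

// A server is "Open Podcast compatible" only if it offers every core endpoint.
function isCompatible(serverEndpoints) {
  return CORE_ENDPOINTS.every((e) => serverEndpoints.includes(e));
}

// Anything beyond the core set is an optional extension the client may use.
function optionalExtras(serverEndpoints) {
  return serverEndpoints.filter((e) => !CORE_ENDPOINTS.includes(e));
}

console.log(isCompatible(["subscriptions", "versions", "ratings"])); // true
console.log(optionalExtras(["subscriptions", "versions", "ratings"])); // ["ratings"]
```

A client would run a check like this once, at login or setup, and then only show features (ratings, search, settings sync) that the particular server actually advertises.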
We are currently discussing episodes, but as Keunes has already alluded to, this is a very complicated thing. We already have a pad full of information about how we will synchronize this, but the goal is to have that implemented to synchronize status and playback position — how long you've played it, that kind of thing — for all episodes across all different feeds. In future, we would also like to be able to synchronize settings, or give an optional endpoint for synchronizing settings, a search endpoint, discovery for discovering similar podcasts and features, and also ratings and reviews, which are becoming a big part of a lot of podcast stores. Who's involved? Currently, you've got Keunes and myself, from AntennaPod and Funkwhale respectively. We've also been in conversations with Kasts, Podfriend, GPodder Sync for Nextcloud, and Musicpod. The idea is to get as many projects on board as possible, from both the client side and the server side. Funkwhale acts as both, but we're steering more towards the server side at the moment. If you are involved in a podcast-adjacent project, we would love to hear from you and get your buy-in and your advice and feedback. Just to mention on that last point: those are all open source, but the idea is that closed-source projects could also use this if they wanted to. What are our next steps? As mentioned, we're still discussing episodes. It's a big thing — something we need to get right and something that we need to finalize before we can consider ourselves at a point where we have a core endpoint. We also need to discuss authentication. This is super important: you should not be able to query somebody else's status, and you should not be able to get hold of anyone else's data. It must be locked down. We need to discuss how we want to do that; it will probably just be a case of OAuth, but that is for someone who knows more about OAuth than me to decide. We're currently building a new website.
Currently we have a website which is built using Sphinx, but we found some limitations with Sphinx in terms of having dynamic content, so we're going to be rebuilding it using Astro and Starlight. It's currently just in a pull request somewhere; we're waiting for that to get deployed. We're mapping features across apps, so we need to get a greater understanding of which features are available in different applications, how they present that information, and therefore how we can represent that in our API specification. We want to get a beta implementation in a few applications — client applications specifically. We would like to have at least two, maybe more, supporting some of those core API endpoints so that we can show that it works. Which of course means we also want to have a reference server implementation, which the Funkwhale team will be working on, just so that people have something to test against — something that they can deploy themselves if they want to — and so we can check that our client implementations work as expected and according to the specification. If you want to get in contact with us, contact details are up here. There's a QR code you can scan, but basically search for Open Podcast API; that's where we are. We're on Matrix. We have the website, which I mentioned we'll be replacing soon. Obviously we have a GitHub organization, which is where all of the conversations are currently happening. Get in touch, especially if you are interested in podcasting or are currently involved in podcasting — we'd really like to hear from you. If you have questions, we will be outside, I guess, and we would love to hear from you. We're very friendly, I promise. Thank you very much for listening. I'll just put the contact details up again so that you can all take your time and scan the code. Thank you very much. It was lovely.
FOSS for DOCS
We have Dr. Candace Makeda Moore with FOSS for Docs. Thank you. I'm Candace Makeda Moore, and I will be giving a presentation about FOSS for docs. First I'll tell you a little bit about myself, then I'll tell you what I mean by FOSS for docs. Then I will try to convince you to get involved in this — or, if you're already involved in it, give you some tips about what you're doing. And I'll conclude with some deeper insights about how to be successful when you do this. So, about me: I have my bachelor's from Columbia. I got my MD at the Technion. I went further in my medical training: I did an internship, and I did further training in emergency and radiology. Now I'm a research software engineer at the Netherlands eScience Center. So, stopping my biography there, you can probably figure out what happened: I married a really awesome guy and I wanted to get out of the place I was, so he got a job in Europe and I said I'd follow. I've learned Dutch — you can't really work in the hospital when you don't speak the native language. So I sort of reverted back to something I did before I went into medicine, which was software engineering. These days, almost three years later, I do speak Dutch, and you can find me two days a week at Rotterdam's Erasmus Medical Center. So I think I know a little bit about this, because I've helped create a lot of what I call FOSS for docs, and by that I mean free and open source software meant to help medical staff accomplish medical research or treatment goals. Right now I work mostly on cvasl, which does processing of arterial spin labeling sequences and other sequences from brain MRIs. And that's really typical of what I do: usually I'm working with radiological data, but not always. An example is ReSurfEMG. That was a project where I was the lead engineer, because my center gave a grant to a group at the University of Twente to work with respiratory surface electromyography.
The grant ended, I guess, about a year ago, but recently I realized that the scientists and engineers I worked with have actually had a couple of releases since I left the project, so it's still going strong across a couple of academic medical centers. So now I want to warn you a little bit, if you're new to this area: if you get into this, you're going to be annoyed. One thing I want to warn you about is that in hospitals and health systems, on average, we don't have the best computer scientists and software engineers. That is not always true, but maybe you could see that as a positive, right? Because if you know anything, you're going to come out looking like a hero. But seriously, I'm here to try and get more people who are really enthusiastic and into open source to think about doing these kinds of projects. Unfortunately, when you do this, if you work in a hospital, you're going to be, at best, outside of a hierarchy. At worst, you'll be on the bottom and people will treat you like the gum on their shoe. Okay? Just deal with it. That's part of the culture of medicine. It's a very strong culture, and one of the things that distinguishes it is the language. I can tell you from experience, if you're sort of a math nerd like me, nobody's going to speak your language. Just as an example: a long time ago, when I started doing quantitative image analysis on radiological images, I tried to talk to one of my colleagues, who was another doctor, about it. I was just sort of going off about this and the dot product, and he was like, wait, wait, wait — the matrix, like the matrix in the movie, right? He wasn't kidding. I mean, that's kind of just what you'll have to deal with. I want to add a couple of final warnings. If you're truly hardcore into FOSS, you will just have to make peace with the fact that people in healthcare systems use all of these proprietary products when perfectly good FOSS is available. Part of that is trust issues.
Part of that I really blame on us as FOSS creators, because a lot of FOSS projects that are actually pretty good, if you bother to read the code, just have a lack of swag and swagger. What do I mean by that? I brought an example: logos, a little merch — make your thing stick in people's minds. If you push past all of this and you're creating software, there's a final thing I want to warn you about if you're working in a hospital system. Unfortunately, within hospital bureaucracies and health system bureaucracies, there are some people with power and some pretty weird ideas about the possibilities and ways to make wealth through technology. At some point, like myself, you will run into people who tell you: no, you can't open source this, because otherwise you won't have any money and we won't have any money, and that kind of thing. It's not that they're evil; it's just that they aren't aware of how these things can actually be viable. Just as one example: most hospitals and health systems have some really wackadoodle legacy systems that are all joined together in a weird way in the hospital. If you build something that needs to harvest and move around data, then you can make it FOSS and also charge the hospitals just to customize what you made. This is just one model — I don't have time to get into all of them — but you have to tell people this, because otherwise you'll just hit a hospital lawyer who says, no, no, no, you can't open source that. Now that I've told you about all of these things, I want to tell you why to do this anyway. The simple answer is: it matters. I've seen so many bright minds — literal PhDs in physics — go to startups where they do things that, in my opinion, don't matter as much, like using neural nets on fashion on the internet, whatever. I tried this in a room of doctors and only one person got one of these. I'm curious if anyone will even guess — I'll give you some merch. Can anyone identify either of these diseases? No. Okay.
I'll give away merch at random later. These are diseases where we've had phenomenal success in getting the life expectancy up: cystic fibrosis and sickle cell. Specifically, I can tell you in the case of sickle cell — or both of them, obviously — that 100 years ago, computers and computer programmers were not part of the story. Today, especially in sickle cell, that curve is going up, and that is powered by software. I can tell you that because I work with people who work specifically on this. There's also the international humanitarian angle. In my first slide, you saw me on the coast of Greece, where I was part of an emergency volunteer crew. In those efforts, software actually plays a role, because we have to do things like track infectious diseases that are coming from people and going places. You'll fight a strong culture in medicine, but you can win and you can do great things. You just need to come prepared. The three things at the top there, I think, are just non-negotiable. You might not have the funding to get all sorts of swag immediately, but for crying out loud, at least get a logo. I've seen so many beautiful projects that don't have a logo, and don't have the kind of person on them who will go out and speak about stuff, and they don't get any use. They're going to die. The second thing is: get a medical reader. Get an MD who doesn't know that much to read your documentation and give you honest thoughts about it. You may end up, like I do, essentially splitting your documentation so there's a side for engineers and programmers and a side for doctors. The third thing — not as obvious, but probably the most important — is to get your legal game going from day zero. And I mean that for everyone, even if you don't touch a piece of patient data. If you touch patient data in Europe, yeah, GDPR and all of these things will come into play, but hospitals are large bureaucratic institutions — health systems, anything that touches health.
Things like even getting the right contract may take months. But if you don't straighten this out, you will end up with problems. So those three things are not optional, in my opinion. As you move forward, get some videos. This is because, as doctors and other people of this type move higher in the hierarchy, they get less and less negative feedback, and they want to appear in charge of everything — they're not going to go to a meeting and tell you: I don't understand. Videos are something they can play in the privacy of their own home to learn what you're trying to tell them. Another thing that I think is really important, especially because I do signal processing, is getting more than one institution on board early. You will discover that algorithms you might be using at one institution might not work so well at another, and it's better to discover that early. And of course, it's great if your team has a nice person you can send to meetings. And finally, once you've really built up what you're doing, please get a no-code interface, because a lot of physicians are not even going to want to do as much as typing two lines into a command line, and you will never convince them otherwise. So, on a deep level, these things I'm talking about really have to do with culture. And when I think about culture, I prefer the metaphor of water and fish — which I think an American writer came up with: you sort of don't know you're in it until you're out of it. And there are really different professional cultures between medicine and software engineering. One tiny example of that is how overloaded the terms are in computer science and software engineering — like "correctness". I mean, how many things does "Docker" mean? This is just painful for me, even though I'm kind of part of both worlds.
In the past year, I've gone to a bunch of things that were about diversity, and I sort of left annoyed, but they talked about breaking the world up into F-cultures and G-cultures. They say F-cultures are hierarchical and conformist — they emphasize the group, and they're usually non-Western. These are the cultures people tend to think of as exotic. Yeah — I've worked in the Western medical system for many, many years, and I can tell you: that's medicine, okay? Now, there is a reason for that. We can't all just go our own way and do what we want, otherwise patients might start dying. So you have to learn to navigate our culture. And unfortunately, you have to learn how to navigate your place in this hierarchy. So you have to be very respectful of those above you. You have to not make them feel threatened. Give them their learning in smaller doses — I mentioned videos. The other thing that is super effective is to actually go sit with people. Even if they are, like what we have in the Netherlands, technical physicians, they might not be so technical. Those people are supposed to be halfway between an engineer and a doctor. You may have to sit with them and show them something as simple as the command line that we're all very used to. But that helps, because you get a sense of what they will be capable of dealing with. You'll probably walk away and think: God, I just need to make a GUI, because there's no hope. But you also get a sense of when some of your nomenclature is unsettling for people — and it will be. And finally, please worry about your legal issues, and make something shiny, in the sense that it has a logo and it's well presented. So, some final thoughts. I want to emphasize that there's a lot of unevenness in how software is spreading across the world. I've worked in places like Haiti, and I've interacted with professionals in several African countries. Software is spreading.
And unfortunately, it's often proprietary software. And this is really terrible, because what you see is that when big companies (just to give an example, like Microsoft) move in, they often set up systems, intentionally or not, that make a sort of vendor lock-in inevitable: the health data becomes so fused with the system that the institutions, the hospitals, just can't get away from this stuff. It's sticky. So I think it would be great if people who make FOSS got there first, and got there shiny, in a way that builds trust. So I hope I've convinced you either to think about this, or maybe to up your game if you're already in this area. And if you have any questions, you can send them to me. My email at the Science Center is right on the bottom line. And that's it. Thank you.
Journey to an open source contribution
Next, we have Thierry with Journey to an open source contribution. Thank you. So thank you for coming. I'm Thierry Berger. I love open source, and I'm here with you today to tell you about a few open source fixes, or stories, I've done. So follow me. Let's make things better. I don't know about you, but I have a dream. My dream is that players from different backgrounds (OK, it's a technical problem too, but yeah), with different backgrounds and different interests, could still be able to play together. So you can imagine an old grandmother playing her match-three game, you can see the three candies, and she will be able to share it with her grandchild. Hey, I'm a grandchild. And that grandchild will be able to share that candy into another game, like a pet life simulator game, something. So even though they have different interests, they can play together. And it's awesome, so I'm very motivated by that pitch. So I started a hobby project using the Bevy game engine, which is an open source game engine made in Rust. And the project was going smoothly until I hit a problem: I couldn't input an at sign. And it's a big problem, because I want my players authenticated, and yeah, every email address has an at sign, so that's a problem. So, time to fix it. I have to tell you about my keyboard. I'm French, so I'm using an AZERTY keyboard. That means I have to press AltGr and 0 to input an at sign. And actually, behind the scenes on Windows, that's equivalent to Ctrl plus Alt plus 0. And that Ctrl mapping is pretty important, because Ctrl can trigger a lot of capabilities: it can copy, cut and paste, it can scroll with the mouse wheel, it can move the cursor with the arrow keys. Well, anyway, it can do a lot of things, open the task manager and other stuff. So I opened an issue on the library I'm using, which is bevy_egui, a bridge to egui, a UI library. Hold that thought. But yeah, I opened that issue.
It's actually scrolling; it's a very long discussion. You can't see that now because I'm using a PDF, but yep. And eventually we landed on a fix. It was a very long discussion, and I think it's pretty interesting when discussions are way longer than the actual fix, because it really shows that communication in software development is very important. And yeah, so if you have a problem, just ask questions, and eventually everything will progress. So now we've fixed our at sign, we can progress, right? That password field was my next difficulty. I have to tell you a bit more about my project. I want to support one-time authentication. So when the user registers with the application, it sends an email to the user. The user copies that code from their email client and pastes it into the application, into that password field. And then: web. It was working fine on native, but on web, it's a bit complicated. So bevy_egui, which I told you a little bit about, uses arboard, a library for clipboard support, but it's mainly focused on native clients. So it's a synchronous API, and on web that's a problem, because we cannot really block the browser; it would freeze the entire browser, so it's just not allowed. And arboard aims to stay simple, so that means we cannot add web support to it. So bevy_egui implemented a local-only clipboard, which is handy to copy from inside our application into our application, but that's not enough for my use case, because I want to copy from outside my application. So, time to fix it. To fix it, first I checked what my options were and how other projects were doing it, mainly eframe, the official framework from egui. And I could quickly have something working by using the web-sys crate, a crate to interface with JavaScript. And I had the copy, cut and paste events going through JavaScript directly from the browser, bypassing all the Bevy machinery, which is great. But then I had another problem.
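The idea described above, browser events pushed from the JavaScript side and drained by the app, can be sketched in plain Rust with a channel standing in for the JS-to-Rust bridge (the real fix uses web-sys event listeners; the type and names here are mine, not the actual bevy_egui code):

```rust
use std::sync::mpsc;

// Stand-in for the clipboard events that, in the real fix, web-sys
// delivers from the browser's copy/cut/paste JavaScript events.
enum ClipboardEvent {
    Copy,
    Cut,
    Paste(String),
}

fn main() {
    // The JS side pushes events into a queue as they happen; the app
    // drains the queue on its next frame. Nothing ever blocks the
    // browser, which is why this works where a synchronous clipboard
    // API cannot.
    let (tx, rx) = mpsc::channel();
    tx.send(ClipboardEvent::Paste("one-time code".to_string())).unwrap();

    let mut pasted = None;
    for event in rx.try_iter() {
        match event {
            ClipboardEvent::Paste(text) => pasted = Some(text),
            ClipboardEvent::Copy | ClipboardEvent::Cut => {}
        }
    }
    assert_eq!(pasted.as_deref(), Some("one-time code"));
}
```

The key design point is that the receiving side never waits: `try_iter` only returns what has already arrived, mirroring how asynchronous browser events have to be consumed.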
I noticed that on Mac on the web, the Ctrl and Command handling was not very well implemented, because Mac users don't press Ctrl+A or Ctrl+C; they press Command+C and Command+V. We don't want to correctly support Command+C and Command+V to copy and paste but then require Ctrl+A to select all text; that's inconsistent. So, time to fix it. I fixed that by using the user agent on web to detect which platform I was on, so eventually all my controls were consistent. So at this point, the state of my whole adventure is that I have a pull request fixing the clipboard, and it's in review. It can be quite complicated; we saw a lot of little devils in the details. So I let it sit. The main contributor of bevy_egui is in Ukraine, so you can guess he has a lot of other stuff to do. So anyway, I can just target my branch and continue on my journey, right? What, is it wrong again? Let's rewind a bit. We skipped over that first fix we did, about the at sign. The fix was mostly: if Ctrl is pressed, it's not text; but if we are on Windows, it might be text, so we refine the condition. On the web, though, it will not work, because that's a cfg macro there, which is evaluated at compilation time, and on the web the target does not equate to Windows; it's actually wasm32-unknown-unknown, for the tech-savvy. And so it's not working. Now that I've studied the subject more, I know I could have done the same check I was doing before, with the user agent, to detect the correct platform, and that would have fixed all my problems. But I did that in another pull request, to separate things and do things the correct way. And I was a bit confused, so I first tried to remove that, and then I was like, oh, OK, what about alt codes? If I remove that, I can input alt codes, because I'm French. Did I tell you that? And yeah, I'm French, so I like to input weird accented characters. So I removed that, and then I was like, oh well, there is another if right there, maybe it would just fix my problem.
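The two checks described here, runtime platform detection from the user agent and the AltGr exception to the "Ctrl means shortcut" rule, can be sketched like this (function and type names are mine, not the actual bevy_egui code):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Platform {
    Windows,
    Mac,
    Other,
}

/// Runtime platform detection. On the web, cfg!(target_os = "windows")
/// is useless: the compile-time target is wasm32-unknown-unknown no
/// matter what the visitor runs, so we inspect the browser's user
/// agent string instead.
fn platform_from_user_agent(ua: &str) -> Platform {
    if ua.contains("Windows") {
        Platform::Windows
    } else if ua.contains("Mac") {
        Platform::Mac
    } else {
        Platform::Other
    }
}

/// Should a keypress with these modifiers still be treated as text?
/// Ctrl normally marks a shortcut (copy, paste, select-all...), but on
/// Windows AltGr is reported as Ctrl+Alt, so Ctrl+Alt combinations may
/// still produce text (AltGr+0 = '@' on an AZERTY layout).
fn may_be_text(platform: Platform, ctrl: bool, alt: bool) -> bool {
    !ctrl || (platform == Platform::Windows && alt)
}

fn main() {
    let ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
    let platform = platform_from_user_agent(ua);
    assert_eq!(platform, Platform::Windows);
    assert!(may_be_text(platform, true, true));   // AltGr+0 -> '@'
    assert!(!may_be_text(Platform::Other, true, false)); // Ctrl+C shortcut
}
```

Doing the check at runtime rather than with a compile-time cfg is exactly what makes it behave correctly in a wasm build.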
I don't know what I was thinking. There's an emoji with the exploding head; yeah, I was like that, and that's pretty telling. But anyway, I decided I had to step back a little bit. Mistakes happen. I glossed over this in a previous slide, if you remember: Bevy, behind the scenes, uses winit, which is a backend library for handling windowing. It's basically the low-level stuff that sends raw inputs. And I noticed they had a lot of fixes, related or not to my issues, and I was like, ah, will I have to do all my fixes again? I wasn't too confident in it. So yeah, mistakes happen, because I think I would have been able to fix that by using the user agent and call it a day. But anyway, I like rabbit holes. So I went for the winit update. Yeah, why not? I knew it wouldn't be too easy to do, because I had to track multiple main branches, multiple unstable dependencies: the Bevy main branch and the winit main branch, which had multiple commits every day. So I had to have a stronger plan than improvising. Yeah, well, anyway. So I had to first make everything compile and work, and then, after everything compiled and worked, I could update to the new winit goodies: the new winit API and good stuff. So first, when doing a dependency update, check the documentation. But I was updating to main, so the documentation is not really great. That meant foraging through the source code, pull requests and changelogs, and occasionally chatting with relevant experts. The winit community chat uses Element, and I got responses from them, so yeah, thanks, Element. When I was ready, I rolled up my sleeves and dove into the code. And the first thing I did was updating a lot of enum names, and I'm thankful that most IDEs support search and replace. Yeah, VS Code, sorry. Then another task was updating a lot of dependencies. As you can see, there were a bunch. And I'd like to focus on a particular one: raw-window-handle.
It's a crate that provides a common interface to the window. Most of the dependencies had updated to a new version, version 0.6, actually. In Bevy, we use continuous integration testing and cargo-deny, which helps us prevent duplicate dependencies. So I had to have my whole stack targeting the same raw-window-handle version. And wgpu, which is another low-level crate, for graphics, wasn't updated to it, and I felt adding yet another main branch would be too much of a time sink. So I had to use version 0.5 of raw-window-handle. And it's quite interesting how this is supported by the whole raw-window-handle ecosystem: you can just enable a feature on most ecosystem crates to say, I want to support this particular version, and everything will be consistent. I had to do a few pull requests to the dependencies, but everyone was very responsive, and we eventually had something consistent. So now we can profit, right, and progress on my task? Not yet, because the winit update is pretty complex. It can impact a lot of architectures, platforms and stuff, and I don't have every platform to test. I also have limited time. So I reached out to the Bevy Discord for help: hey, my pull request is nearing completion, can you help me review it and test it, please? So yeah, we caught a few bugs, and I'm very thankful for everyone who chimed in. And eventually the winit update was merged. Yes! So now we can profit, right? When I'm doing anything, I like to focus on the objective at hand. That meant taking a few shortcuts. I noted them all as faithfully as I could; if you check out the whole pull request and the winit follow-up, there are a lot of notes. But I didn't write them in one go. As I discovered things, I wrote them down for me, for reviewers, and for future readers, for afterwards. Yeah, so now I think I will step back a little bit from all this and go back to my use case. Let's recap a bit. We did a lot of things.
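The feature-flag mechanism described here can be illustrated with a Cargo.toml fragment. This is a sketch, not the actual Bevy manifest: the `rwh_05` feature name is the one winit 0.29 exposes, and the version numbers are only indicative of the era, so check each crate's manifest before copying anything.

```toml
# Pinning the whole stack to one raw-window-handle major version via
# feature flags, so cargo-deny sees no duplicate dependency.
[dependencies]
raw-window-handle = "0.5"
# winit exposes one feature per supported raw-window-handle version.
winit = { version = "0.29", features = ["rwh_05"] }
# wgpu of that era still targeted raw-window-handle 0.5, so it lines up.
wgpu = "0.18"
```

The point is that a crate like winit can implement the traits of several raw-window-handle versions at once, and downstream users pick the one their stack needs.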
We implemented copy-paste via JavaScript events. We detected the platform using the user agent. And we even updated winit. So whoa, that's a lot of things. So does it work yet? Not yet. But I'm very confident that we have everything at our disposal to make it work. So next time we talk, I will tell you how everything works perfectly. Thank you for your time. Everybody can help, and if you want to help Bevy, come to our Discord chat, or just come talk to me afterwards; I have free Bevy stickers if you want. So yeah, just come and talk. Thank you. Thank you.
Trusted Postgres Architect - Deploying Postgres with Infrastructure as Code
So, next we have Boriss Mejías with Trusted Postgres Architect: deploying Postgres with infrastructure as code. Right, so thank you very much. Thanks for coming. My name, as she says, is Boriss Mejías. I used to be a solutions architect at EDB, but I grew a little bit of white hair, so now it's senior solutions architect; but that pisses off a lot of the building architects. So actually my real title is holistic system software engineer, because I like to see things from the fundamental interconnectedness of all things. I used to be a developer, an operations person, a DBA, a consultant, so I like to see the whole stuff. That's why I see the value of the DevOps philosophy, because it tries to make the whole thing one single way of deploying stuff more reliably. Apart from that, I'm also an air guitar player, and I really love metal music, among other types of music. I'm going to talk about Trusted Postgres Architect. So, who here already uses Postgres? Nice, okay, that's very good. Okay, so who didn't raise their hand but wants to use Postgres? Maybe, but I think everybody already raised their hand. That's good. Okay, there you go. Thank you. So this talk is for you. For all the rest it's also an interesting problem, because it's about reliably deploying Postgres on multiple different infrastructures. Okay, so this is the use case. This is a developer. She is starting a new project. She finally wants to use Postgres, because it's been one of the most popular databases in the last years, and she has this brilliant idea, but she doesn't want to keep typing single commands all the time, because she wants to have an environment where she can test, test, test, and when everything is finally working well, she's going to be able to deploy that into different environments, either a test environment, pre-production, staging, whatever you call it, but exactly the same thing.
So the typical thing that people do is: I have a container, I'm going to put that specific container onto the server. This is not exactly that, but it can also rely on containers in order to emulate the final architecture. So let me explain a little bit more. You want to do it in a reliable way, and that's why the name of the tool is Trusted Postgres Architect: TPA. We like to call it TPA, although it gets people confused with TAP, which sounds like the tap where you get your favorite beverage at the bar. The first goal she has is to deploy one single instance running Postgres 16. This could also be: you are already running Postgres 14 or 15 and you want to try the new features of Postgres 16. Who is already running Postgres 16 here in the room? Okay, so much fewer than the people who were already using Postgres. So this is probably one way for you to test the new version. So I'm going to just show you code here, which is YAML. You might not like it, but it is the standard way of doing Ansible stuff, of doing deployment. So on this whole screen, which is pretty large, I'm going to put all the code that you need in order to have this one single instance. First of all, in TPA you have to specify your architecture. This is M1, master-one. I know we now call it primary, but master still sounds nice, because it reminds me of "Master! Master!". So that's why it's called M1. And obviously the cluster is going to have a name: fosdem is the name of the cluster. Then you have cluster variables, plenty of stuff that you can ignore at the moment; I'm going to come back to one of them afterwards. But the most relevant one here is this: Postgres version 16. That's what you need. Okay. So this is the version you want, and it's going to deploy that version for you. So these are the cluster variables. I'm going to come back to the cluster variables later.
Then, because you want to be able to do deployments in multiple locations for fault tolerance and high availability, it's always good to specify a location. We are in Brussels, so we are going to call the location Brussels. And we are going to have an instance, obviously. Thank you. At the ULB. But first we're going to say which type of instance and which defaults we have. So we are going to do it with a Debian image; it's a specific image tailored by TPA, but you can use whatever image you want here. And here is the platform: it says Docker. This is not cloud-native stuff. It's really an easy way to have something that I can connect to and that tries to behave as if it were a virtual machine, but it's a container with everything in it that you need from a virtual machine. And as you can see, TPA uses Ansible, so we are going to have this Ansible user here for connecting to the machines. And here is the instance. You specify only these parameters: the name, the location, a node number within your cluster, and the role. And the role is to be a primary. So here we use the most modern way of referring to the primary node. That's it. That's all the code that you need for one single instance, of course. Then, because this is Ansible, you run tpaexec, the executable of Trusted Postgres Architect. Provision, so you provision your cluster, and then deploy. And then you've got it. Okay? So how do you connect to it? Yeah? Well, I told you that the Docker containers are going to behave like virtual machines, so we can SSH to the machine. We do SSH using a key file that is generated in the process of doing the provision. ULB is the alias that we gave to the instance. And then we do the typical thing: I become the user postgres. Oops. I become the user postgres and I execute psql. Yeah? So it's really easy, but it's using the superuser postgres, and you don't want that for applications. So let's get a new requirement. But this is how you connect, okay?
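The single-instance config.yml described above might look roughly like this. This is a reconstruction from the talk, not the actual slide: the cluster, location, and instance names are taken from the spoken description, the key spelling follows the TPA documentation, and the image name is illustrative, so verify everything against your TPA version.

```yaml
---
architecture: M1
cluster_name: fosdem

cluster_vars:
  postgres_version: "16"

locations:
  - Name: brussels

instance_defaults:
  image: tpa/debian     # illustrative; any suitable image works
  platform: docker
  vars:
    ansible_user: admin

instances:
  - Name: ulb
    location: brussels
    node: 1
    role:
      - primary
```

With this file in the cluster directory, `tpaexec provision` followed by `tpaexec deploy` brings the instance up, as described in the talk.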
You want to have an administrator which is not the postgres user. So we are going to call it Slonik, because that's the name of the blue elephant. Yeah? And this is going to be an administrator. And then you have Ada Lovelace, who is going to be the application user. And we don't want to use the postgres database, so we are going to have a fosdemdb, which is owned by the application user. So this gives us a little bit more security already. So how do you change the previous code in order to fulfill this new request? In the cluster variables, we are going to keep these two variables there, the failover manager, which I'm going to use later on, and the Postgres version. And then I'm going to add the users. Yeah? So this is how you add a user. You give the username; I'm going to ask TPA to also generate a password for that one. And the roles of that particular user: in this case, it's going to be superuser. That's the administrator. You can also grant permissions and stuff like that, but in this case I just want a role attribute. And then we have the developer, who is doing the application part, Ada Lovelace. We also get a generated password for that one. Then we ask for the database: we give the name and the owner. And that's it. So I'm adding new stuff; it's not just for the first deployment, it's also for maintenance. Okay? And you can do a git commit of the new version of this, so you can keep track of your infrastructure with different versions. If you want to revert this, you can also do it. Right? Then, of course, provision and deploy. And you can continue. Before, I showed you that you can connect to the database through SSH and then psql. Now we want to do it as an application would. So what we are going to do is ask TPA: give me the password that you generated for this cluster for that particular user. The password is random stuff, which is not that random; it actually contains a reference to a metal band from Belgium.
If you can figure it out, I will buy you a drink. And then, using that password, you can connect with normal psql. You provide the host IP, the port, the user. And you pass the capital -W flag so that you can type the password there, if you don't want to put it in the .pgpass file, for instance. But now I'm not using the SSH stuff; now I'm really behaving as if I were an application. Okay? You can take a picture and try to figure out the reference. So we have this now, with that little amount of code, but we don't have any fault tolerance yet. What happens if this thing crashes? Well, we want to have a replica: exactly the same version, physical replication. And that's the new thing that we are going to do. So let's take the code again. We have those cluster variables: the failover manager, which so far has done absolutely nothing, and running Postgres 16. TPA can do it with a tool called repmgr, which I like a lot, and also Patroni, which is also very, very good stuff. So you can choose; in this case, I'm choosing repmgr. And then, in the instances, if you remember, I have this primary one. The only thing that I need to do is to add another instance. This one is the VUB. So you see, the French-speaking one, the Dutch-speaking one, but the city name is in English so that nobody complains about which one I picked. Now you have this one. It's a different role: you see, this is a replica, and this is the primary. So I have to say who is the upstream of this replica, and it is the ULB. And I have a cluster with two nodes. Again, tpaexec provision, tpaexec deploy. Let's continue. I want to have more fault tolerance. What happens if there is an attack on Brussels and all the universities get destroyed? We want to have a third replica, but we don't want it replicating from the primary; we want to have cascading replication. But if somebody deletes a table here by accident, it's also going to be deleted on all the nodes.
So you need to have backup and restore, for point-in-time recovery. That's why you want to have, in another part of the country, your barman, because you trust your barman. Barman is for backup and recovery management. It's important that your backups can be recovered; if you just make backups but never recover them, you don't have backups, basically. So this is what we are going to build now. Let's get back to what we have. We have the location Brussels and two instances. Let's add another location, Vlaanderen; this is in the north. And then we are going to add a replica in a very nice place called Achel. It used to be a Trappist beer. Not anymore; it's still a very good beer, but it's no longer a Trappist. This is just for your general knowledge about Belgium. So there you get the location, Vlaanderen. I'm going to say that it's also a replica, and I'm going to have the VUB as its upstream. This is how I build cascading replication. I could now provision and deploy, and I'd have my other replica, but I also want to have backup and recovery. So what I do is add here another location, which is Wallonie. And then I add a very nice place which is still a Trappist one (cheese, and also beer; this is my favorite one, actually). And then look at the role: it is barman. This is how it gets backup and recovery management, just by adding this: an instance with a barman role. Now, where am I taking the backup from? Well, I need some space there. I didn't put it at the bottom, because otherwise you wouldn't read it, so it's coming here: you just add backup: rochefort, and that's how you build it. So this is all the code you need, and you already have a cluster with cascading replication and a backup and recovery tool. Good. You do provision and deploy, and then you're done, and you have built an architecture. This is all done with Docker containers; the idea is that you can take exactly the same file and put it onto virtual machines and other stuff.
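Putting the whole progression together, the final cluster described in the talk might look like the sketch below. Again, this is a reconstruction: names come from the spoken description, key spelling follows the TPA documentation (postgres_users, postgres_databases, upstream, backup), and you should check it against your TPA version.

```yaml
---
architecture: M1
cluster_name: fosdem

cluster_vars:
  postgres_version: "16"
  failover_manager: repmgr
  postgres_users:
    - username: slonik            # administrator, not the postgres superuser
      generate_password: true
      role_attrs:
        - superuser
    - username: ada_lovelace      # application user
      generate_password: true
  postgres_databases:
    - name: fosdemdb
      owner: ada_lovelace

locations:
  - Name: brussels
  - Name: vlaanderen
  - Name: wallonie

instances:
  - Name: ulb
    location: brussels
    node: 1
    role: [primary]
    backup: rochefort             # ship base backups and WAL to the Barman node
  - Name: vub
    location: brussels
    node: 2
    role: [replica]
    upstream: ulb                 # streams from the primary
  - Name: achel
    location: vlaanderen
    node: 3
    role: [replica]
    upstream: vub                 # cascading replication, not from the primary
  - Name: rochefort
    location: wallonie
    node: 4
    role: [barman]
```

As in the talk, `tpaexec provision` and `tpaexec deploy` turn this one file into the whole architecture, and committing it to Git gives you versioned, documented infrastructure.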
So if you don't remember how to write the configuration files, it's very easy: you can also use the tool. tpaexec configure, the cluster name, and then you can say: I want to use this architecture, running PostgreSQL version 16, the platform is going to be Docker, my operating system is Debian, and the failover manager is repmgr. And you get something very similar, where you just need to change some names. Now, look at the Docker thing here. If I change it to bare, because I like bare metal, you just change that and you get a different configuration file, with some IP addresses that you just need to fill in. It's basically the same. But you can also deploy to AWS. So TPA is going to create your virtual machines at AWS, if you have the credentials, and it's going to manage all the network things. So you just have to do provision and deploy, and you get everything. It's super cool. Okay, so: configure, where you provide the architecture, the platform and the OS; then you do provision; and then you do deploy. And deploy also provides some hooks, like pre-deploy, pre-initdb, post-deploy, so you can do some stuff like enhancing your cluster. So, to summarize: you have an architecture here, and an executor of the architecture, which is the orchestrator here, and it's going to deploy to some machines. These machines can be virtual machines in AWS, if you have the credentials, or bare metal, or Docker containers. When I see a ship like this, full of containers, I always think about the albatross and The Rime of the Ancient Mariner. Okay, those of you who listen to the same kind of music know what I'm talking about. The cool thing is that it is exactly the same architecture here, and exactly the same way of doing the provision and deploy; it's just a different target. So instead of submitting your container to somewhere else, you say: I'm going to deploy the same architecture somewhere else.
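The deploy hooks mentioned above are, per the TPA documentation, plain lists of Ansible tasks dropped into the cluster directory. A minimal sketch of what a post-deploy hook could look like (the file name convention comes from the TPA docs; the task itself, the module name, and the database name are only illustrative assumptions):

```yaml
# hooks/post-deploy.yml in the cluster directory: a list of Ansible
# tasks that TPA runs after the main deployment, e.g. to "enhance"
# the cluster as mentioned in the talk. This task is an assumption,
# not from the talk itself.
- name: Enable pg_stat_statements on the primary
  community.postgresql.postgresql_ext:
    name: pg_stat_statements
    db: fosdemdb
  become: yes
  become_user: postgres
  when: "'primary' in role"
```

Because the hook lives next to config.yml, it gets versioned in Git along with the rest of the infrastructure definition.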
So what we basically do, when we do a project with a customer who wants to run an architecture, is deploy it using the TPA definition, and then we pass exactly the same architecture, but in Docker containers, to the support team, which is going to continue talking with the customer after we have finished the project. So whenever the people who have the project in production have an issue, they contact support and say: I have an issue with my architecture. Support can deploy a model of it with Docker containers, and then they can test the whole thing there. So it really gives you a continuation of your project. It's not just the first deployment that is easy; it's also the maintenance and the documentation of it. You don't want to document everything in a PDF; you want to document it in code. Your configuration file is the documentation of your architecture, because you are using your infrastructure as code. That's the main advantage of using this kind of tool, and that's why I like it a lot. All right. So these are the platforms. If you want to have a look at it, it is on GitHub now, released under GPL version 3. It has only recently been open-sourced, but we have been using this tool for six years already, so it's quite mature and embodies our best practices. Everything is done with security layers, SSL, host-based authentication; everything is done for you. And you have the documentation at enterprisedb.com. To conclude: it is infrastructure as code. We always put the files in Git, so that we can have different versions and know how our infrastructure is evolving. It is not only good for testing, because you can test your entire infrastructure, but you can also use it for deployment in production afterwards. It is a way of documenting your deployment, not just as a PDF but as code. And it's not just for the deployment; it's also for the maintenance of your stuff. And finally, we got it open-sourced. It's been there for a while.
We have been using it, we have been fighting to get it open-sourced, and there you have it. So you are free to use it as much as you want. All the documentation is there, but you can also contact me via my personal email, my company email, or Mastodon. And that handle is also there on the other social media, the one full of haters. And Hail Slonik! Thank you very much.
Switching the FOSDEM conference management system to pretalx
Okay, so next we have Johan Van de Wauw, speaking on switching the FOSDEM conference management system to pretalx. It's already time? It's not too early? Okay. Wow, thank you. Hello, everyone. I'm going to talk about, well, maybe not such a technical issue, but I'm going to talk about how we migrated from Pentabarf, whose logo is on the left, to pretalx as our conference management system. So, a very short thing about me. I do scientific programming; I develop, together with my friend over there, fiber-based monitoring solutions. And apart from that, I've been on the FOSDEM team for quite some time. I visited for the first time in 2007; I did some research for this presentation. I've managed the geospatial devroom, since a few years I've been coordinating the devrooms, and I'm part of the FOSDEM server team. I am not a web developer, and I'm also not good at slides, as you can see. That's important to know. So, what is pretalx? Oh no: what is Pentabarf, and what is pretalx? What do we use them for? Pentabarf and pretalx are the tools where people submit their talks, where devroom managers, or we, the staff, choose the talks, where we review, where we build a schedule, and then finally publish it on the website. This is the tool we used, which was called Pentabarf. And this is the new tool, pretalx, which we used this year for the first time. Why did we switch? Anyone here: who of you has actually submitted a talk this year? Okay. Who are the devroom managers in the room? Well, okay, at least one. Yeah, most of them are in their devroom, of course. So I would love to get some feedback from them. So, what was the main issue with Pentabarf before? The main issue with Pentabarf is that it's Ruby on Rails. I tried to get it running on my computer for a few years, because we wanted to improve it, but it didn't work; I couldn't get it working. Actually, my next slide is maybe more interesting. This is a screenshot I made.
So this was the state of Pentabarf. This is the upstream master repository, and you can see it has been abandoned for quite some time. We made a fork with the nice name postgres9, which gives you an idea of when that happened. And you can see we did some updates, but not too many. You would get people making pull requests like this one. So yeah, I could not install it. In the web archive I found install instructions, and they wanted to add those. I also found other install instructions, and we had some in our internal wiki, but even with those I could not get it running. That's why at some point I said: no, I will not improve Pentabarf. I went to improve another project instead, which is still in use for other conferences. So I had a look at pretalx. Pretalx is a Django application. I had been struggling with the other one for a very, very long time, and at the end of the evening I said, well, let's try this one. So I just did the Docker Compose stuff, and I had the thing running, and I could import a schedule which had been generated before. And I almost had something that looked like a full conference system for FOSDEM ready, maybe after one hour. So I was quite happy with that. Yeah. So I was not the first one who had planned to move away from Pentabarf, because what are actually the issues with Pentabarf? The main issue was that nobody could install it. It was still running, but we didn't know for how long. And if some strange bug occurred, I'm not sure anyone on our team would really be capable of fixing it. Well, if there's a really bad bug, people start to become a bit better, and they might fix it, but it was unmaintained for such a long time. So there had been many plans to move, but they usually failed, because then people said: well, we need to have that feature, and that feature, and that feature, and that feature. And the other thing is: nobody works on FOSDEM until, let's say, September.
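The "Docker Compose stuff" for trying pretalx locally might look roughly like this. This is an illustrative sketch only: the service layout and environment variables are assumptions based on a typical Django deployment, and the pretalx repository ships a maintained compose setup that should be preferred over this.

```yaml
# Illustrative docker-compose.yml for a local pretalx trial.
version: "3"
services:
  pretalx:
    image: pretalx/standalone:latest   # image name as published on Docker Hub
    ports:
      - "8080:80"
    depends_on:
      - db
      - redis
  db:
    image: postgres:15
    environment:
      POSTGRES_DB: pretalx
      POSTGRES_USER: pretalx
      POSTGRES_PASSWORD: pretalx       # local testing only; never in production
  redis:
    image: redis:7
```

A single `docker compose up` against a file like this is the kind of one-evening experiment the speaker describes, with the schedule import done afterwards through the web interface.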
In September we do a kickoff, then we open the call for devrooms, and then it's like: okay, yeah, we're too late, it's not ready yet, we cannot use it. And then the next year, nobody works on FOSDEM until September, when, again, they kick off and it's not ready. There's also some resistance to change. People say: it works for us, so we don't need to change it. That's mostly the internal people; it's not the people submitting. We had people, kernel developers, sending videos of trying to log in and use Pentabarf without it working. So I think that's quite a bad state. So, in order to avoid those things, there were a few things I wanted to have ready before the kickoff that we have in September. And for me, there were two things: building the website had to be possible from pretalx, and we needed an audit log. What is an audit log? This is an example from Pentabarf: it records everything that is entered into the system. It's actually interesting that somebody gave feedback on a talk from a year ago, but that was the last entry I could find. It shows every change, everywhere in the system. And this is really useful, because we have had these discussions. This year, some devroom manager approved a talk in another devroom. And then, yeah, that's a bit... it's not nice, because the speaker books a ticket and thinks he can go, and then: oh, that was a mistake. We fixed that, by the way, so they can no longer approve each other's talks, but in the beginning it was possible. We also had a presentation where they completely changed the scope after it was approved; that was also not very nice. With an audit log you can always go back and see what the history was. I would actually really recommend, if you build something: use such a log, even for a normal database, but definitely for a conference management system. It makes sense. So this was one of the two things I wanted to have ready by September. It didn't have to look nice.
The view on the left is very useful. The one on the right, well, you can do those things if you need them, but you will not get happy from it. But at least we had a way to find out, if strange things happened, how they happened. It also covered other kinds of changes; if we changed some configuration, at least we could trace back the history. The second big thing we needed, as I said, was to be able to build the website. Why? Because our website is used by most of our other integrations, which include Matrix, SReview, which is what people use to review their videos, and all the scheduling applications that you have on your phone. So that was also one of the things which had to be ready, at least in some form, before we could switch. A third thing I did, which only started after this initial session, during the actual organization of the event: I created a plugin for pretalx with some specific settings for FOSDEM. For example, dev room managers publish a call for papers on the website; they can enter it here. Well actually, all of them sent it by mail before I had the system ready, but at least next year they will be able to do it. It's similar for most of the other boxes which are there. They can close their call for papers, so people stop submitting to their track, because some tracks like to keep it open for a longer time, but then at least they get a URL where people can still submit. So if you're really quick, you can get that code and submit to the main track, but I don't think anyone will accept it. During the event, and this is actually something I fixed only this morning, dev room managers will find some instructions. I hope that this will grow a bit over the next year, so that they have only one place to look while running their dev room.
Yeah, so as I wrote here, most of these things were a bit late; it was only during the conference that there was enough drive to add those things, to build them, to realize that we needed them. If you just click around in the interface, it all looks fine; it's only when you start using it that you notice you need some extra tools, unless you're really good at testing. I'm not. We had to make some changes to pretalx itself, mostly to prevent dev room managers from editing other people's things. I made some changes to the review system. As I said before, sometimes it is too late. There was something I didn't understand, and none of the reviewers complained: if they did reviews and clicked next review, they would get a random track from somewhere else. That doesn't really make sense, so I changed it to always stay within the same track. Then, showing all submissions of a speaker: this was not enabled by default, but we have some people who submit a lot of talks. One person submitted 15 talks for this FOSDEM. If all of those are in different tracks, all of these dev room managers would spend time looking at the same proposal again. Now, if they see the list, at least they know: okay, he is already accepted elsewhere, let's keep that in mind. The last thing we had to change, which is a bit complicated and where I'm maybe not completely happy about the workflow, is the fact that we have parallel scheduling. Pretalx itself is actually made for a large group of reviewers and a small group of people who actually build the schedule, which works for most conferences but doesn't work for FOSDEM, because we have a lot of people scheduling. So, well, that's actually the nice side. Some of the last things I want to mention are the annoyances of at least some of the people, mostly from the staff. Pretalx is much less information-dense.
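The duplicate-submission problem mentioned above, one speaker submitting the same proposal to several tracks, amounts to grouping submissions per speaker and flagging anyone with more than one. A toy sketch (the data shapes are hypothetical, not pretalx's actual schema):

```python
from collections import defaultdict

# Hypothetical flat list of submissions as a dev room manager might see them.
submissions = [
    {"speaker": "alice", "track": "Python", "title": "Intro to asyncio"},
    {"speaker": "bob",   "track": "Go",     "title": "Generics in practice"},
    {"speaker": "alice", "track": "Web",    "title": "Intro to asyncio"},
]

by_speaker = defaultdict(list)
for sub in submissions:
    by_speaker[sub["speaker"]].append(sub)

# Speakers with more than one submission are worth a second look.
multi = {s: subs for s, subs in by_speaker.items() if len(subs) > 1}
for speaker, subs in multi.items():
    tracks = ", ".join(sub["track"] for sub in subs)
    print(f"{speaker} submitted to: {tracks}")
```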
If you look just at the resolution, you see that here all information is spread out over the screen, while here it's very close together. You have a search box: if you start typing here, you get to any talk. There, you would go to proposals, then click on talks, then type in the search box, then search; it takes much more time. So this is one of the annoyances we had this year, which we hope to improve for next year. Then the things where we improved over PentaBARF, things which already went better this year even though it was a migration year: there were many more reviews, so we could reach a larger group, I think because it was easier to use, or maybe because we promoted it a bit more. Dev room managers can now send mails. Before, they had to export all the email addresses, run them through their own mail program and then send mails. Here they are in the system, and if you go back as a speaker, you can click open those things and find your mails yourself. So finally, I have only three minutes left: what is the roadmap, what are the ideas? First of all, the audit log which I showed you. It's a bit integrated, well, actually the code is quite separate, but as soon as pretalx makes another release, I want to make it a separate plugin which can be installed completely apart from FOSDEM, because I think it's interesting for all pretalx users who choose Postgres; they should just use it, it will always help you. The next part actually goes back to what I told you earlier, that PentaBARF was so hard to install. I don't want to create a new plugin which is as hard to install as that was. Well, nowadays it's a bit hard. We have a demo site, which is that one, pretalx test. I actually want to turn that into something that you can install easily. It will not be one click, but it will be quite easy to install, so that people who want to improve it at least can do so. Yes.
Then the other thing is that we made some custom changes. I hope to get rid of them: either integrate them into pretalx, or make sure there are signals, places in the interface where we change something, so that our changes can live in the plugin instead of in the forked code itself. The last thing is that we want to get more information about previous years' submissions, because it's interesting, especially for main track speakers, to see: has this person presented before, how was the feedback? Maybe he already gave the same presentation, then we will skip him, those kinds of things. Finally, my last slide: you can help. An obvious way is to help upstream with the project. Pretalx is used by a lot of other conferences; I believe about 100 conferences or so are organized every year with pretalx, maybe many more, sorry if I'm off. We have our own repos, and especially the first one is useful: that's the pretalx integration one, because that's where we really run the bug tracker. I put some questions there, and I intend to put a few more, also with questions for you. Especially since you are the users, like the dev room managers: which settings do you want to have for reviews? Do you want to score from 0 to 10, or from 0 to 100, or in different categories? I just made a random choice, and I would like some feedback. Then there are the two forked repos, which are actually only useful in combination with the other one; I just list them for completeness. That's my talk.
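The review-scale question at the end, score 0 to 10 or 0 to 100, only matters for comparison if scores from different scales are normalized. A tiny sketch of that idea (my own helper, not a pretalx feature):

```python
def normalize(score, scale_max):
    """Map a review score from a 0..scale_max scale onto 0..1."""
    if not 0 <= score <= scale_max:
        raise ValueError("score outside its scale")
    return score / scale_max

# A 7/10 and a 70/100 express the same opinion once normalized:
print(normalize(7, 10))    # 0.7
print(normalize(70, 100))  # 0.7
```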
From OpenLLM-France to OpenLLM-Europe: Paving the way to sovereign and open source AI
So thank you for coming. It's quite early, 9:30, it's difficult to start, so I will try to push some energy into this session. Just before we get started, I would like to know more about you, with three very simple questions. First, who has ever locally run an LLM on their laptop using llama.cpp, vLLM or LM Studio? Please raise your hand. Okay, right. Second question: who has ever fine-tuned a model? Fine-tuned. Okay, maybe ten. And the last one: who, like me, dreams of having one hundred percent open-source models here? Not only open-weight models, but really open-source models. Raise your hand. Okay, you are in the right place, so we will do the job. Okay. Yes, my name is Michel-Marie Maudet. I am the co-founder of a software company called LINAGORA. We started in 2001, so we are very close to our 25 years; that will be next year. Our mission with LINAGORA is to invent and develop good tech for good, what I can sum up as ethical open source. And for AI we do the same: we do ethical AI. To achieve this goal, we started a community, a very large community called OpenLLM France, in June 2023. We have two main goals. The first is to build trusted, sovereign and really open-source generative AI technologies. The second goal is to build a strong ecosystem around LLMs and generative AI systems. For the second objective, I can say that we have had success, because the community right now has over 450 active members, with strong support from academic and public research in France. That's very important because, for example, with GENCI we can freely use supercomputers like Jean Zay, which is very useful for us: it gives us free GPUs to train our models. And at the same time, we have a lot of corporates, private companies, who are using AI technology or want to build AI solutions with us.
So, for this track today, I think there are a lot of important things about building ethical AI systems. My talk will cover three topics. The first is what we can consider open-source AI. The second part will be related to diversity and the underrepresentation of our cultures and our languages in these models today. And the third part this morning will be related to data quality and the evaluation of these models. Okay. Right. So, to be very clear, and to start with the biggest problem: the most popular open models that you are using today are not open source. They are open-weight models. This afternoon, Stefano Maffulli from the OSI, the Open Source Initiative, will give a talk reporting on their progress towards a definition of what we can consider open-source AI. I'm very proud because I'm part of this small private group of external experts working with the OSI to try to reach this definition. It's important to clarify the situation because, as you know, and I'm not alone here, Stefano and probably some of you have published posts raising the problem of the misuse of the open-source term today by some players in the ecosystem. I put the OSI definition of open source on this slide. To be very clear: if you have limitations on use in the license and the terms of use, or if you don't have the artifacts, the elements needed to train the model again or to make a derivative work of it, you can't say that you are doing open source. This is very clear. And today, for most of the popular models, you don't have visibility of or access to the dataset used to train the model. For us, for this community, open-source AI means three things. First, that we open-source the model and everything around it.
All the tooling used, for example, to train the models, to evaluate them, the pipeline used to do the evaluation of the model; for various reasons, it's not very easy to find this information for an open model today. The second point is related to the license: the license must not limit who may use the model or what they may do with it. And the most important, the third point, is related to the data: open corpus, open corpora. And you know, it's very interesting, because if you follow the news related to AI, you saw during these past days some new models with datasets published under open-source licenses. I think it's very important, and I think 2024 will be the year not only of open-source AI, but also of dataset publication under open-source licenses. I changed my presentation last night, just after the talk of Jos, the co-founder of Nextcloud, because he presented an ethical rating system, and I'm very glad to see that we share the same point of view. It's very simple, also for the Nextcloud community: if all three conditions are met, you are in the green area. If only two conditions are met, you are in the yellow; only one, orange. And if you are using, for example, ChatGPT from OpenAI, zero conditions are met, so you are in the red area. So if we have developers from this beautiful Nextcloud community here this morning: thanks for your work, it's amazing and we love it. For us, by the way, we are in the green area, and we try to do the job. The second topic I would like to underline this morning is the problem that generative AI models are more and more a representation, a picture, of what we are in terms of culture, in terms of society, in terms of language. I think the figures speak for themselves.
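The green/yellow/orange/red rating the speaker borrows from Nextcloud maps directly onto counting how many of the three openness conditions are met. A small sketch (the function and labels are mine, following the scheme as described in the talk):

```python
def openness_rating(open_tooling, open_license, open_corpus):
    """Map the number of openness conditions met to the talk's color rating."""
    met = sum([open_tooling, open_license, open_corpus])
    return {3: "green", 2: "yellow", 1: "orange", 0: "red"}[met]

# Open tooling and an unrestricted license, but a closed training corpus:
print(openness_rating(True, True, False))    # yellow
# A fully closed hosted service meets none of the conditions:
print(openness_rating(False, False, False))  # red
```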
On the left, you can see that since 2018, less than 8% of LLMs have been created in Europe. And on the right, you can see the volume of each language used to train the Llama 2 model: 0.16 percent for French and 0.17 percent for German. I don't know what you think about that, but in my view, we can say that our cultures and values are not really well represented in these models today. So we have a problem, I think, and as a community we try to solve it. The first thing we tried was to adopt a data-first, or rather quality-first, approach, because small is also beautiful, and to get proof that the quality of the dataset is more important than the quantity of data you have. To demonstrate this point, we released a first model in October called Claire. Claire, like the woman's first name in France. I have nothing against Bert, Albert or Alfred, but you know, in our community we prefer to promote women, because it's our little contribution to having more women in the AI ecosystem and the global community. I will not go deeply into Claire, because Julie, the real one, yes, will go deep and tell you all about Claire and how we built this model. Very briefly: we gave the proof that with a small amount of French tokens we are able to produce a very good conversational model. Conversational means that Claire is able to understand dialogue between people, with diarization. And Claire's second feature is that it is able to talk like you, to hold a human-like dialogue with disfluencies and hesitations, because we trained Claire on conversational data. We continue to collect a lot of data, and today we are at around 140 billion tokens in French. And I'm very glad and happy to announce that we have started the training phase of our new model, called Lucie.
The main goal of Lucie is to fix, or yes, to improve the underrepresentation of the French language in LLMs generally. But at the same time, we put into our dataset some other European languages, German, Spanish, Italian, and some source code, to give our model a capacity for reasoning. And we are trying to build some new features to make this model efficient not only in French but for other languages as well. You will probably be interested to follow this work, and probably our custom tokenizer and so on. But the most important thing I would like to share with you this morning is that we are not the only community involved in this goal of building sovereign LLMs in Europe. I'm sure this list is not exhaustive; if anyone knows of other initiatives, please come to me just after the presentation, I will be very excited to discuss with you. The most important thing is that we strongly believe we have all the capacity, all the technology, all the GPUs in Europe to build our own models. That's why I'm very delighted to announce to you today, on this first day, that OpenLLM France is becoming OpenLLM Europe. You can use this QR code to join our Discord server. All the content we produced during the past six months in French is still available, but we have created a channel for each European language. So please, welcome. And if someone wants to be part of the community management team, please contact us and we will be very pleased to onboard you in our initiative. That's my talk for today.
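Underrepresentation figures like the 0.16 percent French share quoted for Llama 2 are simple token-count ratios over the training corpus. A sketch with made-up counts (the numbers below are illustrative, not Llama 2's actual statistics):

```python
# Hypothetical token counts per language in a training corpus.
tokens = {
    "english": 1_790_000,
    "other":       4_000,
    "german":      3_100,
    "french":      2_900,
}

total = sum(tokens.values())
for lang, count in sorted(tokens.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {100 * count / total:.2f}%")
```

With proportions this skewed, even a large absolute number of French tokens leaves the language nearly invisible during training, which is the gap Claire and Lucie aim to close.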
LinTO Studio as Your Ultimate Open Source AI-driven Media Management Solution
Okay, greetings everyone. Thanks for coming to discover LinTO. LinTO is your ultimate open-source AI-driven media management solution. I'm Damien Lainé, head of R&D engineering for LinTO at LINAGORA, and I'm proud of it. So what is LinTO? Essentially, LinTO is a set of voice technologies that brings you the best of the open-source side of voice tech. In LinTO you can find all the cognitive APIs you are craving, like transcription, with live or batch transcription. We have a set of NLP APIs that let you add punctuation and identify named entities and topics, and so on. We have also worked on speech synthesis. This is the first set of LinTO technologies. Leveraging those technologies, we built a full-fledged alternative to Alexa and Dialogflow for building smart agents, which includes chatbots, smart assistants, voicebots with custom wake words that work in the browser; that's very neat. And finally, over the past two years, leveraging our technologies further, we built a business-oriented solution called LinTO Studio, which is a media management platform that enables you to upload media, run these cognitive APIs, and edit the transcription; in a nutshell, to turn routine recordings into a fully qualified data lake. There is a lot of closed-source software that offers you the same kind of features, but more or less all of it uses the APIs from the big players; you know them. Okay, so the question here is always the same: what happens to my data when I use the services provided by Otter, Dictation, Happy Scribe and so on? In a nutshell, you just send your data to them.
So, LinTO Studio: I will show you a quick video of the platform, but here you have all the functionalities. Note the link currently displayed: you will find a link to just use our alpha version, which is online and free; you can create your account and try it yourself right after this session, and you will find the link to our GitHub pages to download and work with the source code. LinTO Studio enables you to use the APIs I've been talking about to get automatic transcripts with our modified runtimes for Whisper, the Whisper by OpenAI. We also enable speaker and turn identification, and everything I've been talking about before. Just note that the platform is a web platform where you can collaborate in real time, using organization roles, and share resources within the platform. It ships with a companion Android application that you can use to record. A final slide before I move to the quick video: as my colleagues presented their work on large language models, of course we also want to leverage those technologies within LinTO Studio and add the kind of feature I'm drafting here in the picture, to work with the documents loaded into the platform and ask things of large language models. Okay, so here I jump to the video. I recorded this yesterday. Here on the left, I'm currently recording something, sorry, I'm recording with live transcription.
Okay, whenever I'm done, I just stop. I can navigate local files and listen back to what I recorded, but what I want to do is send this recording directly to the platform, which is of course the big window displayed on the right. So I send it to the platform; I choose the language and the model I want to use, then the media I uploaded just lands in the platform, and here I can see that the transcription includes capitalization and normalization. I can also explore the platform. As I told you, it is a media management solution, so it's a multi-user platform where everyone can create accounts and use roles within an organization. Here I just showcase the way you might invite users and assign roles within a given organization. Here I show the share mechanisms, which are a total rip-off of Notion's way of doing things, and I'm proud of it; it works flawlessly. I can share with external users as well, and an email is sent automatically when I share a transcription with a user. Okay, here I jump to the editor, where, as you see, I can use AI insights, which are our NLP APIs. You just click on the one you want to use and start generation, for identifying things in your text like named entities, locations, decisions and topics, and putting highlights. You can also manipulate the text and add manual highlights to annotate it. We also have another editor, which is also very neat, where you can basically build the SRT or VTT: you work with the screens, you have the current screen in the center, you can arrange them and their timing, and you can of course correct the text, which lets you add closed captions that you want to burn onto the video directly. Here's how I navigate within the platform: I can just use tags to fetch the document I'm looking for, also using full-text search and so on. And once again, going back to this recording, I can show you here that I can also add some corrections to the text and change speakers, which is a
real-time collaboration, with reconciliation of multiple users editing the text. And finally, as you saw, we can export the document. Okay, that's our platform demonstrated in a nutshell. I took ten minutes for this presentation, hoping for some questions from you. Thank you for this presentation; I have two questions, one of them technical and the other one about money. I'll start with the money. This specific project, how is it sustained? Do you have revenue for this specific project, and what's the business model? And the second question: what kind of computing power do you need to run this, for a small organization maybe? Okay, so the goal here for our business is very clear: we at LINAGORA offer services for tuning models. This particular platform is also intended to become a SaaS service, where at some point, when we have time to develop a subscription for it, users will be able to use our system as SaaS. But the source remains free, and it can be hosted on premise with the same features. As always at LINAGORA, we have no premium plan or whatever; we just feel that it's convenient to also host the solution directly as a SaaS offer. The other question was about computing power. It requires quite a lot, but we batch the processing of the transcriptions and the large-model inferences, and we provide the best default way of doing things. If you dig into the code, you'll see that our runtime supports pretty much everything you can dream of. We can run on CPU, though of course it will be a little bit slow; we work on CPU with Intel extensions for Transformers and so on, and we of course work on GPU if you want to process a large batch of transcriptions when hosting on premises. Any other questions? We have time for one more. How do you handle a typically French language setting, which is irony? Because of the keywords and so on, the typically French thing which is irony, meaning that the
speaker means exactly the opposite of what he says. He's asking how we deal with the irony of the French language. Of course, by using, you know, the irony mark; you know, this one. Thank you, Damien. All right, we're going to start the next talk here in two minutes.
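The subtitle editor shown in the LinTO Studio demo ultimately produces SRT or VTT cues, and the timestamp formatting is the small, well-defined core of that. A minimal sketch of SRT cue generation (my own helper, independent of LinTO's actual code):

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index, start, end, text):
    """Build one numbered SRT cue block."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

print(srt_cue(1, 0.0, 2.5, "Hello everyone."))
```

WebVTT differs mainly in using a dot instead of a comma in the millisecond separator, so the same structure covers both export formats.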
LangChain From 0 To 1: Unveiling the Power of LLM Programming
Hi, y'all. I have the privilege of introducing you to Stefano. He is from Italy, from the middle of the Italian coast. He has been a Linux enthusiast for 20 years, got me on that one, and his focus is on VoIP, interestingly enough. This is his tenth FOSDEM, and his favorite animal is himself after four beers. Very appropriate. Everyone, welcome Stefano. Thank you. One of my hobbies was caving. I spent ten years going into caves, descending pitches with ropes, crawling into mud, and doing those awful things. The reason for doing that is that, the very few times I had the chance to be the first one in an unknown place, it was awesome. When you are in an unknown place, you face some dangers, but you also have infinite possibilities. Beyond the light of your headlamp there could be anything: a river, a beach, kilometers of unexplored passages, who knows. And I feel the same about AI today. And I'd really love to increase the power of your headlamp today, so I'm going to kick-start you into LangChain. This is the GitHub page for the talk, where you can find the proof-of-concept code and the presentation itself. It's better if you look at the code during the presentation. We'll explore LangChain using one of its notable use cases, that is retrieval-augmented generation. For doing that, we will look at some of its components and concepts: document loaders, text splitters, embeddings, vector stores, retrievers, prompts and templates for generating prompts, large language models, of course, and finally we'll combine some of those together in a chain. Then I'll experience the adrenaline of a live demo, and maybe we will take a look at some other notable use cases. Let's talk about our main quest first, that is retrieval-augmented generation. This cutting-edge technique involves giving additional data to the LLM to enhance its responses.
It's interesting because, when you give additional data to the LLM, the answers become more precise and relevant; it also allows the citation of sources, and allows responding about data that is not in the training dataset, which could even be personal data or real-time data. It's a much-discussed topic, and it's an intriguing case for showcasing LangChain. This is the scheme of what we want to obtain. Multiple use cases exist for retrieval-augmented generation; we will look at the simple one, that is question answering over unstructured data. We will take some text, that is our unstructured data, and put it into a storage. Then we will ask a question and use the data from the storage to help the LLM answer the question. Let's look at it in more detail. We will take the transcript of a YouTube video and load it into a usable format. Then we will split it into smaller parts and compute a vector representation, also known as embeddings, of this data, and store it into a database. Then we will ask a question, compute the vector representation of the question, and use this vector representation to find similar documents. Then we will put the question and the retrieved documents into the prompt and give it to the large language model. If you're thinking that it's complex, I assure you that it's not, and it fits in a few lines of code. If you're thinking that it's trivial or worthless, I assure you that's not the case either, because there are a lot of concepts behind it. Why use LangChain? LangChain is a framework for developing LLM-powered applications. It offers us a lot of ready-to-use, off-the-shelf components and building blocks that make our life easier. Should we take our code to production, it also has components that make that easier for us, and it has a lot of samples to copy. It's fun because it has an extreme speed of improvement, and something interesting comes out of its community continuously.
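The pipeline just described, load, split, embed, store, then retrieve, prompt, and generate, can be sketched end to end with toy stand-ins for every component. Nothing below is LangChain's actual API; each function is a deliberately naive placeholder for the real loader, embedding model, vector store and LLM:

```python
def load(source):
    # 1. Document loader stand-in: a real one would fetch the transcript.
    return "LangChain is a framework for LLM apps. It has loaders and chains."

def split(text, size=40):
    # 2. Text splitter stand-in: naive fixed-size chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text):
    # 3. Embedding stand-in: a letter-frequency histogram instead of a model.
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

def similarity(a, b):
    # Similarity stand-in: plain dot product.
    return sum(x * y for x, y in zip(a, b))

def answer(question, source):
    chunks = split(load(source))                              # load + split
    store = [(chunk, embed(chunk)) for chunk in chunks]       # 4. vector store
    q = embed(question)                                       # 5. embed question
    best = max(store, key=lambda item: similarity(item[1], q))  # 6. retrieve
    prompt = f"Context: {best[0]}\nQuestion: {question}"      # 7. build prompt
    return prompt  # 8. a real pipeline would now send this to the LLM

print(answer("What is LangChain?", "https://example.invalid/video"))
```

The structure mirrors the steps in the talk one-to-one; the rest of the presentation replaces each stand-in with a real LangChain component.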
On the other hand, it's very young, and breaking changes may happen, but we like risk. We are using Python; LangChain is also available in TypeScript, but that's not my cup of tea. We also have our main requirements: LangChain, of course; OpenAI, which we will use as embeddings and LLM provider; and ChromaDB as the vector store. Since we're using OpenAI, we will provide an API key. Okay. In this part, we prepare and store our data. We will use four components: a document loader to get our data and convert it into a usable format; a text splitter to divide the document into smaller, meaningful units; an embedding function to compute the vector representation; and the vector store to store our vectors. The document loader is an object that takes data from various sources and transforms it into a usable format, that is, a document. Multiple sources are available: for instance, we can have files like PDFs or text files, web pages, cloud storage such as Amazon S3 or Google Drive, social media like Reddit, Twitter and GitHub, papers, and of course YouTube transcripts. It's also very easy to write your own if you don't find something that fits what you need: you can just extend the base loader class. This is our document loader; we are using the YouTube loader from the LangChain community. It will take the transcript of our video and put it into the document class. This is the document class: it has a page content string that will hold the transcript of our video, and a metadata dictionary that will have a key source with the URL of our video. Now that we have our document, we want to split it into smaller, meaningful units. Why do we want to split it? Well, for three reasons. The first one is that the input size of our LLM is limited, so we want to give it smaller pieces.
The second one is that, like me, our LLM tends to be easily distracted, so we want to increase the signal-to-noise ratio as much as possible and avoid distracting it with useless information. So we will choose only the pieces important for answering the question. And the third reason is that usually we pay per token, so the more we give, the more we pay. We can think of five levels of text splitting, from simple to complex. The simplest is splitting by just counting characters or tokens. This is simple and easy, but it has a problem: we will probably end up splitting in the middle of a word or a phrase. The second level addresses this problem, and that is recursive splitting. It recursively tries to split the text on special characters like newlines or punctuation, then combines the resulting pieces together until the specified maximum length is reached. The third level looks at the document structure, which works for HTML files, Markdown, or code. Then there are semantic chunkers, still experimental in LangChain, which are very interesting because they combine phrases together only if they are similar, using embeddings to compute similarity. The last level is highly experimental: asking an LLM to split our text. It is highly experimental and also very expensive; it probably makes sense only if you think the cost per token is going to zero. We are using the recursive character text splitter, that is the second level, and it's a good default choice. We can specify the length of the chunks and whether we want some overlap. There's no golden rule about that, so you may want to try what works best for you. Okay, now we have our documents, and we want to compute the embeddings. The embeddings are a vector representation in a high-dimensional space. That means that we take our data and represent it as a vector. Each dimension of this vector reflects an aspect of the context or meaning of our data. There are thousands of those dimensions.
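The idea that texts live as vectors in this space can be made concrete with cosine similarity, the usual way of comparing embedding vectors. A stdlib-only sketch with tiny made-up three-dimensional vectors (real embeddings have thousands of dimensions and come from a model, not from hand-picked numbers):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "vector store": texts mapped to made-up embedding vectors.
store = {
    "caving is my hobby":      [0.9, 0.1, 0.0],
    "LLMs answer questions":   [0.1, 0.9, 0.2],
    "vector stores hold data": [0.0, 0.3, 0.9],
}

# Pretend embedding of the question "what do LLMs do?".
query = [0.2, 0.8, 0.1]
best = max(store, key=lambda text: cosine_similarity(store[text], query))
print(best)  # the text whose vector points in the most similar direction
```

This nearest-vector lookup is exactly what the vector store does at query time, just over thousands of dimensions and millions of chunks.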
If two pieces of text are similar, they are next to each other in the embedding space. That means we can compute the similarity of two pieces of text just by measuring the distance between their vectors. It seems complex, but for us it's very easy, because it's just a function that we use when we create the vector store. We are using an external provider here, OpenAI. About privacy: obviously, if you use an external provider to compute embeddings, you are sending your data to that provider. We now have a vector representation of our data, and our data is split. We want to store it in a vector store. A vector store is a database that is tailored for storing and searching embeddings. We are using ChromaDB here; it is open source and very easy to set up. This is the initialization, and, as we said before, we pass the OpenAI embedding function to it when we initialize it. These are the most used vector stores according to LangChain's State of AI report for 2023. ChromaDB is in first place; FAISS, which is also open source, comes from Meta; and Pinecone is a very popular cloud vector store. Okay, we now have our data in the vector store, and we want to use it. We will use four main components here: a retriever to search for documents similar to our question, a prompt that will give the LLM instructions on the output it should produce, the LLM itself, which is the heart, lungs, and brain of our application, and finally a chain that combines those three together. Okay, the retriever is an object that is responsible for searching for documents that are relevant to answering our question. The simple retriever does this just by computing the vector representation of our question and searching for documents that are near that vector in the embedding space. This is the simple retriever. LangChain also offers more advanced retrievers, like this one: the multi-query retriever.
It uses the LLM component to formulate variations of our question, then uses the embeddings of those variations to search for similar documents, similar and hopefully relevant to answering our question. Now that we have similar documents, we can put them into the prompt, and give the prompt to the LLM. This is the prompt we are using. A prompt is just a template with the instructions for our LLM and, in this case, two variables: the context, which will be our documents, and the question itself. I won't delve into the details, because it's just a template. Also, we can take this prompt from the LangChain hub: LangChain features a hub with prompts and other off-the-shelf components that we can use. We have the prompt; we want to give it to the LLM. We are using OpenAI's LLM, and this is how we initialize it. I set streaming, the first parameter, because it really improves the user experience, and temperature zero means that we don't want creativity or hallucination; we just want precise answers. Maybe you can argue that I should have used a different LLM provider, but nobody gets fired for buying OpenAI, so I chose that. These are the most used LLM providers, again from LangChain's State of AI report. OpenAI is in first place. I'd like to rant a bit about that, because Claude, the third on that list, is available almost everywhere in the world except Europe. This week the Italian data protection authority went after OpenAI over privacy issues again. I know that there are a lot of privacy advocates here, and I also care about user privacy, but I think that defending users' rights shouldn't mean going to war against them. That's my two cents. These are the most used open-source providers. It's interesting because the first three have very different business models: the first one rents hardware, the second has a cost per token, and the third is for self-hosting.
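The "prompt is just a template" point can be made concrete with plain string formatting. This sketch is illustrative (the wording and variable names are made up, not the exact prompt used in the talk), but it shows the two slots, context and question, that a RAG prompt fills in:

```python
# A minimal RAG prompt template: a string with two placeholders,
# filled in right before the call to the LLM. (Illustrative sketch.)
RAG_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context is not enough, say \"I don't know\".\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)

def build_prompt(docs, question):
    # Join the retrieved chunks into a single context block.
    context = "\n\n".join(docs)
    return RAG_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    ["LangChain is a framework for LLM applications."],
    "What is LangChain?",
)
```

The instruction to answer "I don't know" when the context is insufficient is what produces the behavior shown later in the demo, where the LLM refuses to answer a question the retriever found nothing for.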
We have now gathered all the components, and we want to put them together. This is all the components called one after another: we have our question, we pass the question to the retriever, and we get a list of documents. The list of documents is joined together into the context variable, the context variable is used in the template to generate the prompt, and the prompt is given to the LLM. It works, nice and easy, but we can do better, and that is putting everything together using a chain. A chain is a sequence of components that performs a function, and it's better than just calling the components one after another because it has several advantages: it offers sync and async support, which allows us to take our code directly into production without changing it; it has advantages for observability; and it integrates very well with the other LangChain components that are used to take code into production. This is the same code put together using the LangChain Expression Language, LCEL, which is a new way of writing these chains. It's an acquired taste, and it's quite new (it's from September), but I find it very useful once you get used to it. Okay, let's see how this works. This is our code, and there are two examples: one uses the chain, one does not. This is the one that doesn't use it, and it's just a few lines of code. It's very easy. Okay, I forgot the OpenAI key. Of course it doesn't work. I'm not connected, you're right. Okay, I have a backup video. No, no. By the way, this is just to give you an idea of the pace of calling the various components; the part that takes the most time is computing the embeddings, and this is the streaming output. Okay, I have prepared some questions, which are these, and they go by too fast, sorry. I gave the question to the LLM, and this is the output of the LLM.
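As a toy illustration of the pipeline just described (question to retriever, retrieved documents to context, context plus question to prompt, prompt to LLM), here is a self-contained sketch. The bag-of-words "embedding", the stub LLM, and the two sample documents are all stand-ins for illustration; none of this is LangChain's actual API:

```python
import math
from collections import Counter

DOCS = [
    "LangChain chains components like retrievers, prompts and LLMs.",
    "ChromaDB is an open-source vector store.",
]

def embed(text):
    # Toy embedding: word counts. Real embeddings are dense vectors
    # with thousands of dimensions, computed by a model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Similarity = closeness in the embedding space.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, k=1):
    # Embed the question and return the k nearest documents.
    q = embed(question)
    return sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(docs, question):
    context = "\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}"

def fake_llm(prompt):
    # Stand-in for the model call: echo the first context line.
    return prompt.split("\n")[1]

def chain(question):
    # The components called one after another, as in the talk.
    return fake_llm(build_prompt(retrieve(question), question))

answer = chain("What is ChromaDB?")
```

LCEL expresses the same composition declaratively with the `|` operator instead of nested calls, and in exchange gives you the sync/async and observability benefits mentioned above.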
Also, okay, this one is nice, because here the retriever wasn't able to find the answer to this question, so it couldn't give us a response, and the LLM told us: I don't know. I'm not sure I can move forward. Maybe I also have it for LCEL. The LCEL version uses the multi-query retriever, so you will see now that it asks multiple questions: each question is transformed into multiple questions. This is slow, I'm sorry. Okay, those are the questions, and this is the answer that came out. Okay. There are also other interesting use cases of LangChain. We looked at the simple one, which is question answering over unstructured data. Question answering over structured data is also very interesting. That one uses the LLM component to convert our question into a SQL query, which is executed, and the result of the query is used to compose the answer of our LLM. It's very interesting. Another one is data extraction. You just have to provide a JSON schema and some unstructured text, and the JSON schema is automatically filled in with the data from the unstructured text; the LLM understands what to put into the JSON schema. It's interesting because there are people paid for doing that work. Summarization is very useful, and it has a lot of, let's say, problems; it's an open problem, but very interesting and useful. Then there is synthetic data generation, which is useful if you want to fine-tune a model, or maybe if you want to anonymize some data. It works like data extraction backwards: you have a JSON schema, and the LLM generates unstructured text that contains data that would fit into the JSON schema. Finally, there are agents, which are a key concept of LangChain, and they are very fun. With agents, the LLM takes charge of choosing what action to do. It's worth studying; it's very interesting. Okay, that's it. So, thank you. Do you have any questions? I saw his hand first. Thank you. Very interesting. My question is: how does this scale?
You showed an example in which we have just one transcript. What if we had billions of transcripts? I didn't see any mention of the ranking of the retrieved chunks. If you could elaborate a little on that, it would be very good. Thanks. Okay, LangChain helps to take this into production; this was a proof of concept, so you can take it into production, but that's out of the scope of this talk. This was LangChain from zero to one, so scaling is from one to one hundred. You can find a lot of examples of how to take this into production. If you take a look at the GitHub repository, there is also a link on how the LangChain people use this in production, with a chatbot that helps with searching the LangChain documentation. You can find the code, and it's very interesting; if you want to take it into production, it's worth copying that code. It's the best practice. Did I answer your question? I'm sure you saw this one coming. If I have some money to spend on hardware and I want to run an LLM: there is a lot of proprietary intelligence that you use, like the embeddings in particular, and also the other part, on the query side at the end of the chain. How difficult is it to do this without using OpenAI? It's really easy, because LangChain allows you to swap those components. I used OpenAI here because it's the easy way to have a result. But if you, for instance, use Ollama, you can self-host the LLM and ask it questions, or maybe with Hugging Face you can rent hardware and run your open-source model on their hardware. So it's easy, because those components are swappable. All right, y'all. Let's give Stefano one more round of applause.
ML Guided Optimizations in LLVM
Can you hear me okay? Cool. Okay. So if sound doesn't work for the rest of the presentation, this is basically the key of it. I'm a compiler engineer, not an ML specialist, so, kind of a heads up: if I say something wrong about ML, that's why. You can use ML in an industrial compiler, which is LLVM. Actually, show of hands: has anyone heard about LLVM, Clang? Cool. Okay. About half. I have a slide about that too. So out of the box, actually, as of Clang 17 (it's not very well documented, because it's still work in progress), you can actually connect to Clang and train models. So there's an interface just for training; it's a Gym kind of interface. I think that means something to the ML community; if not, tell me. And this is not vaporware, in the sense that we actually use it for real. So, I mean, you can read what's there, but we've been using it for almost four years now, and we have some experience with it. And most of the talk is actually about trying to get to point three there, which is what we've learned; the rest of it is setup. Okay. So LLVM, for those that did not raise their hand, is an open-source project. It's a compiler. Actually, LLVM itself is a library. It defines an intermediate representation; that's what IR stands for. It contains state-of-the-art optimizations. It also knows how to lower to x86 or Arm or other targets. And then Clang compiles C or C++ down to LLVM IR, so basically Clang is built on top of LLVM. So is Swift; there's a Rust compiler, and there's a Fortran compiler as well. And the LLVM project is bigger than this: there's a full toolchain there, debugger, linker, all of that. Actually, a shameless plug for the LLVM community that I'm part of: there's a dev room this afternoon here somewhere. To us at Google (so, I work at Google), C and C++ are very important.
Basically, anything that is performance-critical, which is basically everything, is written in C or C++. When we say C and C++, I really mean LLVM. And when I talk about LLVM, I mean LLVM at the tip of the tree on GitHub. So we don't have a special fork or anything like that, and we really chase the head, plus or minus usually two weeks; we're very close to the head all the time, and we have a release team that keeps it basically in sync. And even small performance improvements matter, because a 1% saving across the fleet really means that much less hardware you have to buy, power you have to produce or consume, et cetera. And we keep doing this. All the performance improvements that we make are small, but they're constant, and it's like interest: it compounds. Our binaries, no shocker, serve RPC requests. No surprise there. The key thing is that to optimize these things, there are many things you can do, but as compiler engineers we're primarily occupied with how to make the RPC request complete quickly. And the RPC request traverses a lot of code. Most of it is actually not the code that you want to execute: there are things like the networking stack, serialization, deserialization, security, and so on. All of those things are reusable code, and they try to be generic, which is the exact opposite of what I want for performance. Because for performance, I want the code to be as specialized as possible to what I'm actually doing; I don't want it to be generic, right? And for that reason, actually, the biggest levers that we have for performance are, first, profiles: we collect profiles that tell us where the program is actually spending time, and then we reoptimize it, so we recompile it with them. And second, link-time optimizations, where we can look at the whole program and, based on that understanding, try to make the right decisions.
So things are big: lots of data, lots of instructions to execute, nothing fits in any cache. I'm not being ambiguous there; I'm being actually precise. No cache fits the data that we're talking about, neither the instructions nor the actual data being processed. So that's why optimizations like inlining are very impactful: they contextualize, so they specialize things down to what you actually really have to execute. And then you end up with large functions, which means that optimizations like register allocation have a big problem to solve. What am I doing? Okay. Here we go. Okay. Which kind of gets us to why we want to do ML. So we want to do ML because we're looking at problems that are, sorry, sequential decision making. So inlining is about: hey, is this call site worth inlining? Sure. Okay. Fine. Well, the program just changed now, right? So what about this other call site? Is it still worth inlining? Maybe not. So as you go along, the state of the problem that you're trying to optimize changes. We don't have an oracle that tells us what's the perfect optimization decision, especially at the scale that we're talking about. I'm kind of getting us to say reinforcement learning, probably no surprise to an ML community. Because otherwise, what we do is: we have heuristics that can only operate on local information, because that's the information we can actually make sense of. And we have evidence that they're not good enough, in the sense that we know that if we play a bit with them, we can find headroom in optimization. But we cannot constantly twizzle with them; we want something a bit more systematic. So that's why we are interested in ML. We are also scared of ML, because the compiler is about everything that ML is not. So the compiler must be correct. I don't think that's a surprise to anyone, but it's non-negotiable.
The compiler must be deterministic, again because otherwise it's something that you cannot live with: it would take forever to compile things, because we could not do incremental builds. So ML, at least naively, felt to us like something more analog, more fuzzy, and that's not what we are about. So how did we go about it? Well, first, we're not asking ML to deal with correctness. In the compiler code that makes decisions like inlining and register allocation, we kind of already had a separation between what's correct and what's a policy choice. So there are certain things that are illegal to do, so we don't do them; we don't even wonder whether they would be valuable. We just don't do them. What we did here is we stressed that boundary even more: we created a very clear interface between ML questions, that is, heuristic or policy questions, and correctness issues. So the correctness stuff is written in normal imperative C/C++ code that we can all look at and agree that it's actually correct, modulo bugs, as always. But then, out of the choices that are equally correct, we go and ask ML which one we should make. To the end user, we don't want to expose any of this, not because it's a shame or anything, but because the more different the compiler looks, the more difficult it would be to adopt it. So how about we make it look the same as it is today, which means no new dependencies, nothing extra, just additional flags? That's something that is fine. Which really means that when we give the compiler to the user, we need to embed the models inside and not show any sort of dependency on an inference engine or anything like that. But for training, it's totally different.
So for training, we're totally cool with depending on TensorFlow and whatever, random generators in the weights, all of that is fine, because that's training. And actually, we're fine with compiling a different compiler just for training, because that's not for everybody, right? It's just for whoever does the training activity, which we also want to be rare, because we don't want to keep training as you're trying to ship a product. We give you the compiler, and then hopefully the models are good enough, just like heuristics today, to resist the changes that people make to their code. So basically, there are two types of interfaces that we ended up having. One is between compiler and policy, and that's domain-specific. What I mean is that there's a different question that you ask as an inlining pass from the one that you ask as a register allocator, from the one that you ask as an instruction selector or something like that. But the ML abstraction, the way we interact with the ML, is common, because fundamentally ML to us looks like a function that we pass a bunch of tensors to, and it comes back with an answer. How it's implemented is irrelevant from the perspective of the interface, and the implementations that we have are either ahead-of-time compiled, like I mentioned, or interpreters using TFLite, like people do in embedded, or, for the Gym case, we're actually doing IPC over pipes. So the state in LLVM today: if you go to GitHub and you pull LLVM down, you basically have everything that you need to add ML to a pass if you're a compiler engineer. It's TensorFlow-centric, no surprise there, but it doesn't have to be.
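The two ideas above, imperative correctness code filtering the legal choices, and the policy being just a function from feature tensors to an answer, can be sketched like this. This is an illustrative toy in Python (LLVM's actual code is C++, and these feature names and thresholds are made up); the point is that a hand-written heuristic and a trained model are interchangeable behind the same interface:

```python
# Illustrative sketch, not LLVM's actual code.
def legal_inlining_candidates(call_sites):
    # Correctness lives in plain imperative code: illegal choices
    # (here, a made-up "recursive" flag) are never offered to the policy.
    return [cs for cs in call_sites if not cs["recursive"]]

def heuristic_policy(features):
    # A hand-written policy: inline small callees.
    return features["callee_size"] < 50

def make_model_policy(weights):
    # A "model" policy has the same signature: features in, answer out.
    def policy(features):
        score = sum(weights[k] * features[k] for k in weights)
        return score > 0
    return policy

def run_inliner(call_sites, policy):
    # The pass only consults the policy on choices that are legal anyway.
    legal = legal_inlining_candidates(call_sites)
    return [cs["name"] for cs in legal if policy(cs)]

SITES = [
    {"name": "a", "recursive": False, "callee_size": 10},
    {"name": "b", "recursive": True,  "callee_size": 10},
    {"name": "c", "recursive": False, "callee_size": 500},
]
inlined = run_inliner(SITES, heuristic_policy)
```

Swapping `heuristic_policy` for `make_model_policy(...)` changes which equally-correct choice gets made, but can never make the compiler do something illegal, which is the separation the talk is describing.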
So the abstraction that I mentioned earlier can take, for instance, PyTorch: you can plug in PyTorch or anything like that. I mean, we made a pipe-based protocol work over that abstraction, so it's clearly not TensorFlow-specific. Then there are tools that are generic, other utilities, like how you collect a corpus for training; that's also in LLVM. We used to have them in a different repository, also open source, but they make more sense going into LLVM. The training tools that we use (so, for example, the Fuchsia operating system that I had on an earlier slide trains using those tools) are available there as a reference. But if you are a researcher, you probably want to use something like CompilerGym, which is more research-friendly. So there are kind of different concerns in these tools. And then, using the tooling that I mentioned, there's another body of work that produced a large corpus of IR that you can use for whatever you want: training for these purposes, or maybe doing LLM training or anything like that. There are links there; in fact, all the links are in the slides, so when you go to FOSDEM and you see the talk, they're there. Okay, what we learned; that's what I wanted to get to, and I'm doing well with time. Okay, so the "it works" thing. There's been work doing ML with compilers in academia, but there's a big difference between that and actually shipping a product, shipping a compiler for production teams. So the key thing is that, at least for the size problem, we have evidence from the Fuchsia team that it can work completely, meaning they periodically, about every month, pull LLVM and retrain a model on their code base, all on vanilla build bots, so normal, CPU-based machines.
They train for about a day or so, and at the end of that they produce a compiler that optimizes for size, because that's what they care about. There are links down there, I think, like an example of such a build bot. So this can all be done completely openly. And the key thing also is that it works turnkey, meaning you don't need someone to go and pay attention to it; it just works, repeatedly. And it's been working like this for almost four years now, which is good: we have a signal that we can have an industrial process that produces an optimized compiler on a cadence. Okay, here's what didn't work. Performance is hard. Okay, so you are ML experts; you are not surprised by the statement that, for reinforcement learning, the quality of the reward is very important. We understood that too; okay, it makes sense. However, for performance, the problem is a bit tricky. It goes like this: you cannot just say, oh, let's run programs and see how well they run, because it takes time to build a program, and it takes time to run it. So either you do it very quickly, which means you're doing it on small little benchmarks, which are completely irrelevant to what we're doing, so then basically you learn on something whose feature value distributions have no match in what we're actually going to use it for, and we don't want to do that; or you just cannot do it, because it takes too much time. So we were like, hold on a second, but we have profile information, like I talked about earlier. We collect this profile information that tells us where the program spends time and how many iterations loops take and all of that. So can't we do something based on that, that kind of guesstimates at least a trend?
We don't care about absolute values, but we want at least something that allows us to compare the results of applying a new policy to a baseline. And we thought we could, and it kind of worked, for register allocation. But we ended up having to select a winning model out of a set of models that we trained with this synthetic reward, and we're not very happy with that. How to put this: we're missing the explanatory part, like, well, why? If I train, for how long do I have to train? And what do I have to look at, when I look at the TensorFlow reward graphs and all of that, to know that I have to take the model out and compare these models on actually running benchmarks? It's basically a bit of whack-a-mole, and that's not engineering; that's whack-a-mole, right? So this is basically the main challenge for performance, and for scaling this effort to more performance problems; and, well, there are efforts on that, of course. Okay: ML model evaluation costs. In the big scheme of things, when we did inlining for size, or register allocation, we did micro-measurements of how much it takes to evaluate the model. But in the big scheme of the entire compilation of a module, of a C++ file basically, it kind of goes into the noise; it was more like a few percent variation, and it's fine. But it's not going to be that funny if the methodology gains traction and there start to be lots of these things that each take a lot of time. Also the size of the model, which is really the weights, was kind of surprising to us.
Initially we had a small one, and then, working with some researchers on other teams at Google, they managed to produce a much, much larger model, kind of accidentally, which took us by surprise: it was suddenly 11 megs, out of nowhere. And it's kind of funny when you're trying to optimize something to reduce the size of the binary, and LLVM itself blows up, right? I think these are more things that caught us by surprise, and to our understanding, talking to ML experts, there are ways to mitigate this. But we kind of learned that we look a lot more like an embedded scenario than we imagined, basically. So, kind of an interesting research topic; I think it's interesting at least to us as compiler engineers, but it's rather a research topic for the ML community: how would we know, without having to actually compare the results, that a policy loses power, if you will? Like I was saying, people like Fuchsia, for example, train a policy, and then they just decided, well, we'll just retrain one automatically whenever we produce a new toolchain. But is that overly aggressive? Or was it about time to do that anyway? It would be great to have a signal that tells you: hey, hypothetically, maybe the feature value distribution changed, and it's out of the domain that the model was actually trained on, so, hint hint, nudge nudge, maybe it's time to retrain. But we don't know if that's actually the right indicator. So that's what I say: I think it's an interesting topic that would be valuable to us, because it would give us an early indicator purely based on compiling. We can run the compiler and just observe these values as we compile; you don't have to do benchmarking for that. Oh, so, in retrospect. So this is the honest truth. The first statement is true: we thought, right, we were convinced, that ML was magical.
And that we would get these policies that are awesome, at least never regressing, improving things; there would be no regressions and things would be great. And then we saw that all of them have the typical pattern that we also have with manually written heuristics, which is: some things regress, some things improve. That's how things are, I suppose. And maybe we can do something better than that, with additional policies that select the right one, but that was a bit of a surprise to us. Okay, performance. Like I was saying, performance has some issues, but we went ahead and looked at where the trained model finds opportunities for additional savings. Taking a step back: what do I do as a compiler engineer in these sorts of cases? I look with the Linux perf tool at runtime information, and I see where it's red, where there are hotspots. Then I think really hard, and look at the compiler and why it made those decisions, and I go and fix that, and then the red turns gray or green, and sweet, right? And then I have to do it again and again, until I make sure that there are no regressions in other parts of the code base. That is basically what you do in that case. So we looked at functions where we had indicators, both in the reward signal, as poor as it was (it was indicating that the model was doing better), and empirically (we looked at them, and yeah, they were doing better). And we were like, well, why? So we looked at the code, and we couldn't tell why. We looked with Linux perf, and there was nothing shining. I mean, the code was different; we could tell that, line by line, it was different, but nothing was popping. And then we did a bit more investigation.
And it turns out that the ML, the reinforcement learning algorithm, was finding opportunities in lukewarm parts of the code. So these are things that end up being a peanut-butter effect: nothing in particular is bad, or improved categorically, but in aggregate you get a spread effect that actually amounts to something. Great, but it's possible that that something is actually just noise, right? And today we don't have a way of capturing that. We just say: hey, here's the profile that we got by collecting it from a running binary, and the ML says: great, here I found an opportunity; and actually that's just purely noise. So, this is the part where I had a bit of trouble deciding how to title it, so what I ended up doing is just saying what I wanted to say. As a compiler engineer, as a developer in open source, as an LLVM compiler engineer: if this pans out, if we get more passes where ML is actually delivering more and more value to us, what's going to happen? Well, on the plus side, I spend less time tuning and twizzling with thresholds and other flags that I have today in the compiler, because I can actually use an automatic, feedback-driven, self-improving methodology: reinforcement learning. I think that's great, because I can actually focus on understanding what actually matters for driving that performance, like what features are important, stuff like that. The barrier to entry, though, might change. So today you can use a cheap machine (not this one, but a cheap machine), compile the compiler, and look at performance optimization problems, and it's all fine.
And ML, at least my view of it, is that it has this risk of quickly skidding into: oh, you need a farm of computers. And today that's not the case; like I was saying, with what we've been doing, the models are small, so we didn't hit that problem. But that's a consideration: is it going to be harder for the aspiring compiler engineer of the future to enter the field, or not? The mental model is kind of different; I was hinting at that before. You don't think of the problem like you did before, where you look at Linux perf and you find hotspots and stuff like that. But that's fine. Different just means different; it means we can adapt. This is my pet peeve: when you look, as a compiler engineer, at the ML frameworks, they are scary, because they're very low-level, and they talk about things that I don't understand, and they don't talk about the things that I want them to talk about. And we're not sure yet where that interface is; I think that part of the goal of the project is to figure out what that interface is. But today it's like that. Like I was saying, all the links are in the deck. And that's the end of my presentation. Yeah, questions. The optimizations that you find using machine learning in code: can they also be put into LLVM itself, without using machine learning? Or can they only be learned using machine learning, because it is using the data, for instance? So, the optimizations: can they also be put into LLVM itself without using machine learning? Is something missing? Is LLVM missing something? Right. So, just to make sure I understand what you're saying: the types of optimizations that we learned, could we just do them as normal imperative code back in LLVM? Some yes, some no.
So especially when we looked at the type of optimizations that the size optimizer was doing, some decisions are unexplainable, right? It does the wrong thing early on, but because it kept learning, statistically, taking that path is going to work out later. So that's kind of hard to translate into imperative code, I think. But some might be. What I'm saying is that the evidence so far is that it's hard to do that. We only have time for one more question, one more question after this. Hi, thanks for your great talk. You've been talking about applying these techniques to Clang and traditional compilers targeting, well, executables in the usual sense. What about machine learning compilers? So I'm thinking, yeah, applying ML to ML. I know there is some research in that. Do your techniques connect to that? Yes. So applying ML to ML compilers, right? MLIR, for example, is part of the LLVM project, and I think there is work trying to do that too. And the infrastructure would be the same, because it's all the same code. I'm not an ML-for-ML-compilers compiler engineer, the word compiler appears way too many times, but we work with those people, so I don't see a reason they cannot apply this. I think that the domain, though, has its own idiosyncrasies, so you cannot just take this exactly as it is and apply it over there, but the tooling would be the same. Does that make sense? Okay. One more question. All the way up there, really? Hi. I saw during the slides that one of the problems is that you are not really aware whether choosing one tree, a representational tree of the semantics you are trying to compile, is going to be better or worse compared to another tree that you didn't pick. And I was wondering, are we using operations research theory?
I mean, all the mixed integer linear programming theory that gives you a model of reality and helps you understand how far you are from the optimal value of a certain representation. So, I'm not sure I understood the question; let me try to say back what you're saying. Okay, yeah. Machine learning basically relies on a loss, on how far you are from a certain optimal value. And there's a branch of mathematics called operations research whose work tries to describe the world in an idealized manner: you describe how much a certain decision costs with respect to your objective value, compared to another decision, and you get a mathematical formula, and there's the simplex algorithm that helps you traverse those. Yeah, and I was wondering, are we trying to integrate those two fields of mathematics? So, let me give the answer, because it's also a matter of time, and if the answer doesn't make sense, let's talk. I think the key problem is understanding what that gap is, actually measuring it, and it goes back to the reward-signal thing. Should we apply what you said? Probably. Again, I'm not an expert in that, so if you think it's worth doing, great. But the problem you'll hit very quickly is that the reward, the signal that we give, is bad. So then probably the rest of it falls apart, right? We need to fix that first before we can apply these things. But yeah, absolutely, we should try all sorts of methodologies; that's the whole point. Did I make sense, or did I miss it? Okay, let's talk more. All right, everyone give our speaker another round of applause, please. All right, we're starting in about two more minutes, so please stick around. Don't forget, the desks are very loud; please hold them down, don't slam them. And we have the Matrix room up and running again.
Can you help me figure out how to make both mics work? Can you do that? Can you hold it and talk into it? I'll unmute it in a second. This, this. Yeah, yeah. Can you start? How about now? Hello? Can someone give me a thumbs up? No. Someone got a thumbs up? Hey, thanks Marty. One second. Huh. Nothing at all? Nothing? Okay, yeah, this is not working at all.
Open Source AI at TechWorks, the UK trade body for Electronic Systems Engineering
Okay, so our final talk today is by Jeremy Bennett here. I have some notes. So you live in Southampton, which is also in England, by the New Forest, which is almost a thousand years old. You spent some time in Paris and Nuremberg, which is great. You adore compilers, from what it seems from reading this. And you have acted with Hugh Grant? Wow, that's an interesting story. Alright, sir. Jeremy, take it away. Thank you very much. You can ask me later where and when I acted with Hugh Grant. Okay, this is our last talk; it's only a short talk and it's a bit of a long story. I want to talk to you about the work I do in my spare time, which William works on as well, with TechWorks. Anyone here heard of TechWorks? It's the trade body for electronic systems in the UK. And just in case you think that's not relevant, it's worth about 100 billion a year to the UK economy; it's about a million people working in that industry; it's 8% of the entire British economy. There's a reason why the minister turns up to the annual meeting and listens. So it's a powerful body, and you will certainly know the members: IBM, Arm, Cadence, Mentor, Siemens and the like. So it's a big body, and it covers a lot of things. It was originally the National Microelectronics Institute, and that's the one on the top right there that looks after silicon chip design. Going round, you've got the Power Electronics Group. You've got the UK Electronics Skills Foundation, which is the educational charity arm that oversees student internships going into universities across the country. There's TechNES, which is the embedded software group. There's AESIN, the automotive expert group that looks after the automotive industry. And lastly, there's the Internet of Things Security Foundation; I'll come back to that. Now, what are they doing here? Because they're not an open source organization, anything but. But part of our role as open source engineers is to educate the wider world about the merits of openness.
And I want to draw your attention to the Internet of Things Security Foundation, and that's what it says on their front page. Okay? The material is published as a contribution from industry, and you can download it, and you can download it for free. Okay? It's freely available to you, and indeed there's an example of one. And when we say free, we mean a Creative Commons attribution license, and that's a perfectly valid open license for what is, fundamentally, documentation. And so even though this has some of the biggest proprietary players amongst it, they have chosen to make their standardization work, their best-practice work, their guides for the engineers in the industry, fully open. And they were put together by an open process, and one of my open source engineers, you'll find his name in that document, because he wrote a big chunk of it. And that's where the open philosophy is something you sell to them. I was one of the group that sold the idea of doing this in the open, and I'm a founder member of the Internet of Things Security Foundation. So how does that apply to AI? Well, William and I have been heavily involved with AI at TechWorks. For the last year or two I have been co-chair, with Mike Bartley, of the AI initiative we've had going on under the hood for a long time. Most of our members are experienced professional engineers, and I think we heard a lot earlier from Stefania about the importance of education, but I'm particularly interested in the education of people who are already experienced. We've got lots of experienced engineers; how do you bring those people into a new industry? They've got their marketing guys telling them our new product's got to have AI, and that's probably about all the detail they get in their product spec, and they've got to implement it. So what TechWorks is trying to do is fill a gap in the market by making guidance available to those professional members it has.
And the initial thing we're going to start on is guidance on trustable AI, because that's seen as one of the barriers in our industry. And quite honestly, if you've got companies that are making jet engine controllers, you really want to trust any AI they put into them. And more generally, the professional engineer. So actually, what William and I have been working on, and you can join the meetings if you want to, the next thing that we've been doing, is the best practices guide. We're not trying to tell you how to do AI; we're giving you the pointers so you can do it. We're not duplicating what other people are doing. We're trying to provide you with a set of questions, a Q&A you can go to, to say: should I even be using AI in this product? If I should be using AI, what sort of AI? What are the questions and risks I need to address? And the idea is that if you're an engineer, but you don't know AI, it'll help you make a good job of your first project and subsequent projects. And hot news: this is, I think, the first public meeting at which this has been announced. TechWorks announced its new AI innovation cross working group. It's a cross working group because it doesn't fit in any one of those subsidiary organisations. So we'll work with Automotive, we'll work with Power Electronics, we'll work with the Electronics Skills Foundation, we'll work with the Internet of Things Security Foundation. It was announced on Thursday; there will be a launch event in London and then there will be more public events. The launch event, quite honestly, is to get the key influencers in there to understand. So it'll be aimed at the government, both the civil service and the politicians. It'll be aimed at senior managers in the industry across the UK. And then we'll propagate it down, and there'll be lots of events for the ordinary working engineer. But the good thing about that is that the work we'll be doing will, just like the Internet of Things Security Foundation's, be in the open.
And there wasn't even a question about doing that this time. It was taken as a given, because the success of that approach had been seen. So really, my talk is just an appeal to you: don't just engage with the open source community, engage with the wider engineering community, and try to bring them on board with using open source. And I'm hoping next year we'll come back, there'll be lots of feedback, and this group will have fed into the other groups you've heard around here, will have drawn on what they've done, and will be a useful addition to what's there. As I say, you can get involved with the best practice group; just send William an email. So I'm the last speaker today, so my last slide is nothing to do with TechWorks; it's some thank-yous. So, thank-yous to those here. I'd like to thank Will Jones, who's been in overall charge of organizing this room. I'd like to thank JJ for chairing all day; JJ hasn't taken a break, I tried to make him take a break, but he's indestructible, so he's gone through the whole day. Michelle from the Nagara, Jonathan and Stefania, who I think has had to rush off, for all their work from the European network on AI safety. And those four people: I should say there were four submissions to do an AI dev room, and we've put all four submissions together. So you've got the best of four possible dev rooms you could have had, all rolled into one. But the most important people making this a success are all of you. We've had tremendous interaction. I've not been in all the talks, but when I have, it's been great to have that. So thank you very much, and of course we'll see you all next year. Thank you.
Introduction to OpenAPI
Good morning everybody. Thank you for being so patient. I don't think I've ever had a full room with 24 minutes to go before the start of my talk before, so that is a very special experience; thank you for sharing it with me. I unmuted, but thank you for checking. So I am going to talk to you today about OpenAPI. I'm going to try to give you something new that you could maybe take back and try, whether you haven't seen this before or whether you're just looking to level up your game a little bit. My name is Lorna. I work for Redocly; I'm VP of Developer Experience there. I love APIs. My background is in software engineering; I've been a developer for most of my career. I've built APIs, integrated with APIs, worked for API producers, done API consultancy. Now I build the API tooling. It's, yeah, look, it's a thing that I enjoy, and I'm happy that you are all here to share it with me. So let's start by talking about OpenAPI. I know a lot of people raised their hands, but maybe it's new to some people. OpenAPI is an open standard. It's a way of describing your HTTP APIs in a format that aims to be both human- and machine-readable. What's nice about that is that when we use a standard format, everybody uses the same format. And when that's an open format, it's developed in the open. You can be part of that development process, and I'll talk a little bit more about the OpenAPI community at the end. You can see what's coming. You can join the meetings. You can follow the issues on GitHub. If you are using OpenAPI as a producer, as a consumer, or if you make tooling for OpenAPI, there are no surprises: you know what's coming and you can be part of it. So it really improves our confidence in working with it. I think the most difficult thing about working with OpenAPI is that it's just very verbose. It takes a lot of lines to describe what can be quite a simple thing.
So I'm going to start by talking a bit about the structure of OpenAPI, because I think when you can find your way around, when you understand the map, it's much easier to work with. So this is a representation of the things that you will find at the top level of an OpenAPI description. OpenAPI: which version of OpenAPI is this? Info: a bit of metadata about the API that this description describes. Here you'll find the title, probably some license information, some contact information, the version that we're on; all of that is in the info block. External docs: it's very easy. You publish a nice developer website, and you link to your API reference docs. If the user arrives on the reference docs, maybe from a search engine, is there a link back to that nice developer website that you made for them? Check, because I feel like I've put this link on everything I've ever worked on. There is a security section, and that will describe the authorization and authentication mechanisms that are used by the different endpoints in your API. We've got a servers section: where is this API published? Tags allow you to attach metadata to individual endpoints; they're listed at the top level and then you can just use them where you need them. The paths section is where the real API documentation actually happens. This is what we think of as API docs: we have an entry for each endpoint describing what it does, the parameters that it accepts, how to shape the request, and the response or responses that you can expect back. You'll also find webhooks here. Where you have an API that, as well as receiving requests and returning responses, sends you a request when something happens, you can describe that with webhooks. They're a little bit different to the request-response feature. Those were added in 3.1, which, although it is the newest version of OpenAPI, is three years old, so I wouldn't describe it as cutting edge. We also have here the components section.
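In miniature, the top-level map just described might look like this in Python; the title, URLs and tag names here are invented for illustration, but the keys are the real top-level fields of an OpenAPI 3.1 document.

```python
description = {
    "openapi": "3.1.0",                                  # which version of OpenAPI this is
    "info": {                                            # metadata about the API
        "title": "Example API",
        "version": "1.0.0",
        "license": {"name": "Apache-2.0"},
    },
    "externalDocs": {"url": "https://developer.example.com"},
    "servers": [{"url": "https://api.example.com/v1"}],  # where the API is published
    "security": [{"apiKey": []}],                        # auth used by the endpoints
    "tags": [{"name": "user"}, {"name": "account"}],     # metadata to attach to endpoints
    "paths": {},                                         # the real API docs live here
    "webhooks": {},                                      # outbound requests (3.1 and later)
    "components": {},                                    # reusable schemas, parameters, ...
}

top_level = set(description)
```

In a real description, `paths`, `webhooks` and `components` would of course be populated; the point is just the shape of the map.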
The components section allows us to describe things that we're going to use multiple times. If you use the same filtering, pagination, or date format, if those are common patterns across your API (I mean, if they're not, we need to talk), you can define them in the components section and reuse them. The sections can go in any order, but knowing where you are and where the other things are that you might need can make these very long documents navigable. OpenAPI descriptions are often thousands or tens of thousands of lines of code. My favorite test API description to use is the GitHub one; it's a quarter of a million lines of YAML. Like, yeah, you need to know where you're going. Your tools can help you, but it's like carrying something that's not exactly heavy, just a bit unwieldy. So let's drill into some of the detail. Here is basically the top part of your OpenAPI description. We have a version; it's not very exciting. We have an info section. We've got a title: give your API a unique and meaningful title. We have summary and description. A lot of OpenAPI elements have these two texty fields, the summary and the description. The difference: the summary is just text, it's short format, and it's usually shown in a listing. The description supports Markdown, specifically CommonMark, and it's usually shown when we're looking at the detail. So if your API is shown in a catalog or in a list, it'll use the summary, and if you are viewing the API reference documentation, you'll probably see the whole description. And don't be afraid to use the Markdown features for links and to really enrich what you do within your OpenAPI file. There's an info version field, and I think this is one thing that I see people getting confused by frequently. Info version is the version of the API description. So if you change this description document, you should change this version field.
Does your API info version need to match your API version? I don't really care. But if you change your description a lot, can you please bump the info version so that I know I don't have the latest version of this document? Lock it to your API version if that helps, or don't. Maybe you haven't made any API changes, but you did add great descriptions, better examples or something else that changes the OpenAPI description of your API. Bump the version so I know I need to get the new one. And please add a license. Yeah. So this is a nice fluffy rendering; I made this with Redocly, and I hope that you like it. I think it's just easier to look at than the real thing. This is the YAML version. I can do 10 screens of YAML and I will be having a nice time, but I don't know if you will be having a nice time, so I brought you some pictures. But this is the equivalent of seeing it in YAML; now imagine another 20,000 lines and you're starting to visualize how this thing looks. Okay, let's look a little bit at the paths. Within the YAML paths section, we have one block for each combination of URL and verb or method. So I have one that is the item endpoint, and it's got a get operation. I've got another one (I'm really good at naming things) called another URL, which has both get and post. Those are different operations; they get their own description. If we drill into one, it has an operation ID. Fun fact: operation ID is optional in OpenAPI. It's technically optional. Honestly, you need it, and it needs to be unique. Just get your linting to put that in. There are very few APIs where this isn't a useful thing to have, and it's not like it's painful to do. We've got a description; you'd probably have a summary as well, it won't all fit. I have added some tags to my endpoint: this is related to user and accounts. We might have user and orders or some other combination of tags here. You can have multiple tags.
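The shape just described, one block per URL, one operation per method, each with its own operationId and tags, can be sketched like this; the `/items` endpoints and the IDs are invented for the example, and the uniqueness check at the end is the kind of thing linting enforces.

```python
paths = {
    "/items/{itemId}": {
        "get": {
            "operationId": "get-item",   # technically optional, practically required
            "summary": "Fetch one item",
            "tags": ["user", "account"],
            "parameters": [
                {"name": "itemId", "in": "path", "required": True,
                 "schema": {"type": "string"}},
            ],
            "responses": {"200": {"description": "The requested item."}},
        },
    },
    "/items": {
        "get": {"operationId": "list-items",
                "responses": {"200": {"description": "All the items."}}},
        "post": {"operationId": "create-item",
                 "responses": {"201": {"description": "The created item."}}},
    },
}

# Collect every operationId and confirm they are unique across the API.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head", "trace"}
op_ids = [op["operationId"]
          for path_item in paths.values()
          for method, op in path_item.items() if method in HTTP_METHODS]
all_unique = len(op_ids) == len(set(op_ids))
```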
If there were request body requirements or parameters, those would be described here as well. And then we've got the responses. I've only got the 200 response here, which is very bad; you should always describe your 400-level error responses. I've got a 200 response here; it's application/json and it's just got a couple of fields in it. I'm going to drill into that in more detail. It's the same endpoint, in more detail, shuffled down a little bit. In my response, you can see (maybe you can't see, actually, because the font is quite small) that this schema has a message and an event ID. I've got data types, I've got descriptions, and crucially I've got examples here. The examples are the magic, because they let the user know what kind of data this will be. You can tell me it's a string, but if your example is, I don't know, a UUID, I'm like, oh yeah, I know what that is. If you show me it's my username, or you show me it's an ID, okay, I am just instinctively going to put the right thing in when I'm using those tools. If you use the same fields in other places, and it's becoming increasingly standard that even if you're not reusing them, you'll often use the OpenAPI reference syntax to refer to them being stored somewhere else. So instead of defining each of the objects or elements of the response payload, you just use a reference, dollar-ref, to refer to that description, and put the description in the components. So your path entry looks like this, and then we have that detail down in the components section under schemas. This gives you very powerful reuse. The key to API experience is consistency, and so the reuse helps us to just, without thinking, get it right, get it the same, get it consistent, and avoid having similarly named fields that might take different timestamp formats, or look identical but validate differently because our back-end application didn't understand that they were the same thing.
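The dollar-ref mechanism described here is just a pointer into the same document; a minimal resolver for local `#/...` references can be sketched like this (the `Message` schema and the endpoint are invented for the example):

```python
doc = {
    "components": {
        "schemas": {
            "Message": {
                "type": "object",
                "properties": {
                    "message": {"type": "string", "examples": ["created"]},
                    "eventId": {"type": "string", "examples": ["evt_123"]},
                },
            },
        },
    },
    "paths": {
        "/things": {
            "post": {
                "responses": {
                    "200": {
                        "content": {
                            "application/json": {
                                "schema": {"$ref": "#/components/schemas/Message"},
                            },
                        },
                    },
                },
            },
        },
    },
}

def resolve(ref, root):
    """Follow a local '#/a/b/c' reference down the document tree."""
    node = root
    for part in ref.lstrip("#/").split("/"):
        node = node[part]
    return node

schema = resolve("#/components/schemas/Message", doc)
```

Real tools also handle references into other files and escaped JSON-pointer characters, which this sketch skips; the point is that one schema definition can back any number of `$ref` sites.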
So that's the structure of OpenAPI, but I really felt, when I created those slides, that I was missing the magic: the thing that brings me to this and makes me believe in OpenAPI as the powerhouse of our modern application development. When I think about OpenAPI, I think about the things that I do with it and the things that it enables. Think about the way that you design your API, giving meaningful operation IDs for each endpoint; these can be used by the tools that consume your API description. Having great descriptions, naming things in such a way that developers don't need to come and read your documentation, because they will know from the operation ID what it's going to do, and it's very consistent; they feel at home. You describe your error responses. Even if I never publish my OpenAPI description, the fact that I wrote down the error responses makes my API better, because I thought about what I wanted to do if something went wrong. I can validate my API, make sure that my OpenAPI is valid and at the standard that I want, and I can have my own linting rules as well. Operation ID is optional. Why? Not in my APIs. So I write my own rules. I say we use kebab-case here, we use plurals here, we always define an error response, we make sure that our examples match our media types. These are the things that you can add with additional linting rules. We can create documentation; that's great, you have an API, you should probably have some docs for it. We can also allow other people to pull the OpenAPI description and generate their own docs, and keep it locally for reference. I have some accessibility needs: with web-based API documentation, I can just generate something that works for me from the OpenAPI locally. It's ideal. Beyond this sort of entry level, there are some more things that I think we are not doing enough of with OpenAPI. You have an API, you describe it with OpenAPI, you lint it, you generate some docs.
This is great. Please do these things; you are all awesome. The next level is how you deal with very complex API setups. If you work in a large organization with many microservices, how does that pipeline look? How do you keep them all meeting the same standards? How do you bring them together to publish to the user as if you knew what you were doing? I don't mind whether you do or not, but you need to look like you do. How do you bundle those things together? If you have one enormous OpenAPI description, how do you collaborate on that when you are making changes, whether you are an API experience specialist, a product owner, an engineer or a tech writer? To give you a clue: that GitHub file is not maintained as a single quarter-of-a-million-line YAML file. Look at how you manage your files. What do you do with references? How do you split across manageable file chunks? Then how do you bring that together to ship downstream? Finally, what do those downstream tools look like? A lot of organizations come into OpenAPI because they want documentation. This is the beginning: we don't want to write a whole load of words, we just want to describe it once with OpenAPI, and then we can generate some documentation, and we can generate it in different ways. Then, for free, you start being able to get all these other benefits. You can generate some client SDKs. You can even generate your server stubs if you want. Lots of tools will automatically integrate with your API if you have a good, standard OpenAPI description, so your API gateways and other integration platforms will just take it. But you can also start to look at how you describe sequences of API calls, how you test your API, and what a mock server looks like, because you've described this API in so much detail that a tool can pretend to be it very easily. So there are a lot of pieces here that make up the ecosystem. OpenAPI is kind of the seed from which the rest of the tree grows.
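A tool can "pretend to be" the API precisely because the examples are right there in the description. Here is a toy mock that answers from example values, assuming a schema that carries `examples` inline; every name in it is invented for the illustration.

```python
def mock_response(doc, path, method, status="200"):
    """Build a fake JSON body from the examples in an OpenAPI description."""
    schema = (doc["paths"][path][method]["responses"][status]
              ["content"]["application/json"]["schema"])
    # Answer with the first example value of every property.
    return {name: prop["examples"][0]
            for name, prop in schema["properties"].items()}

doc = {
    "paths": {
        "/things": {
            "post": {
                "responses": {
                    "200": {
                        "content": {
                            "application/json": {
                                "schema": {
                                    "type": "object",
                                    "properties": {
                                        "message": {"type": "string",
                                                    "examples": ["created"]},
                                        "eventId": {"type": "string",
                                                    "examples": ["evt_123"]},
                                    },
                                },
                            },
                        },
                    },
                },
            },
        },
    },
}

body = mock_response(doc, "/things", "post")
```

A real mock server would also validate the incoming request against the description and fall back to generating values from the types when no examples exist; this sketch shows only the happy path.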
For me, this is the magic. It's the interoperability. It's the way that we come back to: maybe we generate some OpenAPI and it's terrible, so then we use overlays or decorators to add all the descriptions and examples. And maybe not all of these endpoints are public yet, so we just filter out the public ones to make the final OpenAPI and generate some docs. Maybe only some of them are available in the SDK, so we filter differently, make a new OpenAPI file, and pass that down to the SDK generation. Maybe the next generation of your client SDK has some new functionality; well, you start with the same source file or files and bring that together. So it's all about not just how you generate docs, but how you create your OpenAPI. I don't have time for my design-first rant, so I'm going to try and hold that in. However your OpenAPI comes into the picture: how do you maintain and manage it successfully? How do you ensure the quality of it? How do you transform it and get it ready for all the outputs that you choose? There's just so much in this picture. Let's talk about some tools. Now, I've just linked openapi.tools here; I'm not making any specific tool recommendations, and that's for two reasons. One, this is a really hot area. There are new tools every week, and there are different tools for different tech stacks. When you are ready for a new tool, on that day and no sooner, you should go and look at the list and pick something. The second reason is that I work for a tools vendor. I work there because I use their tools; I cannot possibly give you an impartial recommendation. I went to Redocly because they know me and I know them, and I really don't know the other tools that well as a result. So don't listen to me for specific tools; I work on the Redocly stuff and I love it. You need an editor. There are basically two ways to go. You can use a programmer's editor, something like VS Code; please add some plugins to help yourself. Redocly makes an OpenAPI plugin.
Even if you just have some syntax highlighting for YAML, the kind that makes the indentations a different color helps me a lot in YAML. Find something that works for you. There are some graphical editors, and if that's your thing, then go find one of those. You don't need to pick the same as your team, because it's an interoperable format; you use whatever you want to collaborate. Try really hard not to lock your team into tools. Again, accessibility needs: I need to do it in Vim, and of course I can. That's part of the magic. OpenAPI governance, which is clearly not a tool, but let's skate over that. Your API standards do not exist until you write them down. They are not standards until they exist somewhere that somebody else can look at them and they are consistently enforced. We have a lot of really good linting that can really help you, but the humans are always going to be in this review process. Find your most wise and thoughtful humans and invite them to be part of the review process. Naming is the thing that the machines genuinely cannot do for us, along with the joined-up thinking of being able to see things next to each other. As you introduce API standards, start small. Do not be tempted by other people's recommended rule sets, not even ours. Pick what works for you. Look at the recommended rule set, but then pick the things that you aspire to and can adhere to today, and commit to reviewing every six months and building up the quality of your API. If you're retrofitting standards to an existing API, there will be things you cannot change now, and that's okay, but you can set those rules for the new versions. If you don't know where to start on this, I am going to recommend Zalando, who have some brilliant public API standards, and you could do worse than... okay, they have a lot. Start small; just pick your favourites out of theirs. It's a great place to start, and your organisation will evolve as it goes along. Please put some linting in.
The machines are genuinely good at this; they can help keep you straight. Is your OpenAPI valid? Does it have descriptions? Does it have examples? I've got one team that I work with where we have a whole API where the description for the success response is "Okay" with a full stop, and it turns out we enforced sentences, so it has to be at least one word and at least one full stop. Yeah, we did some work with them on that. Get some case conventions, some naming conventions, and be really picky about what you include. I do this with Redocly CLI, so if you are using that, feel free to send me questions. If you use something else, I can't answer your questions, but good luck. OpenAPI documentation: read the docs for your docs tools. I see a lot of implementations where functionality exists in the tooling that you've used, but you haven't really dug into what it can do or looked at how you can extend or configure it. API reference documentation is evolving very quickly, in a good way; there are a lot of new entrants in this market. I'm not sure if I'm supposed to be saying that we have a new product coming out later in the year that does this. It's beautiful, but you have lots and lots of options. Whatever you've picked, make sure you're making the most of it. And if you have something that's, oh, I don't want to malign any other tool families, but something which isn't specialist docs tooling but can render documentation, that's a great way to start. But because you have the OpenAPI format, you can use one tool set for one thing, something else for docs, something else for your SDK gen; lots and lots of options. When you publish documentation, your documentation is part of the product. You should be deploying it often; it should be easy to deploy and redeploy. And make sure that you're treating it like a web product: get some metrics, have a look at what's happening, see what people run into.
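House rules like the ones mentioned here (operation IDs present and kebab-case, descriptions that are real sentences) are easy to express as small checks. This is a toy linter to show the shape of such rules, not Redocly CLI or its configuration:

```python
import re

KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def lint_operation(op):
    """Return a list of rule violations for one operation object."""
    problems = []
    op_id = op.get("operationId")
    if not op_id:
        problems.append("missing operationId")
    elif not KEBAB_CASE.match(op_id):
        problems.append("operationId is not kebab-case")
    # 'at least one word and at least one full stop'
    if not re.search(r"\w.*\.", op.get("description", "")):
        problems.append("description is not a sentence")
    return problems

clean = lint_operation({"operationId": "list-items",
                        "description": "List all the items."})
messy = lint_operation({"operationId": "ListItems", "description": "ok"})
```

In practice you would run rules like these over every operation in the description in CI, so a pull request that violates the house style fails before review.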
If you have interactive docs, what are people calling the same endpoint all the time for? Is it super popular, or is it super confusing? Why is everyone here testing this thing? Have a look at those metrics, because they can really help you understand your product. I want to talk a little bit about the OpenAPI community. This is something that I don't always include in my technical OpenAPI talks, but at FOSDEM, it feels appropriate. It's an open standard; it's part of the Linux Foundation. You can learn more about it on openapis.org. The GitHub repository is public; everything happens there. We have a Slack group; it's very active, and also public to sign up. And there's a weekly technical meeting. I will confess, it's not super friendly for Europe. I think it's 6 p.m. Central European Time, 5 p.m. for me in the UK. Yeah. I'm trying to get to a critical mass of EU-based maintainers, and then we need to start mixing that up. If it's unfriendly for Europe, it's sort of dinnertime, and there's no hope at all for anyone east of here. So yeah, we need to fix that. But the OpenAPI community is currently growing its maintainer set and working on some new stuff; this is a good time to get involved. We've also spun up some special interest groups, so just to tease some of the headline activities within the OpenAPI project: the Workflows special interest group describes a sequence of API calls. This has come from the travel industry, where you need to find the flights, find the seats, ask the user, book a seat. None of those make sense by themselves. Workflows aims to give an extra level of description for that. Overlays is a special interest group that describes repeatable modifications to an OpenAPI description. So if you have a generated OpenAPI that is just thin, because you don't maintain good examples and good descriptions when you're generating from code, and lots of organizations struggle to get away from that Javadoc-style workflow.
Overlays can help for now: you can take your OpenAPI and make the same changes every time, to make the descriptions better, add examples, hide things, whatever. OpenAPI 4.0, code name Moonwalk. Why? Don't ask. Don't let engineers name things. OpenAPI Project Moonwalk is committed to doing some sort of release this calendar year, so that is just starting. The high-level goals are to give you a really simple upgrade from 3.1 upwards (from 3.0 you might want to go to 3.1 first), and to include a wider range of HTTP APIs. OpenAPI is amazing for RESTful APIs, and okay for some other HTTP-ish, RESTful-ish ones. Moonwalk will include RPCs and a wider family. So if you've struggled with OpenAPI, have another look in about a year. Yeah, OpenAPI: an open standard for API descriptions. If you're not using it, I hope you will now, or at least feel like it's a thing that you can approach. If you are, maybe I've given you some ideas to go back and look at what you might change in your current workflow. I'm going to leave you with some resources and say thank you very much for your time. Okay, I'm allowed to take two questions. Would anyone like to ask a question? Yes. This is a really good question: how do I feel about generating OpenAPI from code, or code from OpenAPI, both ways? Let's start at the beginning. A lot of organizations generate OpenAPI from their back-end server-side code. I don't like it. And the reason I don't like it is that when you go code first, you're missing a design step. When you design first, you're thinking about it in the context of the rest of the API. You're more likely to get the naming right the first time, because the implementation is not done by an engineer by themselves. So ideally you design first: you propose the change to your OpenAPI with a pull request, your wise people and your amazing linting review it, you go a few iterations to get it perfect, and then you build it. That's my ideal, and that's why I prefer it.
The other question: generating code from OpenAPI? Yes, go for it. We have this machine-readable description and there's a lot of boilerplate, so we can go quite a long way towards things like client SDKs from OpenAPI. When I talk about the transform step, where you take an OpenAPI and make it better: for docs, you're going to add examples and descriptions; for API gateways and SDK code gen, that sort of thing, you're going to add metadata. You're going to give the type hints that the specific programming languages and tech stacks need, and you're going to give extra information. You might not have that at design time, but if you think of it as a pipeline that splits off, you might want to add some extra magic to your standard OpenAPI to enhance it before you generate code from it. But generating code is typically fine. It will only be as good as your description is, and lots of those fields are optional. So, cool. I am out of time. Thank you so much, everyone. I hope to see you during the event.
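The transform step described here, and the Overlays idea from a few minutes earlier, can be sketched as a function that replays the same enrichment actions over a freshly generated OpenAPI document on every build. This is a simplified illustration, not the OpenAPI Overlay specification: real overlays target nodes with JSONPath expressions, while this sketch addresses nodes with plain key lists.

```python
def apply_overlay(openapi, overlay):
    """Apply a list of repeatable actions to an OpenAPI dict. Each action
    names a target node as a list of keys into the document, then either
    merges an 'update' dict into it or removes it. A toy stand-in for the
    OpenAPI Overlay specification's JSONPath-based targeting."""
    for action in overlay["actions"]:
        *parents, last = action["target"]
        node = openapi
        for key in parents:
            node = node[key]          # descend to the parent of the target
        if action.get("remove"):
            node.pop(last, None)      # e.g. hide internal fields
        else:
            node.setdefault(last, {}).update(action["update"])
    return openapi

# Enrich a thin, generated document: add a description, hide a vendor field.
doc = {"paths": {"/pets": {"get": {"summary": "x"}}},
       "info": {"x-internal": True}}
overlay = {"actions": [
    {"target": ["paths", "/pets", "get"],
     "update": {"description": "Lists pets."}},
    {"target": ["info", "x-internal"], "remove": True},
]}
enriched = apply_overlay(doc, overlay)
```

Because the overlay lives in version control next to the spec, the same enrichment is reapplied every time the thin document is regenerated from code.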
Deploy Fast, Without Breaking Things: Level Up APIOps With OpenTelemetry
with the topic. It is a very big mouthful of a topic today, but I'm hoping that we're going to break this down for you and that you're actually going to learn something that you can take home to implement yourselves. I'm here just to talk about the OpenTelemetry part. Sonja is actually the brains of this operation: she's basically been planning this whole thing, set everything up, and just invited me at the end because, yeah, because I'm pretty. That's basically all that I'm contributing today. So I am hopeful that a lot of you have had some kind of touch with OpenTelemetry and observability in general, but also that you know the basic DevOps principles and how that is going to be connected with APIOps. Just an introduction for both myself and Sonja. I am Adnan. I do developer relations, as you obviously might have already figured out. And Sonja here is a product manager at Tyk, and I would like to hand over the microphone. Yeah, hi. I'm a product manager at Tyk. So we do API management. We have an open source API gateway. If you were in the session before this one, you have seen it on the screen. It's an API gateway that's written in Go. It's really fast and has lots of capabilities, so do check it out. And now we are happy to talk about the topic. Cool. Just a quick rundown of the agenda for today. We have four main topics. First and foremost, we're going to talk about APIOps: what it is, how you can get started. And then from there, we're going to take a closer look into how to do APIOps hands-on. So we're going to start with a Kubernetes cluster. We'll walk you through how to use Argo CD and Tyk for your API gateway and basically just enable very fast flows and very fast deployments and release cycles for your APIs. From there, we're going to move into the production environment. So we're going to say, okay, what do I need to do to get observability, to get insight into my production APIs?
And from there, we're going to shift left even more and figure out how to integrate the release cycles and give them, I'm going to say, integration testing as well. So we're shifting left even more, using the production data, the observability data, for testing as well. That's going to be, I'm going to say, my most favorite part, because I'm here from Tracetest and we do that. But for right now, let's do the APIOps portion first. Yes, so what is APIOps? Thank you. So you might be familiar with API management, and I find that sometimes in API management we have too many manual operations. And as you all know, manual operations are a cause for disaster, a cause for error, a cause for security problems, and we need to speed things up. So here is my interpretation of what APIOps is. You might have heard about APIOps, and some vendors will try to push their own idea of what it is; some would say it's about deploying your API fast. I'd like to bring back a bit of the cultural side of DevOps and say that APIOps is the offspring of DevOps and API management. So it's applying the culture of DevOps to your API management lifecycle. And why? Because you want to deliver value fast without disrupting your users. So if we think back to the DevOps culture, the DevOps principles that we originally had before lots of vendors started trying to sell things with "DevOps" applied to them, it's about fast flow: I want to be able to commit and have it used by users to get feedback, to have that culture of feedback loops. And it's also about enabling a culture of learning: I want to understand what's going on, learn fast, fail fast, and be able to provide value to my users. And we're here today to tell you that we think observability is a key enabler for all of that in API management, or APIOps. So let's take a look at how to implement APIOps in modern Kubernetes environments to get fast flow.
So typically you will have a developer that's building a service, and you will have things like an OpenAPI specification along the way. We had a talk in this room earlier about OpenAPI, so I'm not going to go into more detail, but it's definitely a part that you have to take into your CI, into your continuous integration, making it all automated. Today we're going to talk a little bit more about the deployment side, which is why we haven't added it, but of course things like linting and generating documentation should all be part of your process. So once the developer commits something, it goes through CI, continuous integration, and the result might be a Docker container, which gets published. And now we want to deploy that new version of the service, together with its API specification. And for that in Kubernetes, the new way of doing continuous deployment is to use GitOps. There are projects like Argo CD or Flux that are able to do GitOps. What does GitOps mean? (You're lucky you're really pretty.) Okay. So the main thing about GitOps is that you don't have a continuous pipeline that pushes things and deploys to your server. Instead, the Kubernetes cluster, with something like Argo, pulls the information and deploys it itself. So how does it look? At the end of your CI pipeline, you make a change to your deployment repository. You have code artifacts for all your changes, all the configuration, and you might have a new version that is placed into staging. And Argo CD on your Kubernetes cluster can be configured to automatically pick it up and deploy it. So it's all automated. Now there's another thing that you need to expose an API: an API gateway. In this example, we are using the Tyk API gateway for authentication, verification, monitoring. So we add an open source API gateway to that, and that's going to be interesting for the observability part later too.
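The pull-based idea is the heart of GitOps: an agent in the cluster compares the desired state declared in the deployment repository with what is actually running, and converges toward it. A toy reconcile pass, with services modelled as name-to-version maps (nothing like Argo CD's real data model), might look like:

```python
def reconcile(desired, cluster):
    """One pass of a GitOps-style reconcile loop. 'desired' is what the Git
    deployment repository declares; 'cluster' is what is currently running.
    The agent pulls the desired state and converges the cluster toward it;
    nothing is ever pushed from CI. Returns the actions it took."""
    actions = []
    # Apply anything that is new or has a different version in Git.
    for name, version in desired.items():
        if cluster.get(name) != version:
            actions.append(("apply", name, version))
            cluster[name] = version
    # Prune anything running that Git no longer declares.
    for name in list(cluster):
        if name not in desired:
            actions.append(("delete", name))
            del cluster[name]
    return actions

cluster = {"payments": "v1", "legacy": "v3"}
desired = {"payments": "v2", "orders": "v1"}
taken = reconcile(desired, cluster)
```

Running the same pass repeatedly is what prevents configuration drift: any manual change to the cluster is overwritten on the next sync, and the Git history doubles as the audit trail.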
So an API gateway helps you to centrally manage your APIs, to do authentication, authorization, rate limiting, all the capabilities that you need in operation. How do you add that the Kubernetes and GitOps way? Typically with custom resource definitions, as is the way in Kubernetes. So you can add things, and in a very, very simple one you can say which protocol it uses; you could define things like rate limiting, security policies, which service it is proxying on your cluster. And again, it's configuration as code, so it lives in a central repository, and when you make changes to your deployment configuration repository, something like Argo CD will track it and apply it automatically. So what you see at the end in your Argo CD application is: okay, all my application definitions, all my applications are synchronized automatically with whatever I put into my Git repository. So now we have the first step, right? We have automation for fast flow. We are preventing configuration drift. We have enhanced security. All is automated, no manual error, we are more efficient. We also have an audit trail, so we see exactly what was changed in the deployment of your APIs. And we have better collaboration and visibility on what's happening. Wonderful. And obviously, as the slide says, that is not enough. So we've got the automation part down; what do we do next? Step three in the whole process is to get additional feedback into your feedback loops so you can connect both Ops and Dev correctly. What this means is that the Ops team needs to enable the Dev team to fix issues by knowing exactly what the issue is, so that the Dev team doesn't need to spend useless cycles trying to figure out what the problem is. And we do that by using OpenTelemetry and Jaeger, which are observability tools, within our APIOps pipelines. Now, this is exactly what we don't want.
We don't want to see gears turning and hope it's all fine, because it's not really fine. You don't know what your users are seeing, so we don't really know if our users are happy; we just kind of know it works. And then you kind of do prayer-driven development, as I like saying; that's not really what we want. We want to use observability to infer the internal state of our system, by getting telemetry out of our system, to understand what's actually happening. And then we can figure out whether our users are happy. Because this is something that we can see by using observability with distributed tracing: when our API exposes telemetry, we can actually see, oh, okay, obviously something is wrong, because we have breaking APIs. So it's pretty obvious that our users are unhappy, because we can see things breaking for them. And this is the kind of view that you get by using Jaeger. Now, let's get to the fun part of actually showing you how it all works and how you can set it up yourself. The way you do it is you use CNCF observability tooling, so tooling from the CNCF tracing landscape, more specifically OpenTelemetry and Jaeger. OpenTelemetry is an incubating project; Jaeger is a graduated project. So they're both fully open source and supported by the CNCF. Now, the specifics are that you use OpenTelemetry as the open standard (we're very focused on open standards for the whole dev room today), so once again, it's an open standard to generate, collect and export your telemetry. Remember that part: it's a bunch of libraries and APIs that help you generate, collect and export telemetry. Now, where do you export it to? Well, you export it to Jaeger, which is a tracing backend, basically a data store for your distributed traces. And then you use Jaeger for all of your production monitoring and troubleshooting and whatever else you need to do in your production environment.
Now, from this, one of the bigger issues is that OpenTelemetry is quite hard to implement if you're new to it. So some vendors like to bake it into their systems. One such vendor is... there was a lot of suspense, right? Yeah. So one thing that we did in Tyk is to add native support for OpenTelemetry, because we know that people who work in the API space use API gateways to proxy multiple services, and the developers might not yet have implemented OpenTelemetry. But we know they need one place to report the data on all the APIs, to really have visibility on what's happening. So we added native support for OpenTelemetry in Tyk, to enable our users to export this data and to capture it automatically for all their APIs. That just needs a couple of settings; these are the settings for our Helm charts. So where do you need to enable it in Tyk? You need to say where you want to send the data: to an OpenTelemetry Collector, or it could also be directly to an observability backend. And this is what you get: for every API request, you get a distributed trace of what's happening from the gateway to the upstream service. So first of all, you can see any error that's happening already at the API gateway level: authentication errors, rate limiting. We sometimes see people only monitor what's happening on the service, and they don't realize they're already missing a lot of people having issues with authorization, authentication, rate limiting. And then you see what's happening in the upstream. So you can very, very quickly catch errors and understand not only the timing and the HTTP response code, but really what's happening: if there's an error, if something is slow, where is it happening? Is it on the API gateway, is it on the upstream service? What are the details of the transaction that enable a team to better troubleshoot the issue? And with that, we have now achieved feedback from production.
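The "is it the gateway or the upstream?" question is exactly what a trace answers. As a toy illustration of what a Jaeger trace view shows you visually: the failing span that has no failing children is where the error actually originated. Spans here are plain dicts with hypothetical fields, not the OpenTelemetry data model.

```python
def find_root_cause(trace):
    """Return the names of failing spans that have no failing children:
    the place in a distributed trace where an error originated (API
    gateway vs. upstream service vs. database). Spans are plain dicts
    with hypothetical 'id', 'parent', 'name' and 'status' fields."""
    failing = [s for s in trace if s["status"] == "error"]
    parents_of_failing = {s["parent"] for s in failing}
    # A failing span that is not the parent of another failing span
    # is the deepest point of failure.
    return [s["name"] for s in failing if s["id"] not in parents_of_failing]

# Error propagates gateway <- upstream <- database; the database is the cause.
trace = [
    {"id": 1, "parent": None, "name": "api-gateway", "status": "error"},
    {"id": 2, "parent": 1, "name": "auth-middleware", "status": "ok"},
    {"id": 3, "parent": 1, "name": "upstream-service", "status": "error"},
    {"id": 4, "parent": 3, "name": "postgres-query", "status": "error"},
]
cause = find_root_cause(trace)
```

This is the information a metric alone cannot give you: a counter tells you errors went up, while the trace tells you which team needs to act.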
So we have a healthy development lifecycle with a feedback loop between Dev and Ops. If there's an issue, then the Ops team can report it, can take a look. It's not only an error on a metric that goes up; it's really a trace where you understand where the problem is and know which team needs to act on it. And it enables you to provide a better user experience and fix issues earlier. Again, what we have achieved: feedback from production. We are no longer relying on users reporting issues, no longer on somebody who calls support and says, oh, I have a problem, something is down. No, you see it, you see it all, so you can be proactive. You understand the API performance, you understand really what's happening, where the error is happening, and you can solve issues faster. And with that suspenseful mic switch: again, it's not enough. So we need to introduce another layer of... actually this one, no. We need to introduce another layer of protection. Because right now, we're only stopping bugs after our users have seen them. So we know exactly that a user saw a problem that broke our API, and then we're rotating back to fix it. We need to be more proactive and figure out how to stop the bugs before they even reach our users. Now, this is a shift-left-even-more approach, but actually for you guys it's a shift-left-even-more approach. Because we want to add observability to our release cycles as well, not just our production systems. So the way we're going to go through that is by doing this little squiggly in between as well. This basically means that you need to implement something called trace-based testing, which is also called observability-driven development; if you like Honeycomb and their CTO, it's a term that they coined. Okay.
Anyway, the way you use trace-based testing is that you're quite literally using the distributed traces that your observability, like OpenTelemetry, exposes, and then you're running tests on those actual data points from your infrastructure. So that means that even though we can see that we have our gears turning, that's awesome, and my initial connection to that API gateway is returning 200: how do I know this is not broken? How do I know if this is on fire or not? This is an external service; I don't manage it. So this is something that easily breaks and that you don't really have a lot of control over. Now, let me show you how you can actually get to the state where you can do your testing against the distributed trace itself. This is a screenshot from Tracetest, which is also a CNCF tracing landscape tool. You can build your test by getting the trace itself from Jaeger, and then you're writing your test specs directly against trace data. So you're not using any mocking, you're not using any faking or whatever the word is that the kids use nowadays, I don't even know. You're literally getting the actual data back and running your tests against that data. Now, the magical part here is that you can quite literally test against anything that exposes telemetry. It can be an API gateway like Tyk, it can be databases like Postgres, it can be caches like Redis; it can be pretty much anything that you have instrumented to export traces. Now, there is a really cool use case for authentication as well, but also for GraphQL. For authentication, you have a very good example: something like an OAuth flow, where you have multiple services talking to each other to handle the request; that's one of the really cool, useful examples. And something that I've noticed as well is GraphQL. One thing about GraphQL is that it often returns a 200 even though it's failing, because the actual error is within the response.
So you don't really know; it's very intricate to test that. One thing you can do with trace-based testing is drill down to the actual middleware that handles that in your API gateway, find the exact error that happened, and then run your test spec on that exact value. So with all of this, we're getting step one, which is functional testing. We can actually functionally validate the behavior of the system by using all of the telemetry that you've implemented in the prior step to make your production environment reliable. But it doesn't really stop there. We also have step two, which is performance testing, because every span has a duration. You can quite literally go in and say: I want the duration of this span to be less than some value, 200 milliseconds or something. Which means that if you have external services, external APIs, upstream APIs that you're not in charge of, and their performance is bad, you can validate that, and you know exactly what part of your system is misbehaving. So that is the performance aspect as well: you're getting basically two things from one, I'm going to say, exercise. Now, I'm going to walk you through quickly how you do it. You do this shifting left with Tracetest, which is, as I said, open source and part of the CNCF tracing landscape as well. What it does is quite literally give you the infrastructure, actually the distributed system architecture, by looking at the trace data. And then you can both get the overview of what your system is doing and run tests against exactly what's happening in your system. Those are two powerful things, because as engineers it's very hard to know what the system is doing if it's highly distributed with a lot of microservices, especially if you're a new person on a team; it's just a pain to do that. But with Tracetest, I want to show you how you can implement these integration tests in your Argo CD, like right here.
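Conceptually, a trace-based test is a set of assertions evaluated over real span data: select spans, check attributes (functional testing), check durations (performance testing). The sketch below is not Tracetest's actual spec language, just the idea in plain Python with hypothetical field names:

```python
def run_trace_assertions(trace, specs):
    """Evaluate test specs against real span data instead of mocks.
    Each spec selects spans by name, then asserts attribute values
    (functional) and/or a maximum duration (performance). Returns a
    list of human-readable failures; empty means the test passed."""
    failures = []
    for spec in specs:
        for span in (s for s in trace if s["name"] == spec["span"]):
            for attr, expected in spec.get("attributes", {}).items():
                if span["attributes"].get(attr) != expected:
                    failures.append(f"{span['name']}: {attr} != {expected!r}")
            if "max_duration_ms" in spec and span["duration_ms"] > spec["max_duration_ms"]:
                failures.append(f"{span['name']}: too slow ({span['duration_ms']} ms)")
    return failures

trace = [
    {"name": "gateway", "duration_ms": 12, "attributes": {"http.status_code": 200}},
    {"name": "upstream", "duration_ms": 350, "attributes": {"http.status_code": 200}},
]
specs = [
    {"span": "gateway", "attributes": {"http.status_code": 200}},  # functional
    {"span": "upstream", "max_duration_ms": 200},                  # performance
]
fails = run_trace_assertions(trace, specs)
```

Note how this catches the GraphQL-style problem from the talk: the top-level call can return 200 while an assertion on a deeper span (a middleware or upstream) still fails.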
So this is what an integration test in a post-sync hook would look like. You have an API that you're deploying, and you have your integration test, which basically runs as a Kubernetes job from the Argo CD sync hook; then it runs a few integration tests. If they're failing, awesome, you know that they're failing; if they're passing, even better, you see that they're passing. But it doesn't really stop here. The thing that you get with this is also, for every test that fails, a URL to go to that particular test and actually see precisely which part of that transaction failed within your API, within your API microservices. And I really like that part, because this is not just "hey, this failed"; this is "this failed, and here's exactly how, where, and what happened". And with that, we're actually getting to a stage where we're validating our production, but we're also using the effort we put into our production reliability to validate pre-production as well. So you're basically getting the exact same overview graph that Sonja just showed you, but instead of using your end users, you're running tests with Tracetest against the API gateway platform; then you're getting the traces back from your Jaeger or Grafana or whatever you're using, and then that info goes back to the API developer, who can then fix the issues that were found. Now, with this, I'm just going to wrap up everything that we learned from this last section, which is that we got functional testing and we got performance testing. So you can validate the behavior of your system, all upstream and downstream services, API transactions, both the ones that you manage and don't manage. You can test database performance, you can test caches, you can test the size of an HTTP response and request, but you can also do very intricate performance testing by validating the duration of every part of your API.
And with that: I have a saying where I'm from, we say you're swatting two flies with one swing, because I think that's more friendly than killing birds with stones. So yeah, with that, I think this is the closest we can get to being bounty hunters, because we're bug hunters. That was very lame. Anyway, that's a "See you, space cowboy" reference, if somebody caught it. Thank you for making this. And just before we close, I want to say: if this is a topic that's interesting for you, we're running an online API observability conference in February. It's going to be called LEAP, because it's going to be on the leap day. So if that's a topic that's interesting to you, make sure to register. We have lots of people from the API space and observability space that will be coming. We also have a GitHub project with everything from the screenshots that we showed you today; we were working on it as a GitHub example. We don't have a link for it yet, but if you're interested, just reach out to us. Those are our LinkedIn profiles; yeah, I don't like Twitter anymore. So make sure to send a connect and we're happy to send you a link to the GitHub project, so you can try this combination of open source projects all by yourself. Thank you so much. So we have some time for questions. Yeah, there is one over there. Go ahead. Okay, so the question is (I have to repeat it for the video): if I have a service that can be accessed by multiple customers, do I want to send the data to different places, to split them out per customer, or do I want to have just one Jaeger, one OpenTelemetry pipeline? And as always, it depends. On what does it depend? It depends on: do you want to give access to that data to your customers somewhere? Do you have strict regulation on the data of your customers, where you may need to split it by location? But yeah. Yeah, that's a very, very good question.
So the question is: how do I monitor the service level for every customer? Typically, every customer is authenticated, so you have maybe something like a token. So when they come to you, you can put a tag, a piece of information, on the trace, and Tyk will do it automatically if you're using the authorization or authentication from Tyk. So on the traces we put the information on who is calling the API, and with OpenTelemetry you can then use the data to create your own report based on that information. So we add that information on the API call so that you can reuse it for your report. Yeah, it's directly exposed. That's a very good question. It's really important to monitor per customer, because customers have different usage, different patterns, and you want to make sure that every one of them is happy, not just look at an average where you don't really see the problems. Also, the question is whether Tracetest notifies on errors. No, Tracetest is just a testing tool. You would then need something to automate the test, like Argo, and then you need something to alert on failures as well. And then you can pick the alerting tool that you want, whatever you're using right now. You can automate it within your CI, so you can build your CI within Argo, or you can use Tekton, basically whatever CI tool you're using, and then you're alerting on that. So think of it as just integration testing: you get works or doesn't work, then you do whatever else you want to do. Another question: observability data and privacy. I can take that one. So the question is: how do you deal with data privacy? Because in the observability data, a lot can land that could be considered private data.
So first, you have to be very aware that observability data could potentially contain data that in your country, under your own regulation, could have some impact. OpenTelemetry has a lot of tools for that. In the OpenTelemetry Collector, there are processors, kind of plugins, that you can define using YAML to say: that argument, that thing, I want to filter out, I don't want to store it. So you're very flexible in your observability pipeline, but it's something that you have to take care of, to make sure that your developers haven't added something that you don't want to store. Sorry. Go for it. The question was: when Tyk sends data to OpenTelemetry, is this data only the HTTP status, like a 200 or 500, the status of the HTTP response, or is there a way to analyze the response of the request itself? So: what do we track, what kind of data do we expose from Tyk? In Tyk, the gateway, when it's being called, you will get the answer, but the traces it exports using OpenTelemetry will contain all the data, all the steps, the traces that we saw in Jaeger. And you can also extend them: we have a plugin mechanism that you can load into it and add even more data, extend your OpenTelemetry traces. The next question is: where is the effort? So Tyk makes it easier for you, because it captures everything from the start up to the call to the upstream service and tells you how long it took. But if you want even more detail about what happens after that, that's where you need to instrument your services using OpenTelemetry. And the beauty of it is, when all the services speak the same observability language and they all send the data to the same place, then you have the full picture, and that's kind of the operational dream. Thank you. Yeah. You suggest running that outside of production, right? Right. Correct. Correct.
So you wouldn't use Tracetest from this point of view in production; you would use it in pre-production, where you need sampling to be at 100%. Yeah, we can also just stand outside; we'll wait so you can come by and chat with us, because we don't have time for follow-up questions. So yeah, we'll be here. Come by. Cool. Thank you.
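The YAML-configured filtering described in the privacy answer above (the OpenTelemetry Collector's attribute-processing plugins) can be illustrated with a small Python sketch of the same idea: drop or mask sensitive span attributes before export. The attribute names are hypothetical, and a real deployment would configure this in the Collector rather than write code.

```python
import re

def redact_attributes(span, blocked_keys, blocked_patterns=()):
    """Drop attributes whose key is blocked, and mask attributes whose
    value matches a sensitive pattern, before a span leaves the pipeline.
    A toy stand-in for Collector-side attribute/redaction processing."""
    cleaned = {}
    for key, value in span["attributes"].items():
        if key in blocked_keys:
            continue                      # drop entirely, e.g. user emails
        if any(re.search(p, str(value)) for p in blocked_patterns):
            cleaned[key] = "***"          # mask values that look sensitive
        else:
            cleaned[key] = value
    return {**span, "attributes": cleaned}

span = {"name": "gateway", "attributes": {
    "http.url": "/users?token=secret123",
    "user.email": "a@b.c",
    "http.method": "GET"}}
safe = redact_attributes(span, blocked_keys={"user.email"},
                         blocked_patterns=[r"token="])
```

The point from the talk stands either way: this has to happen in the pipeline, centrally, so that a developer adding a new attribute cannot accidentally ship private data to the backend.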
Public calendars aggregation using Linkal
Hello everyone. Is everyone hearing me correctly? Yeah? Great. My name is Julien Malka. I am a PhD student at Telecom Paris and I'm doing software supply chain security. I'm also a NixOS developer, but what I'm going to talk to you about today has nothing to do with that. I'm going to talk about a weekend project I did that is called Linkal, and about deficiencies I see in the public calendars ecosystem. I'm running with a pretty adversarial screen resolution, so if at some point the slides are completely broken, I will try to describe what you're supposed to see. Right. So what I'm going to talk to you about today is what I think is problematic in public calendars and the calendar ecosystem for collaboration. I'll explain a motivating situation that made me do this weekend project, and then I'll explain the two pieces of software we came up with to solve this situation. Right. So I think public calendars, or calendars in general, are sometimes a bit painful to interact with. And the problem I saw when I started thinking about this is: when you have a public calendar and you want to follow this calendar in your calendar client, there are different things that your calendar client can do. Maybe it has the capacity to import some ICS files, even files in bulk, but then it will not do anything more with these ICS files than display them to you; it will not, for example, subscribe to the updates of these events, and it will not continue to fetch new events as they appear. There is the intermediate class of client that will fetch the updates: so if your event gets updated, like a change of location or something like that, some calendar clients will update it. And some clients do everything that you want, which is basically also fetching new events as they come into your calendar.
The other problem that I think is big is that calendar providers are not always nice about the possibility of exporting your calendars as public calendars that others can follow. Sometimes they make it very complicated to find the actual option to export these public calendars, and complicated for people using other calendar software or providers to actually subscribe to your calendars. And I think the calendar ecosystem is also lacking some nice-to-have features that would make life easier. First, public calendars are not easily composable. It's not easy to take a few public calendars and merge them into one collection of calendars, which is something you might want to do when, for example, you want to follow all events about, let's say, NixOS (because I'm a NixOS developer) in your region, and you have several entities that organize these events. They all have a calendar, and what you would like to do is maybe do some aggregation of these calendars and propose a collection of calendars that other NixOS users might want to follow to get all the events in one place. This is not easily done. The other thing that I think would be really nice is filtering of events in calendars. Just like you are easily able to filter emails, why not be able to filter, from the calendars you follow, the events that are relevant to you? For example, events happening in a certain geographic area, or at a certain date or hour. That could be really nice, and this is also very complicated to do, I think. So all this thinking came from a concrete situation: at my school there are a lot of different associations that all organize their own stuff, and they all maintain some kind of place where they put all the events that they organize. Sometimes it's a calendar. Sometimes it's just a plain web page that you cannot do much with. Sometimes they just send emails.
But there was no central place where you could just see everything being organized on campus and stay informed that way. We had a first iteration of a solution for this problem. The first software, developed in-house at my school, was called Mitis. Mitis is a web service with an interface that shows all the events from all the calendars. It's really nice and it was a first step in the right direction. You can ask this interface to export an ICS file so you can import all the events into your calendar client. But what you cannot do is ask it to act as a CalDAV server: you can't add it to the calendar client on your phone or computer and have all the events updated in real time, following everything without any action on your part. When I saw that, I thought: I really want this to be a CalDAV server. So I created Linkal. Linkal is a weekend project, and it does exactly that: it takes this idea and implements it as a CalDAV server. When I thought about Linkal, the design goals were these. I wanted a CalDAV server that presents several calendars, coming from different places, as one collection: to the client it looks like one collection of calendars that it's importing, but each of those calendars is actually hosted somewhere else. The other design goal was to be able to do some processing locally — for Linkal to be able to process the events in one way or another, so that at some point we can have features like the filtering I was telling you about. Okay, so the first iteration. When I was trying to implement this, my first idea was: I'm going to implement this in Rust, because why not.
And actually I wanted to learn Rust at the time. It was going to be simple: I would use some Rust libraries that act as CalDAV clients — there are minicaldav and kitchen-fridge — and these libraries would perform the requests to the underlying calendars. That part is logical and easy, but the problem is that you also have to implement the whole WebDAV/CalDAV specification on the other side. You have to implement an HTTP server with all the endpoints of the WebDAV/CalDAV specification, take all the incoming calls, and rewrite them in terms of function calls into these libraries. The problem here was that this is a bit too painful, because the CalDAV/WebDAV specification is very big and a bit complicated — a lot to do for a weekend project. So I thought: no, this is too complicated, too painful, there has to be something else. For the second iteration, I wanted to implement as little as possible of the WebDAV/CalDAV specification and still get something working. The idea is to rely on the clients: CalDAV clients know how to format the requests correctly, and the underlying CalDAV servers we are trying to aggregate know how to answer those requests. So basically, somebody already did the job for me, and what I need to do is only forward the appropriate request, with the appropriate body, to the underlying calendars, get the answer, and maybe do some kind of modification to the answers at some point — but we try to keep that to a minimum. So what we have is: the client connects to Linkal.
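The "forward and aggregate" idea can itself be sketched compactly. Here is a stdlib-only Python illustration of merging the WebDAV multistatus answers from several upstream servers into one document to hand back to the client — an assumption-laden sketch of the proxying concept, not Linkal's actual Rust implementation, with fake upstream answers for the example:

```python
import xml.etree.ElementTree as ET

DAV = "{DAV:}"  # WebDAV namespace in ElementTree's {uri}tag notation

def merge_multistatus(responses):
    """Merge several WebDAV multistatus documents (one per underlying
    CalDAV server) into a single multistatus for the client."""
    ET.register_namespace("d", "DAV:")
    merged = ET.Element(f"{DAV}multistatus")
    for body in responses:
        for resp in ET.fromstring(body).findall(f"{DAV}response"):
            merged.append(resp)
    return ET.tostring(merged, encoding="unicode")

# Two fake upstream answers, each advertising one calendar:
a = '<d:multistatus xmlns:d="DAV:"><d:response><d:href>/a/</d:href></d:response></d:multistatus>'
b = '<d:multistatus xmlns:d="DAV:"><d:response><d:href>/b/</d:href></d:response></d:multistatus>'
combined = merge_multistatus([a, b])
print(combined.count("<d:response>"))  # 2
```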
Linkal forwards the request to the underlying calendars, the answers come back, and we forward the answer to the client — we essentially act as a proxy. At this point some processing of the request can happen, some filtering, and some minimal modification needs to be done. Okay, so if we go into the depth of the subject: we have two kinds of requests to handle. The first kind are the requests the client sends us to discover the calendars inside the collection. These we have to implement ourselves, because forwarding them to the underlying calendars would make no sense. The second kind come once the client has acquired the list of all the calendars in the collection we are exposing: it can then query the individual calendars, and those requests we can forward to the underlying calendars and do practically nothing to them. Okay, let me give you an insight into how this works. In a CalDAV client, you write down the URL of the server, a username and a password, and the client queries the WebDAV server to ask what the calendar home is for that user, for that principal. So we implement one endpoint, /principals/linkal — linkal is the username you should give your CalDAV client — and the client queries that path. What clients send is called a PROPFIND request — a property-find request. It will ask for
a lot of different properties that we don't really care about, but at some point it asks for the calendar-home-set property, and when it does, we answer that it should go and look at the path /cal. And when the client behaves correctly, that is what it does next: it goes to that path, because it now knows this is the collection root for calendars, and tries to find out which calendars are in the collection. So it queries this path. At this point I also tried to implement this endpoint myself, by guessing which properties to send back to the client, but that was too painful as well, so I took another direction. Instead, I forward the request the client sends me to all the underlying calendars, they all answer, I aggregate the answers, and that is what I send back to the client. So now the client knows all the calendars that are in this collection. We do have to do some hijacking of the answers, modifying some of the fields. There are a lot of cosmetic fields you can modify, but the most important field we need to modify is the URL of each calendar. Each underlying server, when it answers the request, says "find this specific calendar at this specific URL" — and it gives its own URL. We have to change that so it corresponds to where we can answer requests for that calendar, so we rewrite each calendar's URL to /cal/<calendar-name>. Now the CalDAV client has a list of URLs, one per calendar, and it queries those URLs to fetch the events. And this is the part where we just shamelessly forward those requests to the underlying servers, acting as a man in the middle. And again, when the responses come back, we can
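That URL hijacking — rewriting each calendar's href in the aggregated answer so it points back at the proxy — is the one modification that really matters. Here is a stdlib-only Python sketch of it (a hypothetical helper, not Linkal's code; the upstream response is invented for illustration):

```python
import xml.etree.ElementTree as ET

def rewrite_hrefs(multistatus_xml, calendar_name):
    """Rewrite the hrefs in an upstream multistatus response so they
    point at the proxy's path (/cal/<name>/) instead of the underlying
    server's own URL."""
    ET.register_namespace("d", "DAV:")
    root = ET.fromstring(multistatus_xml)
    for href in root.iter("{DAV:}href"):
        href.text = f"/cal/{calendar_name}/"
    return ET.tostring(root, encoding="unicode")

# Fake answer from one underlying server, advertising its own URL:
upstream = (
    '<d:multistatus xmlns:d="DAV:">'
    "<d:response><d:href>/remote/cal/abc123/</d:href></d:response>"
    "</d:multistatus>"
)
rewritten = rewrite_hrefs(upstream, "nixos-meetups")
print("/cal/nixos-meetups/" in rewritten)  # True
```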
do some small modifications — cosmetic ones, like changing the color a calendar should have when it appears in your client. It may happen that you aggregate several calendars that have the same color, so you want to adjust that in Linkal, so that when the collection appears in the client, the calendars all have different, nice colors. So, as a little working example: let's say I want to offer users a NixOS calendar that aggregates several calendars offered by different entities. I have three entities — for example, an association that organizes NixOS meetups; let's say a school that offers NixOS courses; and some Nix parties, which are very real things organized by Nix people, in a third calendar. So I have three different calendars with three different hosts. The way it works is that I create a JSON file that states which calendars I want to integrate into my aggregated collection — I just list them. Then I run Linkal with this calendars JSON file, and it gives me a Linkal server. So if you want to try it at some point during the day and tell me that it doesn't work on your specific client — or that it does work — the server is currently live. On macOS or iOS, which is what I was using when I worked on this project, you add a CalDAV collection, specify the URL I gave you and the user linkal, and what you get is one collection containing these three calendars, displaying the events that are in them. And whenever the underlying entities add new events to these calendars, it will update and be
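The calendar list file described here might look something like the following. The field names and URLs are invented for illustration — check Linkal's README for the actual format it expects:

```json
{
  "calendars": [
    { "name": "nixos-meetups", "url": "https://association.example/dav/meetups" },
    { "name": "nixos-courses", "url": "https://school.example/dav/courses" },
    { "name": "nix-parties",   "url": "https://parties.example/dav/fun" }
  ]
}
```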
available in your client directly. It also works on Thunderbird; I don't really know about other clients. Now let's talk about what I would like to do in the future. As I told you, one of the goals of this project is to have some kind of filtering feature, where you can say: I'm only interested in events happening in, let's say, this city, or happening on Tuesday nights, or whatever. The way Linkal is currently implemented, you could do that by going into the Rust code base and implementing the filters yourself — which is, admittedly, not a great user experience. So what I think I want to do, if I ever get some time, is devise a domain-specific language in which you can write filtering expressions for your calendars, so you would have the expressivity to express the kinds of filters or rules I just described. You would then upload such an expression to Linkal, and it would do the filtering before the events reach your calendar client. The other thing I want to improve is that Linkal is currently only able to serve one calendar collection. One improvement I would like to make is multi-tenancy, so it could host as many calendar collections as needed, with some kind of web interface where you could upload these domain-specific-language expressions to define new calendar collections. And the last thing I want to say is that I think this kind of filtering idea could maybe also be adopted by CalDAV servers in the future, and so perhaps enter into some standardization. Thank you for your attention. Linkal is available on GitHub at this URL. If you have any questions, I'd be happy to answer them. Yes? Hello. First, as someone who has dealt with a lot of calendar hell, I appreciate the effort you're putting into this project.
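The filtering DSL mentioned here doesn't exist yet, so the following is purely an illustration of the idea: a tiny invented syntax ('key = value' clauses joined by '&') compiled into a predicate over event dictionaries, in stdlib-only Python rather than Rust:

```python
import re

def compile_filter(expression):
    """Compile an expression like 'location = Paris & weekday = Tue'
    into a predicate over event dicts. The syntax is invented for this
    sketch; it is not Linkal's (future) DSL."""
    clauses = []
    for part in expression.split("&"):
        m = re.fullmatch(r"\s*(\w+)\s*=\s*(\w+)\s*", part)
        if not m:
            raise ValueError(f"bad clause: {part!r}")
        clauses.append((m.group(1), m.group(2)))
    def predicate(event):
        # An event matches only if every clause matches.
        return all(event.get(key) == value for key, value in clauses)
    return predicate

keep = compile_filter("location = Paris & weekday = Tue")
print(keep({"location": "Paris", "weekday": "Tue"}))     # True
print(keep({"location": "Brussels", "weekday": "Tue"}))  # False
```

Uploading such an expression to the server, as the talk suggests, would then let the proxy drop non-matching events before they ever reach the calendar client.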
My question is: is there any sort of write functionality — maybe you covered this earlier — but if you're just proxying things, and you have the appropriate credentials, could you add events to these collective calendars, or is it a read-only setup? You mean, can you add events through Linkal? Yes — there is no limitation preventing that, but what kind of events would you add? What's the use case: you're managing the collection and you want one more event to appear to the people following it? Yeah, or maybe the people who are subscribing — the people receiving these events — say: hey, I want to have an unofficial after-party, I'm adding it after the main event, so other people can see it. The immediate answer I can give to this is: if, as a collection manager, you really want to add some events, you could add your own calendar that you manage, add the events to that underlying calendar, and it will just work. There is no real limitation stopping you from doing it directly through Linkal either, but in terms of user experience there is no real interface where you could do this easily. OK, thank you. Thank you for the talk. Have you considered aggregating from social media, like Facebook or similar? Sorry, I didn't hear very well. Oh, sorry — have you considered aggregating from social media, like Facebook or similar? Would this also work? I have not considered this yet, but it could totally be an option. Linkal is currently a very rough prototype, and what I want to do is add other ways to integrate events that don't come directly from CalDAV servers. The priority is adding events from endpoints that just serve ICS files, which I know some people have asked for.
But adding events from sources that aggregate events, like social media, is also interesting, and I will consider it. Other questions? OK. Thank you.
Indico: an event management system
Okay, thank you very much. Hi everyone, I'm really happy to be here. I'm Pedro Ferreira, a software engineer at CERN, and I'll be talking to you about Indico together with Dom, who will do the second half of the presentation. First of all, it's a pleasure to be here — it's our first time at FOSDEM and it's really nice to see such interest. Thank you. So, as the title of the talk says, we'll be talking about Indico, which, as you may have realized by now, is an event management system. It is a collaborative effort and an open source project under the MIT license, developed mainly at CERN with contributions from the United Nations and the Max Planck Institute for Physics, and it counts contributions from more than 70 developers over the last 20 years or so. Indico is probably the most popular event management system you have never heard about: there are something like 300 servers around the world, most belonging to educational, research and scientific institutions, serving more than 350,000 users. It started out in the research world — CERN is, as you know, a research laboratory — but then spread out to different environments, and there are a few examples of organizations from different domains that are already using it. So, a little bit of history, starting in 1999. The physicists working at the Large Hadron Collider — which back then still didn't exist; they were still designing and building it — needed an application to manage their meetings. What would normally happen is that you'd have a meeting, you'd exchange a few emails with the slides and so on, and then this would get lost at some point, because it would be spread around the mailboxes and disks of different people.
So they wanted an application that could serve as a focal point for this sort of event, and as an archival platform as well. The first attempt at this was CDS Agenda. Then in 2002 the opportunity came up, with a European project focused on building a conferencing platform; they put the two ideas together, and that's when Indico was born. It went into production in 2004. In 2007 we added a room booking system; in 2008, a full interface overhaul. In 2013 came the first workshop, and word of mouth started spreading. In 2015 the United Nations adopted it, and we started a really fruitful collaboration which goes on to this day. In 2017 we did a full rewrite of the application — we had been working on an aging software stack, and we even changed the database system, moving to Postgres. In 2021 we moved to Python 3 with Indico 3.0. In 2023, last year, we surpassed 1 million events at CERN alone, and in 2024 — this year — we celebrate our 20th anniversary. Now, you may have heard about CERN: the big tunnel we have underground, the LHC, the detectors, and all the things that happen a hundred-and-some meters underground. A less known facet of the organization — well, maybe not for you, because you're all tech people — is that the World Wide Web was invented by Tim Berners-Lee at the organization in the late 80s and early 90s. And CERN actually produces a lot of open source — it uses a lot of it too, but it is really a net contributor to society when it comes to open source. Open science is at the core of our mission, and we have a series of software products which are used around the world to this day, developed mostly at the organization in collaboration with several labs: Invenio, Zenodo, and also ROOT, White Rabbit, and a few other things.
There's also the CERN Open Hardware Licence, which goes to show how the laboratory was a bit of a pioneer in the whole open hardware movement. And last year we also set up our own open source program office. As I said, we also use a lot of open source software — many of those projects are represented here today at the stands, so thanks, everyone, for your help. A little bit of publicity: there are three other talks from CERN at this conference, so if you're interested in storage, or in research data management with InvenioRDM, you're invited to pop by. Coming back to CERN: we have around 17,000 people on campus at any time, around 230 meeting rooms, and we organize more than 100,000 events a year — meetings, lectures, conferences, all sorts of things. Many of these meetings are highly distributed. When Indico came up, the objective was to solve exactly this problem: how do we get super-big collaborations of thousands of physicists to work together in a distributed environment, and how do we reconcile that with the organization's physical presence? This, by the way, is the Science Gateway — a pretty recent addition to the laboratory, a super fancy project by the same architect who was responsible for the Centre Pompidou in Paris. Just a disclaimer: we don't work in this building; we obviously work in the Brutalist buildings back there, where the IT department is. But you should really visit it — it's a really nice place. At CERN, Indico became quite popular very quickly, and we've been growing year after year. This is the number of new events per year, so we're still accelerating. And these are just a few examples of events, meetings and conferences currently hosted on CERN's Indico server. There are basically two types of events.
There are conferences — the more traditional workflow, where you have a call for abstracts and paper reviewing, with workflows that let people interact: reviewing of papers, refereeing and so on. And there are meetings, a somewhat simplified view in which you can upload your slides, share them with other people, and have a common shared schedule. And now I'll switch over to Dom. All right — people call me Dominic or Dom, I don't really care. So, this is Room Booking, a module which is part of Indico. As you can see in this screenshot, you've got a Leaflet-based map on the right, which shows you rooms, and on the left a timeline of the rooms that have been booked. Very, very simple stuff — but it's not just that. So, let's go into the technical aspects of Indico. At its core, it's very general-purpose: just because we use it at CERN to handle our conferences and meetings — and everything else — doesn't mean it's set in stone what you can use it for. You can use it for almost anything in that realm. You can also extend it through plugins, and customize it with standard CSS and so on. Under the hood, yes, it is a Python application, specifically Flask-based — that handles our back end. For the database, PostgreSQL — I believe they have a booth here. Then we have other components as well, such as Celery, which handles our background tasks, and SQLAlchemy, which is essentially the ORM for Postgres, again Python-based. For the front end we use React, with Semantic UI for the styling, and a lot more services on top. Okay, so, as I said: plugins, extensions — yes, Indico has them. You might be interested.
So, these are just a couple of our plugins — there are a lot more: video conferencing, payments, conversion to PDF, search via Elasticsearch, storage, URL shortening, and a lot more that Indico handles under the hood for CERN. For example, we've got a nice one-click Zoom join plugin, as you can see there. Payments: yes, CERN does handle payments for conferences via its own plugin — you can see there that we take payments via the PostFinance plugin — and for people running their own instances, there are third-party integrations out there for collecting payments via Stripe, and PayPal as well. Workflows: when you come to CERN, you'll probably go to a conference, and we have our own internal workflow for handling your access and related things. And a bit more on access: yes, Indico can also handle printing your badges, and your actual access onto the site. Recording of events: this goes back a little to Zoom, but Indico handles the entire life cycle of conferences and events. Here's a quick screenshot — you can record an event, and on our side at CERN the recording goes to our CDS archive, where it can be played back on demand; that is the archive for our events. Okay, you saw a little bit about Room Booking earlier; this is our internal spin-off called Burotel. Room Booking, as it says on the tin, is for rooms; Burotel is for desks. At CERN we provide a modified version of Indico which only has this specific module, modified via a plugin. And going back to what I said earlier, you can also customize it. Here is my screenshot of the International Linear Collider Indico instance, which is hosted at CERN.
It has its own look and feel. And it's not just the front page — you can also customize your meetings with the same CSS rules. And here's one more, the conference page for Higgs 2020. Now, one more thing: we have a nice check-in application. Previously this was a React Native application, but around last year we rewrote it from scratch as a PWA, a progressive web application. As at any other conference, you might have someone at the door scanning your badges, scanning your tickets, and this is an application you can use on your smartphone. It gives you all the functionality you would expect from a badge scanner — a QR code reader — and also lets you bring up the details of who's attending, check them in, and other bits and pieces on top. Okay, one last thing, I guess: it's a very accessible event management system, it's open source, and we have a pretty nice and thriving community. Here's a screenshot of our forums, where everyone is welcome. And, yeah — if you have any questions... and be sure to follow us, as a shout-out, I guess. But that's all, thank you. Thank you. I was wondering if you also have some kind of back end for budgeting. Like, when I organize a conference, I want to make sure that all the money we receive then pays for the things I'm going to spend on for the conference. Should we repeat the question? Yeah — so the question is whether we have some sort of back end for budgeting, to budget different aspects of the conference. And the answer is no. I mean, you have a customizable registration form where you can assign prices to items — I don't know if that's what you need — but in terms of doing financial data analysis and so on, we don't have anything like that.
But you can extract everything to Excel, basically, and do that on a spreadsheet. Okay. The next question: I think there is some space for integration with conferencing tools, and is there a way to manage distribution of Wi-Fi passwords to participants, or tokens and discounts for social events in the evening? Can you repeat the question? Yes — so the question is whether there is some way to distribute Wi-Fi passwords to participants. That's it? Wi-Fi passwords, or tokens for social events? Not built-in, but you could probably implement it through a plugin, right? This would be plugin-based, so you'd probably have to write something yourself, or hire someone to write it. Sorry? So neither for tokens nor for Wi-Fi passwords — you'd have to do plugins. Yeah, there's nothing built-in for that, no. Yes? Is the time of attendance registered for participants? So, the question is whether the time of attendance per participant is registered. Well, not attendance as such — I think we don't have a mechanism for people to say "I'm attending this talk" and so on. But we do have the check-in time: with the app Dom presented before, if you check a person in, the time is registered, and you have a log of who checked in at the event. But that's more for the reception part of the event. Is there also a check-out, or only check-in? Only check-in, yeah. So it's like Hotel California, if you want. Yes? Are there plans to have a progressive web app for participants and attendees — not for the organizers — for example, to see the schedule of what is happening?
So, the question is whether there are plans for a PWA targeting the participants' side of the event, not so much the organizers' side like the one shown here. The answer is yes. We are planning to get started this year; there are some funding issues to be addressed — as is often the case, as you probably know very well — but it's on the plan for this year. Yes? What priority does accessibility have in the UI you showed? It's a very good question. In terms of accessibility in the UI, the code is currently going through a phase where, in collaboration with the UN — the UN has hired a developer to contribute improvements to accessibility back to Indico. So it is a work in progress at the moment. There are some features which are going to be released soon, or are already available in minor releases; many of those have already been merged into our main branch and will be included in the next release. But there's a lot of work currently being done to make sure that we pass the WCAG guidelines. Yes? What about developer documentation — is it well documented, so people can easily access and contribute to the project? So, regarding the developer documentation and how someone can contribute: yes, there is documentation out there. If you go to getindico.io, we have a couple of pages on how you can contribute back to the project, and we've also got a pretty good README and some Read the Docs pages on contributing, covering things like how to set up your own developer instance, and everything down to how to write a half-decent commit or PR. There's also some API documentation — Sphinx documentation generated from the code.
It's not as complete as we'd like, but it's a work in progress. Any other questions? No one? Well, thank you very much. Thank you.
OpenTalk - Video conferencing secure and GDPR compliant
I need Stefan to support me on the in-depth technical details, because he is more proficient in these areas than I am. This is a very high-level overview of the project; we are not going to go deep into the details, but if you have questions that go deeper, just ask them at the end. If you want the product-side or customer view, you can always use the official contact channels and you will get answers there. So, a little bit of background about OpenTalk. There is a company behind OpenTalk. It was founded in 2021, in the middle of the pandemic, by a group that has been doing consulting and training for Linux, mail operations and hosting for more than 30 years, and that is also the provider of the well-known mail operator mailbox.org. The OpenTalk company currently has around 20 employees, so it is growing slowly but steadily. So, who are we? I am Wolfgang. I joined OpenTalk roughly one and a half years ago and became the backend team's tech expert — more or less the technical lead — in July last year. I have a master's degree in embedded systems design, but I am much more on the software side than on the hardware side. I have been doing Rust since 2015, I am still in the honeymoon phase, and of all the languages I have used, this is the longest honeymoon phase I've ever had. I am also the co-founder and organizer of a Linux user group, and you can find me on the Fediverse. So, Stefan? Yeah — I have been with OpenTalk for two and something years now, and I am mainly on the media team, which is our team for all the real-time stuff: audio, video, recording, streaming, WebRTC. It sits somewhere between front end and back end. Before that I was at university for a long time, doing parallel programming, operating systems, some real-time things and software-defined radio — if you are interested in that, just talk to me later. Okay, some information about the project in general.
The front end of the project is written in TypeScript and the back end is written in Rust. It is free software under the copyleft EUPL 1.2 license. You can find technical documentation online at docs.opentalk.eu. There is also a Fediverse account called OpenTalk Meeting — you will find it under that name — and there is a Matrix channel as well, hosted on matrix.org. The Matrix channel is where some of the devs hang around and answer technical questions, but it is not an official support channel in that regard. Okay. The user interface: this is what the video conferencing software looks like — roughly similar to what you know from other programs. It was important to us to make a nice design that looks good and is comfortable to use. We also have what we call the dashboard. This is where you can create meetings: you can set start and end dates, create meeting series, and — maybe that's on the next slide — you get an email when you are invited to a meeting or when a meeting is canceled, and the creator of the meeting also gets the invite, so they can put it into their own calendar. Okay, a short list of the features. We have a lobby with a mic and camera check, so you can verify that everything is working. We have some interesting moderation tools, one of them being the coffee break, which we will show on the next slide, and a timer, so you can assign tasks to people and say: okay, you have 10 minutes for this, and if you want, report back when you are ready; the timer ends when everybody is ready or when the timeout is reached. For meeting participants, we have a poll feature and breakout rooms. Screen sharing — that's well known in conferencing software; one important detail here is that multiple people can share their screen at the same time, which comes in handy for pair programming.
Yeah, you have the speaker view, where you always see a large picture of the person who is currently speaking. You can call in from mobile or landline phones via SIP, and we have integrations for a shared editor, which in this case is Etherpad, and a whiteboard, which is currently Spacedeck. I already mentioned the invitations. Right now we are in the course of finishing recording and streaming, so you can record meetings and live stream them as well, and the idea is to also allow streaming to multiple platforms at the same time, so you can have YouTube, Twitch and an Owncast stream simultaneously if you want. If you are interested in that, talk to Daniel over there; he did part of that work. Yeah. Okay, so here you see a screenshot of the coffee break; that's what it looks like. Everybody gets this full screen as soon as the coffee break is started, but you can go back into the conference any time you like, for chit-chatting up front before everybody is back, just like in real life. And this is another nice feature we have; we call it the shared folder feature. In the dashboard, when you create the meeting, you can enable this shared folder switch. It must be configured for the OpenTalk instance, but then the system will create a shared folder on a Nextcloud instance — this is the part that needs to be configured. It will create two shares to this folder, one of them being read-write and the other one read-only. The moderators of the conference receive the read-write link so they can put their material into this folder up front, while all the other people have access to it either by clicking on the link in the invitation mail or by opening it through the icon during the conference. Okay, so this is the more technical part; I'll give the word to Stefan here. So that's what it looks like from the rough perspective of a developer or administrator of the system.
So it's not just one big service; we tried as much as we can to use existing components. What we built ourselves is mainly the dark-colored parts, and the other services are more or less what you get from the respective projects. So we use RabbitMQ for communication and Janus as the media gateway, but we manage all the video rooms on our own using our controller backend. As said, there is a web front end written in TypeScript and React, but it's kind of symmetric to what happens on the other side with — I like to call them backend clients — for streaming, call-in and all that stuff. They just have another way of starting the whole process, but they actually do the WebRTC and signaling just as the front end would, and by now they also have a way to authenticate services against Keycloak via service authentication. That's also — we'll see it later — a way you can extend our system. It's meant to be scalable, so you can have multiple instances and they just share their data: Redis and so forth holds the session stuff, and the persistent data — which rooms exist, which users are invited to which rooms — is stored in a normal relational database. And we do a lot of integration work on the OpenID Connect/Keycloak side with the other user systems or databases that people tend to have already on site. Okay, so this is a sneak peek of Rust code. It's not ready yet, but we are approaching it.
We are right now working on extracting the protocol data types into a separate library, which was not the case when I started working with OpenTalk, and the idea is to publish the client library to crates.io, which is the default publication platform for Rust code. It should be as easy as this — I mean, the authentication is usually a little more involved than these two lines — but you basically connect to the instance and can do things with the client. This is the web API for managing appointments and so on: here we create an event with a title that we set, and then we invite a user with an email address at example.com and the role of a regular user; you could do the same for a moderator as well. So the idea is to allow automation and integration in a very easy and approachable way if you're familiar with Rust code. This is also what we will be using for the recorder, which connects to the meeting session, for the call-in via a landline or telephone, and for other, maybe future, services.
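To make the idea concrete, here is a hedged sketch of what such client-library usage might look like. The `Client`, `Event` and `Role` types below are stand-ins stubbed locally so the example is self-contained and runnable; the real crate's API, once published to crates.io, may differ, and real authentication would go through OpenID Connect rather than a plain `connect` call.

```rust
// Hypothetical sketch of an OpenTalk-style client API; all names here
// are assumptions, stubbed locally so the example compiles on its own.

#[derive(Debug, Clone, PartialEq)]
pub enum Role {
    User,
    Moderator,
}

#[derive(Debug)]
pub struct Event {
    pub title: String,
    pub invitees: Vec<(String, Role)>,
}

pub struct Client {
    base_url: String,
}

impl Client {
    // Stand-in for connecting to an instance; the real flow would
    // authenticate against Keycloak via OpenID Connect first.
    pub fn connect(base_url: &str) -> Client {
        Client { base_url: base_url.to_string() }
    }

    // Create an event with a title via the web API (simulated here).
    pub fn create_event(&self, title: &str) -> Event {
        Event { title: title.to_string(), invitees: Vec::new() }
    }

    // Invite a user by email with a given role (simulated here).
    pub fn invite(&self, event: &mut Event, email: &str, role: Role) {
        event.invitees.push((email.to_string(), role));
    }
}

fn main() {
    let client = Client::connect("https://controller.example.com");
    let mut event = client.create_event("Weekly planning");
    client.invite(&mut event, "alice@example.com", Role::User);
    println!("created {:?} on {}", event, client.base_url);
}
```

The point of the sketch is the shape of the workflow — connect, create, invite — which is what the talk describes as "as easy as this"; error handling and async plumbing are omitted.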
So, talking about these kinds of services, that's the flow you have there. You build your new backend service, which will act like a client to the conference. It first needs to authenticate and get an access token, however you set that up, and then you go to the backend and say: hi, that's me, and that's the token I got, so I'm authenticated, and I would like to join this room over there, which has this ID. By that you open a WebSocket where all the signaling happens, and you see the publications of media streams: the backend will announce when new users arrive, and will also announce which screen share and camera streams they have, so you can then start the WebRTC connection with Janus. On that signaling channel you just exchange SDP and other credentials to get the right streams set up. In our case we usually use GStreamer as the media stack here, which is then set up to receive all the streams and, for instance, do video mixing. And when you're done with your recording — somebody tells you on the signaling channel, okay, stop recording now — you just upload the video file it produced to the controller again, which puts it into an S3 storage. Currently, for development purposes, we use MinIO, but you can use whatever S3 you would like. There it also becomes available on the dashboard, and that would work with other artifacts too: a whiteboard export or meeting minutes would be the same thing, just another document format. And what I missed out: the other way around, when you don't initiate the action yourself, there's also the RabbitMQ side, where you can attach and listen for the controller to call you and say: hey, service, you should do something, like start a recording, and then you just start the signaling session. That's basically it. Yeah, okay, that's also your part. Yeah, so we talked about — we've seen — a lot of components which are open
source and which we integrate. There have also been — as we are a company — other companies and software developers that we integrated with, so I guess that's one of the main themes: we and other people have projects and try to integrate with each other. There is a UCS integration, where they basically have their Keycloak and user management part, and we just connect there. And there is innovaphone, which does mainly SIP and also has its own platform; we integrated there via OpenID Connect and made some adjustments to our SIP stack so that we are compatible with them. And it goes on: with mgm we just started, I guess — they talked about how we could do components where you would just embed the video part, not the whole front end, but that's in the starting phase. And as many people use it right now and there has been high demand, we did an Outlook plugin, and there has also been some talk of a Thunderbird plugin, but it's just not yet on the way, I guess. So yeah, if you have some questions, or need or want to do something on your own, just talk to us and we'd be happy to tell you what's going on and to support it as far as we can. Okay, that's it, more or less. We tried to keep it short, so if there are specific questions about details — yeah, just go ahead. You haven't mentioned end-to-end encryption at all yet, and I know that Jitsi already has some support for end-to-end encryption, and Matrix is now also getting into the real-time communication business, and I was wondering: what is your strategy here? Yeah, I can say a word, I guess. It's not so easy, is the starting point. The thing is, if you want to do end-to-end encryption, you basically don't trust the backend — that's the deal — and we're talking about a web application right now, which is a problem, because in the first place you would load your application from the
server you don't want to trust. So we are looking into how we can ensure that you can really maintain the integrity of all your personal keys and all that stuff, and that's pretty hard to do in a browser environment. Of course we could encrypt the media connections, but that's just half of the deal. So basically we're in the process — it's also a goal for certain projects we're working on — but it's not yet at the point where I can say: okay, that's the route we're going to take, and here are the details. That's why we didn't put it on the slides yet. If there are questions on that topic, maybe we can have a discussion in detail later on, or if you have specific needs in that direction, also let me know. I'm interested in what you consider to be very important features or properties which are not yet in any open source video conferencing solution and which you are working on — which you also don't have yet, but are working on. What are the important pieces still to come? So yeah, as mentioned, there is the whole streaming and recording part, which right now is one of the main things, so we can support bigger conferences with the feeling of being in a room. For now we're finishing the low-level streaming part and the first UI part to enable streaming, but we're thinking a lot about how to have a mode where you have a stage and an audience: the stage would be a normal WebRTC conference, and the audience would get to see the live stream and get a chat interface, but it would all happen in our user interface. That's something to come, I guess, but we have no time frame for it right now. And the other part we are in, from the project side, is all the telephony stuff, like SIP and H.323, I guess, which is the old video conferencing standard on telephony networks. I guess there's much more, but there was another question. So: our organization is
100 people, but once in a while we host conferences for a few thousand, and I wonder: should we then have a very large Janus media gateway just for this one event per month, or is there a way to easily scale the resources down and up? Because I've heard of federation of media servers in the Matrix context, and I think this is a very interesting concept when organizations have joint conferences. So yeah, we also thought about that long and hard, and there is a limit — if you don't cascade Janus instances, there's a limit on how many subscribers there can be for a single publisher, so for the speaker in the room — and that's, in our experience, say three to four hundred, depending on how you configure load balancing and all that. Instead of doing cascading and all that, we are right now looking more into the streaming direction than into having it cascaded and real-time for everyone, because usually the audience will not interact heavily, and you would have to invest a lot into getting all of those people in fast. It might be a thing, and we are also looking into how Matrix does it — underneath they use LiveKit, as far as I know — but we are exploring the other direction of having it on streaming and moving people in and out of the room, so into the WebRTC conference or back into the stream view. That would be my take on that, because then you can have it more resource-efficient: a small meeting which is easily manageable, plus a streaming setup which easily scales to lots of people. Thank you. So the question is: is there support for island audio, as in, in a large meeting, two people talking to each other alongside the speaker without interfering with the others? This has been on the roadmap for quite some time already. The idea is to lower the main room audio volume and have a private talk with a subgroup of the conference, but it has not been implemented yet. I guess
we already have a specification for it, but not the time to build it yet.
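The backend-service flow described earlier — authenticate, say hello with the token, join a room, react to signaling announcements, then hand off the artifact — can be sketched as a toy state machine. Everything below is illustrative: the message names, states and transitions are assumptions modeled on the talk's description, not the real OpenTalk protocol, and the network, WebRTC and GStreamer parts are reduced to comments.

```rust
// Toy model of a recorder-style backend service joining a conference.
// States and signal names are invented for illustration only.

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Idle,      // authenticated, not yet in a room
    Joined,    // room join accepted, WebSocket signaling open
    Recording, // at least one media stream is being consumed
    Uploaded,  // artifact handed to the controller (S3 behind it)
}

enum Signal {
    JoinAccepted,
    MediaPublished(String), // backend announces a participant's stream
    StopRecording,
}

struct Recorder {
    state: State,
    streams: Vec<String>,
}

impl Recorder {
    fn new() -> Self {
        Recorder { state: State::Idle, streams: Vec::new() }
    }

    fn handle(&mut self, msg: Signal) {
        match (self.state, msg) {
            (State::Idle, Signal::JoinAccepted) => self.state = State::Joined,
            (State::Joined, Signal::MediaPublished(user))
            | (State::Recording, Signal::MediaPublished(user)) => {
                // Real service: set up the WebRTC connection to Janus and
                // a GStreamer pipeline for this announced stream.
                self.streams.push(user);
                self.state = State::Recording;
            }
            (State::Recording, Signal::StopRecording) => {
                // Real service: finalize the file and upload it via the
                // controller into S3-compatible storage (e.g. MinIO).
                self.state = State::Uploaded;
            }
            _ => {} // ignore messages that don't fit the current state
        }
    }
}

fn main() {
    let mut rec = Recorder::new();
    rec.handle(Signal::JoinAccepted);
    rec.handle(Signal::MediaPublished("alice".into()));
    rec.handle(Signal::StopRecording);
    println!("{:?}, recorded streams: {:?}", rec.state, rec.streams);
}
```

The RabbitMQ-initiated variant described in the talk would simply feed an extra "please start" message into the same machine before `JoinAccepted`.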
Collabora Online usability optimization
Okay, so thank you for joining. The next talk is still about Collabora — second talk of the day about Collabora — which is about Collabora Online usability optimization. And we still have Caolán, who was in the previous talk, and also Michael, who is joining us. Thank you. Fantastic. This is Caolán, this is Michael. Good. This is what I'm going to say — you'll see it as we get there. And yes, fantastic, Caolán did a very good spiel earlier on how this thing works. So if you were in the previous talk, you saw something similar to this: you have your browser, and then you have a WebSocket talking to a server on the back end, C++. And this talks to LibreOfficeKit over a Unix domain socket, which does all sorts of beautiful interoperability rendering, tiled goodness. And this fetches data from an ownCloud, a Nextcloud, lots of things — any kind of WOPI host; even SharePoint I think we can use. Yeah, for the good guys, right? And yes, so anyway, this gets the file, pushes it in here, it renders it, and it comes back out to the browser. And we do all sorts of things to try and cache that. So JavaScript here, good stuff over there. Anything else on there? Nope, nope. Seems pretty simple. And I just want to talk a little bit about latencies. This is an interactive presentation. I'm not going to ask you to put your hands up just yet. But here are some timings, and the one I want to time is the human eye blink: 100 milliseconds for a human eye blink, okay? Right, so here we are. How good are you at blinking? Are you ready? Okay? So I'm going to press a button and we'll start blinking. And when you see red, stop. But you need to count at the same time, okay? You ready? Silently. Silently. Yeah, yeah, here we go. Ready? Ready? Are you ready? Go. How many? How many did you get? Do you want to try again? Yeah? Okay, so here is reciprocals for beginners, okay? So this is an advanced topic in maths, okay? If you need help.
Anyway, so if you're a falcon, you've got like 7.7 milliseconds, so that's pretty good. Me, I'm more around here — I don't know about you. Six, seven, eight. How many did you get? Do you want to try again? Okay, we're going to try again. You got the idea now, right? Okay, ready? Not completely, okay. So I'm going to click and it's going to go green. Start blinking, and count the blinks you're doing. Blink as fast as you can, right? As many as you can. I want to get a high score here, right? We're going for the peregrine falcon, 153 in a second, right? Okay, ready? Okay, three, two — you've not started yet, have you? Three, two, one, blink. Okay, that was a second. How many did you get? Five, six, seven, eight. Yeah, okay, fair enough. So this tells you your score. And interestingly, in the UK they say a blink takes between 100 and 150 milliseconds; at Harvard it takes between 100 and 400, which tells you something about Americans, maybe. I don't know — a slower pace of life is good for people generally. Anyway, sorry. So here we are. The very interesting thing is that when you start looking at some of these numbers — now on a log scale, so they're a bit more friendly — blinking is really quite slow. You can go from Frankfurt to the US east coast and back again in the same time, right? So that's pretty good. The 60 Hz frame time, 16 milliseconds, is also quite long: Frankfurt to Milan, or Frankfurt to London, is a similar time to the time it takes to get something on the screen, particularly when you add the monitor latency. So blink and you miss it. Lots of people are very worried about latency, and they don't have a good feeling for how long things take, so it's quite interesting to see some of these numbers. And also, in terms of typing: the average typist is supposed to do about three characters a second, a pro 6.6. Yeah, the human eye blink is quicker.
But you know, even me typing, not very accurately, it's quite quick — and if you mash the keyboard, it turns out you're massively faster, like 10 times faster than the average typist. It's not good for the text, you know. So yes, there we go. Anyway, I'm going to hand over to Caolán, unless you have anything to add? No, no, nothing to add on blinking. But the fundamental point — that networking is really, really fast, and stuff comes from one end to the other and back in a very, very short period of time — is great. So you generally don't have to worry too much about that part of things. Yeah, so what we do is that we have a bunch of demo servers that are generally publicly accessible, and what we started doing recently is to use perf to sample once a second and record for an entire week what happens on the public servers. At the end of the week we then generate a single flame graph from all of that, to see where our time is spent over the week generally. That's the demo servers. Multi-user testing: we have a call once a week — some of the people present in the room join it, along with people from other organizations and community members — and we just get a general feel for what it feels like in that little 10, 15, 20 person call: whether the applications are still responsive, and whatever issues arise in testing can be checked at that point. That is also profiled and a flame graph generated, typically one for Writer and one for Calc in recent tests, which are all stuck up on GitHub so you can look at them yourselves, if you're interested to see the change over time in what we're looking at. We use it internally at Collabora as well, of course, with the deployment that is used daily there, and the same week-long profile that I mentioned for the demo servers is run on the internal one now too. Yeah, so that's the tooling that we're looking at there.
And then interactive debugging, which — if you have Collabora Online — you can do yourself. You just go to Help, About, and you triple-click on the dialog there, and that will show you this debugging display that we're looking at here. There's loads of information in it. On the far right side are the tick boxes; as you check them, certain ones will display things in the bottom left corner to tell you stuff. But maybe more interesting is the one that we call the tile overlays. When you type in the document, you'll get these flashing areas, and that's the part of the document that has been required to be redrawn because of your interaction. So what you're really hoping to see, especially when watching these things while people are typing, is a small rectangle around the area of change that they're actually making. If the entire screen starts flashing, it means that a whole raft of other things has been redrawn — or invalidated, to be painted and redrawn later on — and we want to avoid that. These are the kind of flame graphs that we look at. Just for the purposes of reading them: the colors don't matter in these flame graphs, or most flame graphs. What matters is the width of the bar — the wider the bar, the more time, proportionally, has been spent there. So you want to take a quick look at it, see which is the widest bar, and see if you can make the wider bars narrower. I mean, there's nothing more to the profiling really. It's just: make the wide ones narrow. Yeah, so in this particular one, the widest bar there is this whole gigantic pile of Boost Spirit Classic, whatever, which is all being used to detect if the PDF that people are opening is a particular type of PDF — the hybrid PDF from LibreOffice, where you can embed the LibreOffice document inside the PDF, so when you open the PDF, you also have the original document.
It just takes a ludicrous amount of time, especially over the course of a week, to collect that information, when it can be done in many orders of magnitude less. Yes. So it's good to see that sort of stuff disappear off the profile. You should never optimize before profiling, obviously. Cool, thanks. Storing previous tiles. Yeah, so we've done a whole lot of work to improve our tile rendering performance. We store previous tiles that have been rendered, so we can see what the difference is and just send the difference. That saves a lot of bandwidth and reduces latency too. And we've completely rewritten how this is done in the last six months to a year. So we've always compressed it, with just a simple run-length encoding, but because we're extremely modern, instead of doing stupid stuff like using byte lengths and that kind of thing, we use bit masks — and you'll see why in a second. So the bit mask essentially says: is this pixel the same as the previous pixel? So you end up with a bit mask. We have 256-pixel-square tiles, so in four 64-bit numbers we can have the whole bit mask for a row. And yeah, it's pretty easy. This removes a whole load of things. Previously, we stored the tiles uncompressed and compared them uncompressed. That turns out to be massively slower: it touches much more memory, and it uses much more space. And we also did clever things like hashing each row while we were copying, but it turns out it's far better just to use the bit mask and skip some of that stuff. And Caolán and I did this fun thing with AVX2 — why not? You hear about these processor-accelerated things, and after shrinking our inner loop down to almost nothing, it was still not as quick as it could be on the CPU. So this is how we do it. We load a whole load — actually eight pixels — into a single AVX register, which is just kind of nice, right? Eight pixels at a time. And the problem is we need to compare it with the previous tile's pixels. So we shift a bit off the end.
We shove the previous one in and shift it along — although actually it's really a sort of crossbar switch that you permute to move things; there is no shift across AVX registers that does that — and then we just compare these guys. And that gives you a whole load of lanes that are either all ones or all zeroes. And then comes Caolán's magic trick. Well, yeah, there's AVX2, which is practically available, but AVX-512, which is not practically available, has a particular instruction that will compare the two things for you and give you that bit mask — and that's not available in AVX2. But if you look at what is available: if it were done on floats, then the instruction is basically there for you. So you cast it to floats, and this movemask thing pulls the top bits in and gives you what you were hoping for in the first place, which is an individual bit result for each pixel you've compared, whether they're equal or not. So you can basically pull the bits you're looking for out in no time. It's great. Which is pretty awesome. So, you know, you treat it as a floating point number and you take the sign out of it, and that's your RLE bit mask. The nice thing about this is there's no branch, there's no compare, there's nothing — it's a simple flat loop with about five instructions. At the end of that, we then have to work out how many pixels to copy, because it's all very well saying these are the same, but you need the individual copies of the differing pixels one after another. So a bit of a popcount will count the bits in the mask, and then with a clever lookup table we can use these shuffling instructions to shuffle in the things we need, copy them out, stack them up. Bingo: twice as fast, which is nice. And hopefully AVX-512 will make it even faster — if you believe that, you'll believe anything.
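The delta scheme just described — one bit per pixel meaning "differs from the previously rendered tile", followed by a popcount to know how many literal pixels to copy — can be sketched in portable scalar code. This is a simplified model, not Collabora's implementation: the row is 64 pixels wide so a single `u64` holds the mask (the real 256-pixel rows use four `u64`s, and the hot loop uses AVX2 intrinsics rather than this loop).

```rust
// Scalar sketch of the tile-delta bitmask: bit i set means pixel i in
// the row changed since the last rendered tile and must be sent.

fn encode_row(old: &[u32; 64], new: &[u32; 64]) -> (u64, Vec<u32>) {
    let mut mask = 0u64;
    for i in 0..64 {
        if old[i] != new[i] {
            mask |= 1u64 << i; // changed pixel -> bit set
        }
    }
    // count_ones() is the popcount: how many literal pixels follow
    // the mask in the encoded stream.
    let mut literals = Vec::with_capacity(mask.count_ones() as usize);
    for i in 0..64 {
        if mask & (1u64 << i) != 0 {
            literals.push(new[i]); // copy out only the changed pixels
        }
    }
    (mask, literals)
}

fn main() {
    let old = [0xFF00_00FFu32; 64];
    let mut new = old;
    new[3] = 0x00FF_00FF; // two pixels changed in this row
    new[10] = 0xFFFF_FFFF;
    let (mask, literals) = encode_row(&old, &new);
    println!("mask = {:#018x}, {} changed pixels", mask, literals.len());
}
```

Decoding is the mirror image: walk the mask, take a literal where a bit is set, and reuse the old pixel otherwise — which is why an unchanged tile costs almost nothing on the wire.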
So yes, here we go. So this was a real problem here — if only we could find the idiot responsible for it. We don't need to say. Yeah, what's sometimes interesting is that, while I said earlier narrower was better, sometimes wider is better, in the sense that when you look at the flame graph, individual threads should all be positioned separately; they shouldn't be combined with the main thread. So if you're not seeing work that you expect to see happening in a thread — on the left-hand side, basically, of your flame graph — then it means the threading isn't being used. And so it became apparent that while there's this code that attempts to do the threading for this previous-delta stuff, the threads didn't actually exist, and there was a flaw that needed to be sorted. When you fix the flaw in the threading and bring it back in, you then see on the far left-hand side — because it's rooted in the threading area — all that work placed separately in the flame graph, and while it's wider, it now means it's operating in a separate thread and you've made progress. So it's nice to get twice as fast, and then four times as fast on top of it. That's the right sort of approach. Yeah, I think we're going to skip through some of these because we're running out of time. But: working out where to do the work, either in the browser or not, and premultiplying — and the stupidity of the web in having an RGBA, un-premultiplied-alpha API when it's almost certainly going to be premultiplied underneath the hood. Yeah, underneath the hood, all the hardware, everything is doing premultiplied alpha because it's so much quicker. You can see the complaints online from people pushing RGBA into the canvas and getting something out that isn't the same, because it's been premultiplied and then un-premultiplied. Anyway, there you go. The web APIs are awesome. What else?
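The canvas complaint above comes down to simple integer arithmetic, and a small sketch shows why the round trip is lossy. The rounding conventions below are one common choice, not necessarily what any particular browser does: premultiplication maps `c` to roughly `c*a/255` in 8 bits, so at low alpha many distinct source values collapse onto the same premultiplied value and cannot be recovered.

```rust
// Why pushing un-premultiplied RGBA through a premultiplying canvas is
// lossy: premultiply rounds c*a/255 to an 8-bit integer, destroying
// information whenever alpha is small.

fn premultiply(c: u8, a: u8) -> u8 {
    // Rounded c * a / 255.
    ((c as u32 * a as u32 + 127) / 255) as u8
}

fn unpremultiply(c: u8, a: u8) -> u8 {
    if a == 0 {
        0
    } else {
        // Rounded c * 255 / a, clamped to the 8-bit range.
        ((c as u32 * 255 + a as u32 / 2) / a as u32).min(255) as u8
    }
}

fn main() {
    let (r, a) = (200u8, 3u8); // a nearly transparent pixel
    let pm = premultiply(r, a); // collapses to a tiny value
    let back = unpremultiply(pm, a); // cannot recover the original 200
    println!("r={} premultiplied={} round-tripped={}", r, pm, back);
}
```

With alpha 3 there are only four representable premultiplied values, so almost every source value comes back changed — exactly the surprise people hit with the un-premultiplied canvas API.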
What should be on your profile? Well, it's very hard to know. This could be okay. Here's a whole lot of un-premultiplication here — it's a very old profile. But hey, there's a lot of rendering on the profile, not very much painting, lots of delta-ing, so we fixed that. But actually it's very hard to know if this is good or bad just by looking at it. With lots of bogus invalidations, you start to see lots of rendering, and that's not what you want. So everything should shrink, and you'll end up with a profile that looks the same, but everything feels much quicker. So we've done lots of work to shrink, I guess. Caolán, do you want to pick a couple of these now? Yeah, just as you mentioned, with the multi-user document tests we basically monitor what's happening. People are joining documents, and we got that full-document invalidation we mentioned; clicking in headers and footers was causing the same thing. I think, fundamentally, because invalidations and redrawing on the desktop have become so cheap — while in the past, the very distant past, we might have been pretty good at keeping invalidations down — we've become slack in recent decades and treated them as cheap, and that has affected things. So let's have a look at that again and bring things down to smaller rendering areas and fewer invalidations. Yeah, and the good news is that improves LibreOffice as well, of course: it's more efficient and clean on your PC underneath too. So, good. We've done lots of better latency hiding in terms of more aggressive prefetching, so the next slide is there before you switch to it — it's absolutely instant. Hiding latency in those ways is quite fun: enlarging the area around the view and maintaining that as tiles, and just storing and managing much more compressed tile data in the clients, which we handle much better now. This is a fun one, but we don't have much time for it.
Yeah, well, classically, std::list in C++ was always a linked list, and if you wanted to get the size of it, you had to walk the entire list from start to finish. That was sorted out decades ago, but for whatever reason, for compatibility purposes, if you use the particular Red Hat developer toolchain, then you seem to get the classic std::list behavior back again. So where we were assuming it was cheap and cheerful to get the length of a std::list, that turns out not to be the case in this particular situation. So you have to go back to a different approach, and it appears in your profile like that. But again, it looks normal that it should take some time to draw things, and it's normal to have a cache to speed that up — but if the cache has got 20,000 items in it and you're just walking this list, pointer-chasing... anyway, so, gone. Oh, fun stuff. Like, why not have a massive virtual device in the background that you render the whole document to every time you do something? Not great. Or another one: why not run a benchmark every time you start the document to see how fast rendering is, allocating a whole load of memory and dirtying it, you know? Great. Yeah, trying to cache images: we didn't bother caching compressed images, because they're compressed, right? So why bother — they're small, they're fine to keep in memory. Except TIFFs are not so much compressed, so you eventually have a whole massive chunk of memory there. Using glibc trimming functions on idle to reduce memory usage. Yeah, trying to get better measurements of various things. This is a fun one — oh, this is the smaps one. Yes, yes: we're reading /proc/smaps to see how much memory we're using, and the classic smaps has got multiple entries in it for the many, many parts of your process, so you read multiple lines. But there's a relatively new one that has it all pre-totalled for you.
/proc/smaps_rollup, which is exactly what we want. The same code that reads the previous one should work with the new one. Then apparently we're running out of memory — or it's being reported that we're running out of memory — and it's all very, very bizarre. You can cat /proc/smaps_rollup yourself and the numbers are good. There's something very odd, but it turns out that if you seek back to the beginning and then read again, the numbers double every time you do this. There's an actual bug in the original implementation. It's not there in my version 6 kernel, but it is there on 4.18 or 4.16, which the servers were deployed on — so you have to be on just the right version for it to appear. So Linus fixed it, thank God. We found it — well, it was fixed before we found it. But it's always nice to know you have to check your kernel is the right — you know, is a quality kernel — before you start asking it how much memory it's using. Yeah, hunspell in the loop was almost entirely dominated not by actually spell-checking things, but by looking at the time. So yeah, some improvements there. And lots of other things; graphs showing speedups. We've got to get to usability in the last minute, so let me whizz through this. Here we go. Accessibility, dark mode, pretty pictures — this is going to be fast. Keyboard accelerators. This is all of the good stuff for people. Screen reading and all sorts of nice things; videos of that. Better page navigators at the side so you can see where you're going. And lots of just little bits of usability polish: nice font previews. Was this your page number thing? I forget who did that — making it easier to insert page numbers so people can see what's going on easily. Better change tracking and showing of changes, AI stuff, and hey — the good news is there's more opportunity for performance improvement.
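Circling back to the smaps_rollup detail for a moment: the appeal of that file is that, unlike classic /proc/&lt;pid&gt;/smaps, each field appears once, already summed over all mappings, so a parser is a few lines. The sketch below parses a hard-coded sample in the rollup format so it runs anywhere; on a real Linux system you would read the actual file — and, given the seek bug described above, re-open it each time rather than seeking back and re-reading.

```rust
// Minimal parser for /proc/<pid>/smaps_rollup-style text. The sample
// input is hard-coded so the example is self-contained; field names
// follow the real file ("Rss:", "Pss:", values in kB).

fn field_kb(rollup: &str, field: &str) -> Option<u64> {
    rollup.lines().find_map(|line| {
        // Lines look like "Rss:            1234 kB".
        let rest = line.strip_prefix(field)?;
        rest.trim().trim_end_matches("kB").trim().parse().ok()
    })
}

fn main() {
    let sample = "Rss:            1234 kB\n\
                  Pss:             987 kB\n\
                  Swap:              0 kB\n";
    println!("rss = {:?} kB", field_kb(sample, "Rss:"));
    println!("pss = {:?} kB", field_kb(sample, "Pss:"));
}
```

A parser written for classic smaps (summing repeated fields across mappings) keeps working on this format too, since summing a single occurrence is a no-op — which is why the talk could reuse the same reading code.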
So we're still having fun. You know, hey, come join us — there are some cool profiles to read. Right. At the moment in Calc, when you're typing, the entire row invalidates beyond the right-hand side of where you're actually typing. We brought that down to the cell in the most generic case, but it's not done for Writer: in the Writer case, if you're typing, we are invalidating all the way to the right-hand side of the screen, so we'll shrink that back down again. We have some new metrics, included in that debugging overlay thing, that give you an indication of how much of the updates coming through are the same data as before the update — and the numbers are staggeringly high. So there's plenty of room for improvement: invalidate less, send less data down. So, what do we have left to fix? Yeah — a thing that's always been troublesome in LibreOffice is the treatment of the alpha layer. We picked the wrong direction from everybody else: everybody else picks transparency and we picked opacity, or vice versa — we have the opposite direction. So anyone who wants to actually output something in the real world that handles transparency — we have to reverse our transparency. That's problematic. That's now fixed. That one is fixed. But we've also kept our transparency layer in a separate buffer rather than in the actual bitmap, and if we put them together someday, that would make things a lot easier, I believe. Yeah, it's the Windows 16-bit API decisions that are still with us — but anyway, we're getting rid of them quickly. That's great. Um, yeah — performance regression testing with Valgrind, pipelined loading. So at the moment — oh, we've got five minutes. Oh, look at that. Fantastic. I went too quickly. No, you're doing fine. Okay, right. Fine. Excellent. I think we're nearly at the end. Um, so — pipelined loading.
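The transparency-versus-opacity mismatch is easy to state in code. A minimal sketch (function names invented for illustration): in an 8-bit channel the two conventions are exact complements, so bridging one to the other costs a full pass over the pixel plane at every interchange boundary — the per-pixel reversal that the fix described above removes:

```cpp
#include <cstdint>
#include <vector>

// In an 8-bit channel, transparency and opacity (alpha) are complements:
// fully transparent = transparency 255 = alpha 0, and vice versa.
inline std::uint8_t opacity_from_transparency(std::uint8_t t) { return 255 - t; }

// Converting a whole transparency plane to the alpha convention the rest of
// the graphics world expects means touching every byte of the buffer.
void transparency_plane_to_alpha(std::vector<std::uint8_t>& plane)
{
    for (std::uint8_t& v : plane)
        v = 255 - v;
}
```

Storing the channel in the common alpha convention from the start, and in the same buffer as the bitmap, is what makes the conversion pass (and the separate-buffer bookkeeping) unnecessary.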
So at the moment we essentially fetch a webpage — that passes all the credentials we need to check — we load lots of JavaScript, we open a WebSocket, and only then do we actually see if we can load the document and start checking who the user is. This is really foolish. Instead, on first start, we can be checking the user, downloading the document, even loading the document, ready, before we get the WebSocket — and then have a pre-rendered version. So this is very substantially reducing startup time, making it incredibly quick. You already have a huge advantage in that you have a real server at the back end, and you're not having to JIT millions of lines of code in your browser, from JavaScript or WebAssembly into something. So it should be just amazingly fast, and this is a great way to speed it up even further. And with a real server — you may be on a time share, but when you arrive at your server it's probably not doing much; in fact, the CPU cost on most of our servers is extremely low. So suddenly there are all these threads ready to render your document and get stuff to you quickly. Some good things. And Valgrind: we've done a whole lot of work to get it to run nicely under Valgrind — with our privilege model and container model that's a bit of a problem — so we have some code now that turns everything into one process, so you can load and collaborate on one document, and automate that, but run it in Valgrind. And why would you want to do performance profiling in Valgrind? It seems retro, right? But the beautiful thing about Valgrind is the simulated CPU: anybody can run the same workload on their machine, and between two runs it's the same thing. And Valgrind luckily doesn't have a simulated thermal management system that randomly throttles your CPU performance.
And it luckily doesn't have people screwing with your cache memory, running cron jobs in the background, thermally recalibrating your disk and all this other stuff. So what you discover is that between two identical commits you're getting small fractions of a percent difference in the Valgrind numbers, which is beautiful, because performance tends not to go away in big jumps. It can go in big jumps, but it tends to go slowly downhill — and if the noise is bigger than the slow downhill, you've no idea where the problem is. So it's much better to have a little series of steps going down, half a percent at a time, and go: hey, we got rid of that, and that — did you realize? So this is really vital. LibreOffice uses this in its perf automation; it has beautiful web pages with graphs. And we'll be applying it to Collabora Online, to try and avoid regressions. Yeah — someday soon. Someday soon. Yeah — Neil, Neil Lazzone, we think, probably. Anyway, anything else? No, I think we've covered plenty. Well — and yes, of course, we can't do anything without our partners and customers that pay for it all; blah, blah, commercial plug. Good. Yes. That's good. Job done. And conclusions. So: computers are unbelievably fast. This is something you should take home. The quarter of a nanosecond per cycle that your four-gigahertz processor takes is just unbelievable on the scale of the hundred milliseconds plus it takes you to blink your eye. It's fantastically speedy, in a way you can't explain. And the network latency to almost anywhere — you can go three times London to Frankfurt and back in the time you can blink, right? It's unbelievably fast. In fact, you can go Frankfurt to Milan faster than your monitor can refresh. So it's quite amazing when you start looking at the times of things.
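The closing arithmetic is worth making concrete. The 4 GHz clock and the ~100 ms blink come straight from the talk; packaging them as constants is just for illustration:

```cpp
#include <cstdint>

// Back-of-the-envelope numbers from the talk: a 4 GHz core takes 0.25 ns per
// cycle, and an eye blink is on the order of 100 ms.
constexpr double kClockHz      = 4.0e9;   // 4 GHz
constexpr double kBlinkSeconds = 0.100;   // ~100 ms

constexpr double ns_per_cycle()
{
    return 1.0e9 / kClockHz;              // 0.25 ns per cycle
}

constexpr std::uint64_t cycles_per_blink()
{
    // 4e9 cycles/s * 0.1 s = 400 million cycles while you blink.
    return static_cast<std::uint64_t>(kClockHz * kBlinkSeconds);
}
```

Four hundred million cycles per blink is the scale that makes "the CPU is never the excuse" a reasonable working assumption.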
Um, our architecture is really a bet on CPUs and networks getting faster and cheaper. Has anyone noticed a trend there? I think there might be something in that. And we're basically racing the hardware guys. I mean, we do stupid stuff, obviously, and then we remove it later — but the hardware people are also trying to beat us, to run stupid stuff quicker. That's their mission. And it's extremely smooth — don't get the feeling that it's bad. Try it. Most of these problems you'll only start to see when you have 20-plus people collaboratively editing a document. So, yeah, it's kind of cool. So give it a try — try the latest version, give us some feedback, get involved. There's lots and lots of fun to get involved with. I'd like to point you at two things. As I mentioned earlier, the profiles that we have for Calc and Writer are uploaded to GitHub once a week: a generic Calc performance profile and a generic Writer performance profile. Search the online GitHub issues and you can see all of the charts we've mentioned there in the past, and you can even see the progress there — and the occasional blip, during a call where things go horrifically wrong and get sorted out in the next one. So, yeah, plenty to see of what we're doing. There are some links in the slide — which you can't see — to the profiles; get involved in the LibreOffice technology. Thank you. That's it. You've been very patient. Thank you.
Document collaboration made simpler: Revealing the concept of rooms in ONLYOFFICE DocSpace
So, we're going to be presented the rooms in ONLYOFFICE DocSpace, and it's going to be presented by Alex. — Hello, hello everyone. My name is Alex, I'm with ONLYOFFICE, and this is my second time here at FOSDEM in Brussels. Thank you a lot, guys — everything is brilliant. Today I'm going to take you through the challenges of document collaboration and how ONLYOFFICE can help you overcome them. I will also be introducing you to some existing features in ONLYOFFICE and some updates. So let's get started. In today's world it is very important to work online, and if teams have a lot of documents, things can get messy very fast. So for us developers it is important to create — and for users to pick — a good solution for organizing online document collaboration. We at ONLYOFFICE have more than 200 unique integrations, and we have a long, long list of requirements from the people who are trying to integrate our solutions into their services; that's why we are absolutely sure that we know almost everything about integration. So, wondering what the most interesting cases are — here are our points. First, we want to make sure that we save your time and effort by automating the everyday routine. Having a lot of features is very important, and it should be easy to add new features if needed. The next thing to consider is support for all popular file formats. If software is built on up-to-date technologies, it will most likely be reliable, suit user needs and even have some killer features — like we do. And of course security will always remain one of the most important questions for everyone who works on their documents online.
Cross-platform apps give us the ability to work on any document, in any browser, from any device, and remote workers need to be able to work together — and all these challenges can be difficult to overcome without the right tools, but ONLYOFFICE can ensure effective teamwork. Talking about usability: we want to provide a great end-user experience. If software is easy to use, it boosts your productivity. And accessibility — how easy it is for people of all abilities to use your product. If software has lots of bug fixes and updates, it indicates that it is well maintained, and here a big thank-you to our community, which plays a significant role, sending us information about bugs, troubles and of course feedback. And last but definitely not least is availability: we are constantly increasing the number of distribution forms, or builds, of our products. Considering all these factors, we at ONLYOFFICE decided to create a new product, ONLYOFFICE DocSpace — a product for organizing secure online document collaboration. Actually, today is not the first time we are talking about this product: I was talking about it at FOSSASIA in 2023, but there was only a beta version available, and now we have a ready-to-go product that can be integrated — and is already integrated — into many well-known services. So before we dive into each factor, I would like to share the history behind the idea. The journey started in 2021, when we decided to completely rewrite our productivity platform — by that platform I mean the package with CRM, project management, mail client and many, many other features. The main idea was to implement infinite scalability, and in that same year we released the free-for-personal-use App Server, which made it faster, more stable and more functional — and the idea shifted slightly. We decided not only to rewrite the architecture but also to change the mechanisms of working with documents, and in 2023 we released ONLYOFFICE DocSpace.
The main point for everyone who tries to integrate our solutions into their services is that they can benefit from our extended experience in integration. Many office solutions and productivity platforms, when working with files, create a mess of unstructured files, folders and subfolders — but with ONLYOFFICE DocSpace you are able to create rooms, DocSpace rooms, which allow you to clearly structure your files depending on your requirements and project goals. And when you have a long to-do list every day, it is smart not to waste much time on everyday routine like creating and sharing files. In ONLYOFFICE there is no need to work with each file individually: you can create a room, and all files within that room will be available according to the access level of the room. There are a few types of rooms at the moment. Let's start with collaboration rooms. These rooms are perfect for those who want to work on documents together, to co-edit their documents. Here you can make use of all the beautiful co-editing features of ONLYOFFICE software: commenting, mentioning, track changes, revision control and many, many other features. We have built-in chat and Telegram plugins right within the editors to communicate, and we also allow audio and video calls using plugins for Zoom, Jitsi and Rainbow. When inviting a user into your room you are able to set the access level: it may be administrator with full rights, power user with extended rights, or editor or viewer. The next type is public rooms. You can invite anyone using public links — and, what's very important, there is no need to register anywhere — and you can generate multiple links with different access rights. You are also able to apply a password for all files in the room; or, for example, restricting copying of the content of a file is available here, or downloading and printing can be disabled.
Yeah, and there are also custom rooms that allow you to apply your own settings for any custom purpose you have in mind. Again, everything depends on your use case: you can create a room for requesting form filling, for requesting commenting, or for document reviewing. ONLYOFFICE DocSpace includes different viewers and editors for all file types. Let's start with digital forms. Here you are able to work with your forms and your form templates in DocXF or PDF format. These PDF forms can be filled in, can be shared with anyone for filling, or you can work with files created by alternative applications, of course. You can easily view, create or edit text documents — I hope you are aware that ONLYOFFICE works with almost all text file types. Office Open XML files are supported, but if you'd like, you're welcome to work with Open Document Format, RTF, TXT, HTML or anything else. The same for spreadsheets, where you can work with your sheets and use more than 400 different formulas and functions. You are also able to create slides using a variety of animations, transitions and objects. And now to PDF again. PDF is widely used in document workflows, from meeting brochures to contracts for signing — and now we have a PDF editor. You are able to annotate your PDF files using the ONLYOFFICE editors. You can convert Office Open XML to PDF and vice versa: you can convert PDF files to Office Open XML to edit them. Additionally, you are able to work with electronic books, which can be converted. The next feature is integrated media players for working with images, video and audio files. The functionality of the described solution can be extended using plugins, and AI integration in the form of a ChatGPT plugin: you can work with your text, make simple requests, generate keywords or images.
Everything depends on your ChatGPT license: if you do have a ChatGPT license, you are able to work in the editors with the paid functions; if not, just work with the free version. ONLYOFFICE DocSpace is created using up-to-date technologies: we use .NET Core to ensure a reliable backend, and for the frontend we use React, to make sure that everything is mobile-friendly. ONLYOFFICE DocSpace is a safe way to handle your documents. We follow GDPR and HIPAA rules, treating your personal information very carefully. With flexible permissions and JWT you have complete control over your files, and you are also able to add password protection or watermark everything you have in your room. For data in transit we use HTTPS, of course, and for data at rest we use industry-leading AES-256. Moreover, administrators can enable additional settings like trusted mail domain configuration, session lifetime configuration, two-factor authentication or single sign-on, to have control over the login procedure. And of course backups and recovery are also here. I'm glad to inform you that in the middle of 2023 we included ONLYOFFICE DocSpace in our main HackerOne program. We received a few reports, and all these reports have been fixed in a timely manner — so thanks to the ethical hackers who work with us on HackerOne. ONLYOFFICE DocSpace is primarily designed for web-based operation, and we understand the importance of using it on mobile devices: ONLYOFFICE DocSpace offers a user-friendly interface with interactive navigation between rooms and settings, for example. And we conducted several usability tests with more than 200 people from different countries and different industries; according to their feedback, ONLYOFFICE has an overall usability score of 4.1, and the main advantages were simplicity, clarity and a modern interface. Of course, you can customize the product.
You can change the space name and URL of the DocSpace portal, you can change the color scheme, and to support your corporate style you are able to use your own logo or change the welcome page. As for accessibility, ONLYOFFICE DocSpace and the ONLYOFFICE editors are designed to accommodate users with special needs. There are a few options like screen readers and hotkeys, but we also support different plugins, like voice-to-text, text-to-voice or translation, for example — there are a lot of different plugins. For developers and integrators, ONLYOFFICE provides the ability to extend the functionality, and here you can find information about our plugin SDK. You are welcome to create your own plugins, and we have a few plugin samples on our GitHub page. For example, a PDF converter allows you to convert your PDF files to Office Open XML and vice versa, as I said already. The next one is the draw.io plugin, for working with professional-looking diagrams. There is a plugin available that converts your audio and video files to text, which of course you store in your rooms. The OpenAPI documentation shows you how to integrate ONLYOFFICE DocSpace rooms into your product, to give the visitors of your website the ability to view and interact with documents right on your web page. DocSpace rooms can be integrated into your service as an iframe — the same as we already have with ONLYOFFICE Docs, just an iframe — and of course the data display settings can be configured. And now the main point: there are a lot of services without document management functionality, without document editing functionality, or without any cloud storage functionality — and ONLYOFFICE DocSpace allows you to add everything you want here. All these features are available, and it can be integrated into a CRM, a CMS, any messenger — and the next example is one of the most popular collaboration solutions on today's market.
Just an example. So I'm glad to say that we have the ONLYOFFICE DocSpace for Zoom integration: just go to the Zoom marketplace and look for ONLYOFFICE DocSpace. You will be able to install DocSpace for working on your documents right within the Zoom meeting — no additional actions like registration are required. I think everyone can remember someone sharing a document in a Zoom session and saying, okay, let's write it down together. But with ONLYOFFICE there is no need to share your document with anyone or to give someone access to your screen to work on documents together — just use the ONLYOFFICE DocSpace application. The same for WordPress: ONLYOFFICE DocSpace can be integrated into your WordPress pages. These are just two examples that show that our product can be integrated into any service. In 2023 we released ONLYOFFICE DocSpace 1.0 and ONLYOFFICE DocSpace 2.0, with more than 50 new features — for example, public rooms are available now, a right-to-left interface is in beta, and system plugins are supported. And for the online editors, as part of the DocSpace platform, we delivered three major updates with more than 200 bug fixes and about 200 new features. And the latest version is available now — released just a few days ago. In the latest release, ONLYOFFICE Docs 8.0, we have a few very important features: again, fillable PDF forms — we have a PDF editor right now; the interface for plugins has been updated; and we have the long-awaited right-to-left interface. That's very important, and we understand that we have a long, long way to go with that functionality — right-to-left, I mean — which is why we are looking forward to your feedback about it; we really need feedback from our clients and integrators. The next point is our performance optimizations. Here you can see some numbers: we have moved some portions of the service to the client side.
And I think this will definitely add some more points to the ONLYOFFICE editors. And thanks to our partners from ownCloud, we now have load-testing results for 100,000 simultaneous connections — and by 100,000 simultaneous connections I mean that all these connections are active, sending information from the client to the server. You can see the details of the infrastructure: everything is in Kubernetes, 12 big machines for the document server and two big machines for k6, just to generate that huge traffic. ONLYOFFICE DocSpace will soon include private rooms, we are also going to implement electronic signatures, and there are more features that we plan to add. ONLYOFFICE DocSpace can be used as a cloud solution — just look for ONLYOFFICE DocSpace — and, if you'd like, you are welcome to install it on-premises, in Kubernetes or any type of deployment: try it in the cloud, or try downloading the server version. So, thanks a lot for your attention. If you have any questions, we'll be happy to assist — we are here, the two guys in the ONLYOFFICE t-shirts. Yeah. Thank you. — Thank you very much. Are there any questions? — Yes, a question about interfacing between types of documents and worksheets, because documents and spreadsheet formats, Writer and so on, have problems with this — with Google, going from Google Docs to Google Sheets doesn't work. Do you have this kind of functionality? And also, it was very interesting for me — you can gain a lot of time converting speech to a document. So, a first question about whether there is an integration between sheets and documents, and a second one about converting speech to a document. — On the first question: yes, we do have plugins that add that functionality right within the editors. You can work with that, but you need to install an extra plugin. — And what about the integration between the two types?
Yeah, the two types of editors — yeah, I see. But as far as I understand, you are working with XWiki right now? — No. No. — Yeah, it's a fair question. And no — I mean the product, I mean the product. So maybe you just have one of the previous versions of ONLYOFFICE, for example — and that was your question about the interface. I think you will be able to find a solution in one of the next versions of ONLYOFFICE. — And what about the integration between these two types of editors? — Yes, we do have that, again, in the latest versions of ONLYOFFICE; for example, try working with ONLYOFFICE Docs 8. — Great. Any other question? There was somebody there, I think. No? Yes? No. Okay, thank you very much.
openDesk - The Open Source collaborative suite
Okay, so welcome to the talk about openDesk, the open source collaborative suite, presented by Clément Aubert from XWiki and Wieland Lindenthal from OpenProject. Enjoy. — Hello. Hello, everybody. Thanks for coming. The funny thing is, we are not openDesk — we're just vendors, we're just contributing to openDesk; but we'll come to this later. Yeah — openDesk, what is that? It is the idea of building an alternative to Microsoft Office 365 or to Google Apps for the public sector. They are used everywhere, and the public sector wants to have an alternative to that. So if you really want to go after the big elephant that is not in the room — we're trying to create an alternative to that. So it's probably the biggest opportunity for open source software right now, let's say at least in the realm of collaboration and working together. openDesk is a powerful initiative of the German government with the goal of providing a serious alternative to the proprietary Big Tech establishment. It unites independent open source software vendors to create a sovereign workplace tailored for the public sector. We two here, we are just nerds, just software developers. I work for, and co-founded, OpenProject — that is one of the parts of the solution that we will talk about. But I'm a software engineer; I'm not openDesk. And this is Clément. — Hello. Well, you maybe have seen me in this room this morning. My name is Clément Aubert. I'm an XWiki committer and I also work at XWiki SAS, doing sales mainly. Okay. So let's start by discussing a little bit the story of openDesk and how we got there, essentially. The issue with collaborative suites goes back a long way, right? Since 2015, especially in Western EU — and when I talk about Western EU, I mean mainly the French and German governments, because from our point of view, that's where we have the most information, let's say.
Since 2015, what we see is growing concern about the US-based cloud offerings that exist for collaborative suites. These concerns are mainly about the fact that you don't necessarily have control over your data: you don't know exactly where it is stored or how it is processed. There are privacy risks in particular — if you are putting sensitive data there, it could be accessed in different ways. In particular, since 2018 there is an extraterritorial law in the US that allows the government to ask a company for access to customer data, even when the customer is not a US citizen — usually to get more information about that customer. And then there is the big question of lock-in: when you migrate your data, how easy is it to move it back? Are there open standards that let you go the other way around? These concerns exist, and so, since the end of the 2010s and the beginning of the 2020s, France and Germany have started to create rules for the handling of critical data in their governments. In France in particular, there is an initiative called Cloud Nubo and Cloud Pi — essentially two cloud specifications used for public administrations: one for, let's say, conventional data and another for more sensitive data. Germany also started another initiative, the Deutsche Verwaltungscloud-Strategie, which is, let's say, kind of the same: it creates a standard to protect the data used by public administrations from external actors. To be clear, these are essentially infrastructures, or definitions of infrastructure, to be implemented by the state so that, in the long run, states have the capability to host data securely themselves.
In the meantime, there is the question of security certifications — basically making sure that the different vendors providing a specific service for you have the necessary level of security to provide that service. And so, in the same spirit, two standards were created over the past years. In France we have SecNumCloud. SecNumCloud was created by ANSSI and is basically derived from another security standard, ISO 27001 — if you do security, you may know it, because it's well known. What SecNumCloud does is take most of the rules from ISO 27001 but also add controls on the nationality and the location of the people who are allowed to process the data. The whole goal of SecNumCloud is to protect against extraterritorial laws, in particular the US CLOUD Act that I mentioned. In Germany there is also another certification, BSI C5, and we will talk about it afterwards because it's quite important. The goal is basically to be able to qualify a specific application, or a cloud offering, that you want to deploy for use by public institutions. The C5 certification is a little bit different in the sense that it's not about extraterritoriality at this point; it is more about making sure that the application is correctly developed: there is a good standard of quality for the changes you apply to your application, you validate its compliance, etc. In the long term, for the EU, there is a vision to create one unified standard, mainly based on SecNumCloud and C5 — which could be the EUCS — that would encapsulate both and basically allow any vendor qualifying under this standard to be deployed in the different governmental public administrations in the EU. So that's very nice.
We are actually introducing new laws that allow us to control what can be used by public administrations, but in the meantime we may not be creating the solutions themselves. That's another issue that has been tackled by France and Germany over the past few years. In France, around 2021-2022, a project was started, pushed by the DGE — the Direction Générale des Entreprises, a branch of the finance ministry. The goal was to create different consortiums that would lead the creation of an alternative suite to things such as Office 365 or Google Workspace — the ones that Wieland talked about. It's a project that started in 2023; the total is around 23 million invested by the state in the three consortiums, and the idea is to have results by 2026. Now, we will not talk about them much, because they are actually not based on fully open source software, so it's not really in the scope of this talk. The one we are interested in here is the project from Germany, which is openDesk. The idea of openDesk is a really different approach. In Germany, the Ministry of the Interior — I will not say the name in German, because it's going to be a nightmare — decided in 2022 to create a consortium made of different actors; we'll see them afterwards. But essentially Dataport, which is a big service provider for public administrations in Germany, as well as a couple of software vendors providing open source software. And the Ministry decided to group them together to create a platform which is coherent, fully open source, and can basically last longer and be maintained over time. To give you an idea: in France, the financing for the three consortiums I mentioned is around 23 million, knowing that part of it is a loan, so you have to reimburse it.
And then in Germany, these are basically orders — it's not a loan — and the budget in 2023 was 23 million, so a little bit more budget in effect. So, going into the details of openDesk: the project was initially started by the German Ministry of the Interior. Today it has been handed over to ZenDiS, which is also a public organization, created to handle, let's say, open source projects created by the federal state. And so, as I mentioned, the project is currently co-managed by multiple actors: the Bundesministerium, the Ministry of the Interior, the BMI; PwC, which is helping to find use cases — basically the correct user stories, the issues we are trying to solve with our collaborative workplace; Dataport, which I mentioned, which is providing hosting for the project and is also doing a lot of work to organize the different vendors, all working together to create a unified workplace and a product that works; and then we also have Bechtle, which is present to help with the financing of the project. As for the vendors: today, in the openDesk project, we have a little more than 500 people working across the different vendors, PwC, the BMI, Dataport and Bechtle. So it's quite a large project in the end. I will get to the names of the vendors afterwards, but this is just a quick view of what we're trying to achieve, right? Basically one solution, one workplace, that fills different needs.
We want to have email management, of course, you need to have emails; you also want to create events, so you will have calendar, contacts, and task management. We want to have the file management part, where we can create new files and collaborate on them. Then you also may want to carry on projects within your organization, so for that we have a project management tool and a knowledge base tool. And you also want to communicate with your co-workers, so there are modules for chat and video conferencing. The idea of OpenDesk is essentially that these modules should be made of solutions that can be switched easily. The ideal vision of OpenDesk would be that you have a software which is providing the email functionality, but let's say that tomorrow you want to switch it, you don't want to use the default software that is provided, then you should be able to do the switch fairly easily. So in practice today we are providing some sort of a default implementation, meaning that we have a couple of software products that correspond to each of these features, and we don't really have two options for, say, file management. Yes, that's a very important part, because the German state doesn't want to get back into vendor lock-in; that's the reason why it's talking about mail and not about a certain vendor. Exactly. So if we look a little bit more into the details: when it comes to everything related to email, agenda, contact management, calendar, and tasks, Open-Xchange is handling that big part of the project today.
When it comes to file management, it's mainly Nextcloud, and in Nextcloud, when you want to collaborate on different files, we have two external tools that have been integrated: Collabora, for which we had a talk a couple of minutes ago, and also CryptPad, which we had a talk about this morning. They are used to edit office files, and CryptPad is used for editing diagrams today. When it comes to communication, we have Element, based on Matrix, which is handling everything related to chats between teams, and there is a clever integration between Element and Jitsi, provided by Nordeck, to basically allow video conferencing: rooms within Element where you can start calls and chat on a specific subject. Then on the project management capabilities, there is us, so OpenProject as the project management tool, and XWiki for knowledge management. I would say that OpenProject and XWiki are kind of the latest additions to the project, right? We, at least XWiki, arrived at the end of 2022, so not that long ago. Finally, this whole portal that you see here is managed by Univention, which is another solution that allows to create user portals, and it's also handling access management, authentication, the user list, et cetera. Single sign-on. Yeah, single sign-on also. So today everything related to the hosting of the development of the project is managed by Dataport, so thanks to them. At some point we'll try to do a demo if we have internet, but let's go a little bit more into the technical details.
So today the architecture of OpenDesk is mainly based on Kubernetes, mainly because we are integrating a lot of different components, and at some point it was decided to use Kubernetes because it was the easiest way to integrate all this complexity into one big package. So we are using Helm charts and GitLab CI to deploy a Kubernetes cluster, and then basically each component of this cluster is one application managed by one vendor. Each vendor can provide either a generic Docker image that needs a little bit of configuration in order to work within the context of OpenDesk, or sometimes a vendor will provide a really tailored application, a custom Docker image that meets some specific requirements; and sometimes there are also other ways to deliver the applications. A quick look at the features I may not have completely talked about: of course it comes with a user directory managed by Univention. All the applications are connected to each other through OpenID Connect, so that's very easy in some way and very standard, and we also have unified navigation across the different applications; that's something I want to show you in the demo. And finally, the goal of the project is really to make sure that all these components that we are integrating are fully connected together. And that's going to be very important. I would say it's a work in progress, but we'll see in the later part of the talk that we have, for example, integrations between OpenProject and Nextcloud, which are very exciting to see. A little bit of a note also: when it comes to the distribution of OpenDesk, of course it's made of open source software, but the whole build itself, the whole project itself, is open source. And you can actually access it on OpenCode, which is a GitLab instance that is used by the German state to publish its open source projects.
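Since every application in the portal is wired together through OpenID Connect, each app ultimately receives the user's identity as a signed ID token, a JWT. As a rough illustration of what such tokens carry, here is a minimal Python sketch that decodes the claims segment of a JWT; the claim values are invented for the example, and a real OpenID Connect client must additionally verify the token signature against the provider's published keys, which this sketch deliberately skips.

```python
import base64
import json

def decode_jwt_claims(token: str) -> dict:
    """Decode the claims (payload) segment of a JWT: header.payload.signature.

    WARNING: this does NOT verify the signature -- a real OIDC client must.
    """
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))

# Example with a hand-built token (signature left as a dummy placeholder):
claims = {"iss": "https://portal.example", "sub": "alice", "aud": "openproject"}
fake_payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
fake_token = "eyJhbGciOiJSUzI1NiJ9." + fake_payload + ".sig"
print(decode_jwt_claims(fake_token)["sub"])  # -> alice
```

The `sub` claim is what lets each integrated application recognize the same user without keeping its own password database.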
So there you should find basically a mirror of all the source code used by all the components of OpenDesk. But you will also find dedicated repositories with all the Helm charts that are needed in order to deploy the platform. Another thing, in order to provide some security and compliance: every release is signed, we create software bills of materials (SBOMs), and we also audit the licenses of the different components that are being integrated within the workplace in order to make sure that we are really completely free and open source. The idea is essentially to have something that will work for the German administration, and working for the German administration means being BSI C5 compliant. So as part of the project, we also do a little bit of work to help each component match the certification in terms of quality of development, for example. Finally, and maybe we'll talk a little bit about it later on, there is a big concern about accessibility. It's one of the things that open source software is most criticized for: it's nice, it has a lot of features, but maybe sometimes it's not fully accessible. As part of this project, we have to match some accessibility guidelines in order to be usable by the public sector. So this is also part of the final thing that we get from the project, along with security. On the offering side, the long-term goal for the project is essentially, today, to create offers for the public administration in Germany, mainly through two entities. One is ZenDiS, which we talked about, which is currently, let's say, coordinating the project at a very high level from the federal government perspective.
And also Dataport, which is participating in the project as part of the project management, but Dataport also has its own suite, which is basically a fork of OpenDesk with some extra components, not necessarily open source, that have been added or modified in order to answer some specific needs. So, yeah, today the offering is mainly for the German market, not really for the rest of Europe yet. So that's going to be a challenge for later. But there's interest from all over Europe in the product; the French government also is interested. Yeah, Austria, I think, and Sweden. So it's a big thing in Europe, and people are looking from all sides at that project. Okay, so let's try to do a quick demo. Let me see. Okay, so let's do it very quickly. Whoops. So this is an OpenDesk instance that we are using for review. I just want to show you very quickly the different applications that are available. So let's say that I want to go to my emails. We are fully dependent on the FOSDEM network, so I hope it's going to load. Okay, great. So for email, we are running on Open-Xchange. If you know Open-Xchange, you probably already recognize this interface. What you see is that all the applications have been customized so that they have a color theme that matches, that is unified, in order to have a nice user experience. I can use a user directory which is based on the users that are registered in my OpenDesk instance. In my email, I can potentially add files, and when I'm selecting files, I have the choice of uploading a file from my computer, but I can also link a file from my Nextcloud account. And if I do that, it will create a share automatically and make sure that whoever is receiving the email has the necessary access to see the file. This one, no. Okay, this one, thank you. Okay, so apart from that, in the email application, you will see that we have this little button here, and that's the transversal menu.
It allows us to switch from one application to another, and it will change depending on your access rights. So we can look at, for example, Nextcloud, integrated for file management. Here I can create my files, I can create spreadsheets, and in that case it will create a document that will be opened within Collabora, so I can edit it. We also have files that can be diagrams, which we can edit directly with CryptPad; for that we integrated CryptPad within OpenDesk for one specific functionality, and here we are actually using draw.io within CryptPad. We can also look at chats, in which case it's going to be a managed instance of Matrix with Element as the frontend, where I can have discussions with the other members of my OpenDesk instance, or potentially other members of the Matrix federation. And here what you see is that I'm actually part of a room which is used for a specific meeting within Matrix. These rooms can be created automatically: when I'm in the agenda of Open-Xchange, I create a new event and I say that I want to have a conference in Matrix, and it will create a link that will lead me to this room, where I have video conferencing. I knew that was a bad idea. Let's leave. And I have a whiteboard and I can also chat. Finally, we also have project management with OpenProject and knowledge management with XWiki. So here I can create my new project, I can create my work packages, create some milestones and link them together. I won't go too much into the details because I don't want to spoil you. And here we also have a customized XWiki instance. Today we are synchronizing users and rights; we don't have particular integrations with the other applications yet. So that's a very quick demo. And by the way, that is released, so you can download it, try it and play around with it. It's on OpenCode. It's open source.
So about the roadmap of OpenDesk: essentially the goal is to have a stable version this month. As you said, it's already released. The main issue, I would say, when it comes to deployment, if you want to try it out, is that today there is still a good part of the documentation which is only in German, and if you're not speaking German, it can be a little bit difficult. On the longer run, in 2024... We are trying very hard. Yeah, yeah. So contributions for translation are welcome, I guess. Exactly. You can do a pull request. And so the idea in 2024 is to have more improvements in order to improve the BSI C5 compliance. Remember, the goal is to deploy that within the German federal administration and also within some German Länder. So compliance with any standard that exists for the public sector is really important for the project. And it's good, because it also allows to improve the open source projects behind it, the ones being bundled in the platform. Yeah. So, I mean, we are software vendors like OpenProject or XWiki. For us, the perspective on that whole OpenDesk project is a little bit different, because we already have a product, we already have clients, we already have a roadmap. And then suddenly someone says, hey, we want to integrate you, but you should have the same look and feel, we want to have the single sign-on, we want you to finally come together and create deep integrations. So that is challenging for us, because usually we tend to stay in our own soup, because it's easier to build stuff in our own software. And integrations are complex: you need to organize, you need to find collaboration, like meetings with others, align roadmaps and priorities. That's difficult. And now suddenly someone from the outside comes and says, no, we want you guys to work together. And we will pay you, actually. Yes.
So by integrating multiple very different, deeply specialized applications, we multiply the value, instead of everyone brewing their own soup. So also for us it's a huge chance, because if, let's say, XWiki is integrating with us, then maybe their clients, which are also likely to use OpenProject, would also book OpenProject, like the professional services. So for us, it's a huge, huge opportunity. And with OpenDesk it comes like, okay, the German government also wants it to be easy to procure, so that not every little city needs to go through a tender process. So it will be much easier for a small city to book services from us, right? And with that, we can build better software and we can integrate better. So maybe some challenges, before we dive into how we create integrations. Some challenges that we see today: integration of the UI and the UX of the products. Of course that's difficult, because not all software is created equal when it comes to the capacity to customize it; sometimes it has not been thought out from the beginning. Yeah. Oh, sorry. So there is a big challenge on UI and UX. There is also the question of overlapping features. Sometimes we as vendors create features, like, I don't know, a task management feature in XWiki, which collides with OpenProject, or a wiki in OpenProject that collides with XWiki. Well, we have to find solutions for that, but usually we are civilized, so it's okay. One issue is also maintaining all these customizations that we are creating outside of the core of our products. Basically, we create an overlay that makes our application compatible with OpenDesk, but then, how do we get the financing for that? How do we maintain it?
And if it's really difficult to maintain across new versions, how do we do it? So far, we still need to find long-term solutions for this. Talking about integrations: of course, these systems don't exist only in OpenDesk. We exist outside OpenDesk as well, and there integration also makes sense, and those people might not have the whole OpenDesk infrastructure. So when we build integrations, we always want to build them in a way that is also suitable for other environments where the software runs separately. Exactly. And the last thing is the creation of offerings. We mentioned the fact that other EU countries are interested, apart from Germany. So that's going to be a challenge in the long term: finding ways to provide OpenDesk in the public sector, maybe for other actors, or even for the private sector. Yep. So. Yeah. To also go a little bit into an example: I always like to talk about integrations, because I think this is where the power lies in collaboration. I want to go into one example that I was working on with my team: the Nextcloud and OpenProject integration. This was already presented here last year, but I want to give a different point of view on it. So quickly: Nextcloud, for us here today, is mainly a file storage platform. And OpenProject is something like, let's say, Jira or something like that: you create and organize your work in work packages, issues, whatever you call them, and you can have them organized in Gantt charts or boards or whatever you need. So, okay. We have a file management environment and we have a project management environment. And the outside world, let's say the public sector, has a different perspective on that. They are saying: where are the files for my task? Two things in one sentence, right? Does everyone in my team have access? So we are in a project management system, we're organizing our work.
The files are in a different system. Access management, okay. The third problem is: I want to do the same thing over and over again, doing the same processes, having the same steps organized in projects, task by task by task. And I also want to have the same template files and the same folder structure, again and again. And these both need to go together. So what they don't want is that we, as OpenProject, build our own file management system, because Nextcloud is pretty good at that, and it's also integrated in the desktop and so on. People want to work on files in the Nextcloud experience. But also, Nextcloud might not be the best choice for organizing complex projects. Okay, it has Deck, right? But if you really want to go a bit more professional, probably OpenProject is a good idea. So the public sector doesn't want these tiny separate solutions; they want the integrated solution, right? And it's not only OpenDesk: other clients like the City of Cologne or the University of Duisburg-Essen or the Deutsche Bahn, they want the integration, they don't want separate solutions. For us, if we integrate, it's much easier to focus on project management, while, for example, Nextcloud can benefit from focusing on file management, and so on. So this integration creates great value in the combination. And it's also interesting, as already mentioned: if we all work together, and Nextcloud is just an example, then sales also becomes easier, because we all have clients that the others don't have yet. And by joining forces in the integration and in sales, together we can capture a bigger market, get more money, and build more open source software. Okay, a little example of how this looks. So this is OpenProject. You have a work package.
And on the right-hand side, you see files that are related to that task, which is baking pizza. I love baking pizza. And the interesting thing is, you can see the files that are necessary for this baking of pizza on the right-hand side, but the files are not in OpenProject; they are in Nextcloud. And in Nextcloud, when they change their name, the name will change here as well. If they change their location, these links will still work. So this deeply integrated referential integrity is what you need in order to get away from chaos. This is what organizations actually want: they want to get rid of chaos, they want to have control over this stuff. Access control. So for projects in OpenProject, you can have something that's called a project folder. We, OpenProject, create folders in Nextcloud for which we manage the access. So members of a project, this is the scope of a team, right? They need to have access to the stuff that they should have access to. So we say, okay, here in Nextcloud, these people have access to it, fully automatically managed. That helps people to keep the data where it belongs, the files where they belong. So: we are working on this project, here are the files, put them there in that folder, and don't put them anywhere else. And if you leave the company, they're still there, right? If you're in the organization, they're still there. And then on the Nextcloud side, also deeply integrated, we can show you which tasks or work packages are actually relevant for a file, or where this file is used. Let's say you have a template file for an employment contract, right? Where is this used, and in which contracts is that file used? You can find them on the right-hand side, directly jump into the work package in OpenProject and find the processes there. So the bottom line is: integrated, we are much, much stronger. Exactly. Okay. Thank you. Thank you. I think we have some time for questions. So... Yes, I'll do it.
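The "project folder" mechanism described above comes down to OpenProject driving Nextcloud's standard WebDAV endpoint. The exact calls the integration makes are internal to it, but as a rough sketch of the idea: a folder can be created in Nextcloud over WebDAV with an HTTP MKCOL request. The URL layout below follows Nextcloud's documented `remote.php/dav/files/<user>/<path>` scheme; the helper names and the example host are my own.

```python
import urllib.request
from urllib.parse import quote

def project_folder_url(base_url: str, user: str, project_name: str) -> str:
    """WebDAV URL for a (hypothetical) per-project folder in Nextcloud."""
    return f"{base_url}/remote.php/dav/files/{quote(user)}/{quote(project_name)}"

def create_project_folder(base_url: str, user: str, project_name: str,
                          auth_header: str) -> int:
    """Create the folder with an HTTP MKCOL request; 201 means created."""
    req = urllib.request.Request(
        project_folder_url(base_url, user, project_name),
        method="MKCOL",
        headers={"Authorization": auth_header},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    # Actually creating the folder requires a reachable Nextcloud instance
    # and valid credentials; here we only show the URL that would be used.
    print(project_folder_url("https://cloud.example.org", "alice", "Pizza Project"))
```

Managing group membership on such a folder, as the integration does, goes through Nextcloud's sharing APIs on top of this.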
We are going to have to figure out a way to answer that. Yeah. Oh, you're going to give him a question. Thanks for the talk. Are there any license requirements in order to integrate into the OpenDesk infrastructure? And the second question is: which vendor, so to speak, is the product owner of the dashboard and the top bar we saw, and what are the requirements for them, which all apps share? So, for the first question on the license requirements: there are some requirements. Basically, anything that you commit on OpenCode needs to match the list of licenses authorized by the German administration, by the admins of OpenCode. That list is not fully compatible with the one provided by the Open Source Initiative; it's a little bit shorter. We found that out the hard way: for example, if you have a software, you package it as a Docker image, and then when you have to upload the Docker image on OpenCode, you have to provide an SBOM for it with the licenses, and then you find out that in the base Docker image that you are depending on, there is a Perl library with a weird license header, and so it creates an exception, and you have like three months of review to make sure that it's okay to have that in OpenDesk. So it's a little bit of a mess. There is a list available on OpenCode for the licenses. When it comes to product ownership, I'm not 100% sure. I believe that the design of the navigation bar is managed by ZenDiS, which is handling the project at a high level, and ZenDiS has been helped by PwC, which is doing consulting and usability tests on top of OpenDesk. And the same goes for the portal, I believe. Thank you. Any other questions? And technically, the portal widget that you saw, the answer is Univention. Yeah, thank you again for the talk. I have two questions. One is very specific.
I'm from Lassen, Germany, and I've heard of Project Phoenix. You wrote Project, or dPhoenix. Is there some difference, or is it the same Project Phoenix? And the second question is... Is there a Dataport person over there? Okay, can I just phrase the second question? The second question is: if you're in the context of a company which does not have its own IT and so on, but would like to stick to open source software where you can switch vendors, are there vendors simply providing this setup, where you can get an account and use it for your company? There's the idea of the job of Dataport, being one of the potential hosts of that dPhoenix suite. So then you could get that product from there just by renting it. But I think they only offer the services to public administration. I'll try to answer it. I'm part of the Project Phoenix and also have a little insight into OpenDesk. The thing is, Phoenix is a branch of this OpenDesk, and what was your question exactly? They just wrote dPhoenix. Yeah, it's the same. There were some name changes on the way to the product. The "d" stood for Dataport; they dropped it, and now it's Phoenix, so it's basically the same product. Is that like the second generation? No, no, no, it's just a renaming, you know? And is there a possibility to just rent this somewhere, for small companies who don't run their own IT? Not to my knowledge, but if there is a high demand for that, it will be possible. They already do it for some customers, so maybe it's a question of strategy and how much this is asked for. It's Helm charts, right? So the idea is that it's easy for any host to host it. And then just to protect... Ah, there's Markus. Do you have a mic? I'm going to take the questions in order. But the idea is that it's easy to host it, right? It should be easy for any organization in the public sector to simply say: I have a data center, I just rented one somewhere, and I just pull up the Helm charts and off we go.
Okay, you mentioned that the German strategy was to put these 23 million in for the year 2023. So my question is, how does it go forward on the funding side? Does it depend on the German parliament? Is there some kind of guaranteed maintenance for this code, or who is taking care of the boring stuff, the security patches and all? I don't know. I don't know. So, the budget allocated to the project depends on what's being voted in the parliament. There are budget cuts nowadays, I think. I'm not 100% sure, but globally there is still budget for the project; it's about half of what we had last year. The budget repartition is another issue, right? It's basically around 30 million dedicated to OpenProject in 2024, to be validated. OpenDesk, sorry. And sorry, what was the second question again? Who's handling the long-term maintenance, the security patches? So the goal in the long term is essentially to find a business model, so that whenever you are deploying an OpenDesk, there is a team that is managing the packaging of OpenDesk, making sure that the Helm charts are up to date, et cetera. This team needs to find some funding, and the idea would be that if you take professional support, the team would get a part of the funding. And then the idea is also to redistribute this funding to the vendors themselves in some way. The specifics of this distribution are not fully defined yet, basically because we are right at the point where we have this default implementation of OpenDesk that is just going out. And now there is a second step of finding the first clients and making sure that it deploys properly. Okay, any other questions? I'm going to take some people that haven't spoken yet, just for distribution. Thank you. First of all, thanks for the talk. That's a really interesting project.
And I just wanted to ask if there is interest, at some point, in adding any repository tools or, for instance, pipelining CI/CD tools to the platform? I will repeat the question so that it's recorded, sorry: whether there's any integration of repository or pipelining tools. Right now, not yet. I personally think it makes perfect sense; I would very much welcome that. And I guess it's just knocking at the door of vendors and saying, hey, we want to have this. Hello. Hello. Okay. Yeah. So you said there is unified procurement, so you can buy licenses for all the different software, if you want, with professional support and so on. Is there also a single point of contact for support? If I want to self-host this and have some issues with any of the software in the suite, whom do I ask if I have problems? Is there someone who can help me, and do I have to know which software has the problem? And also, second question: what's your favorite pizza? Thank you. Okay. Good question. So the question is: is there central support for the whole product? And actually, I don't know that much. I think it's not defined yet. Yeah. Something that needs to be developed; it's part of what you said about the whole package, part of the discussion on the business model, basically. But I think it's more important to first build the software, have it integrated, and make it open source and available for everyone for free. And the second question, the favorite pizza. Oh yeah. There are many. Okay. Maybe one last question, because there are about 40 seconds left and then we have to go to the next talk. Let me go back there. Not a question, but a remark. Hi, I'm Renee. I'm with ZenDiS, for two days now, and I'll be happy to take any questions or feedback on OpenDesk with me. I'll be around to talk. Thank you. Okay. Okay. Really cool. Sure.
So my question is: there are huge parts of the software stack that are still, I think, vendor-tied, outside of just office software. So is there a way to get that software stack to be open source? For example, GitHub is one of the biggest ones; there is an alternative in GitLab. And operating systems, BIOS, hardware. I mean, I physically have problems understanding the question. So what's the question? Whether other software is going to be integrated, or? There is a huge part of the computer science ecosystem, going down from hardware all the way up to operating systems, the different layers. Different layers. Are there, you know, movements to free those? Yeah. So OpenDesk focuses on the desk, the working desk, the tools that you need on your machine in order to work together. That's the current scope. It's not in the scope to control the hardware or the operating system. That's a different story. Thank you. Thank you very much. Thank you.
Using Generative AI and Content Service Platforms together
Thank you very much. So our next speaker is Angel Borroy from Hyland, who is going to talk to us about using generative AI and content service platforms together. Thanks. Is it on? I was checking the microphone. Okay, yep. Welcome everyone. So this is another view on the same topic; we are going on the technical side now. It's not like a final feature for a product, but a framework to help you build all the features that we were seeing before, in the context of a content service platform or a document. Okay, so we are going to review a GenAI stack that we are going to use, which includes an LLM running on premise. We are going to review all the options, we are going to describe the features we can build with this stack, and then we are going to review how to integrate that with, in our case, because I'm working for Hyland and we are building an open source product with the name of Alfresco that is related to content management, that content management platform, and also take a quick look at the future. And obviously I need to include some AI picture, because it is what it is. Anyway, this GenAI stack that we are using includes mainly three components. The first one is Ollama. Ollama is a service that provides an API to interact with different LLMs. We are going to see the whole list later, but you can download your LLM on premise, and this layer provides the interaction with the LLM; you can even interact with different LLMs at the same time. The second one is Neo4j. Neo4j is the vector database: when you are using RAG, retrieval-augmented generation, and so on, you need to enrich the information available to the LLM, and you store all this information in this database. And finally we are using LangChain. LangChain is a framework to connect all these different elements.
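To make the role of the vector store concrete, here is a small Python sketch of the retrieval step of RAG against Ollama's embeddings endpoint (`/api/embeddings` on Ollama's default port 11434). Instead of Neo4j it keeps the vectors in a plain Python list, just to show the principle: embed the text chunks, embed the question, and pick the most similar chunk by cosine similarity. The function names are mine, not part of the stack.

```python
import json
import math
import urllib.request

EMBED_URL = "http://localhost:11434/api/embeddings"  # Ollama's default endpoint

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def embed(text, model="mistral"):
    """Ask a locally running Ollama server for an embedding vector."""
    body = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(EMBED_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

def most_relevant_chunk(question, chunks, model="mistral"):
    """Retrieval step of RAG: return the chunk closest to the question."""
    q = embed(question, model)
    return max(chunks, key=lambda c: cosine(q, embed(c, model)))

if __name__ == "__main__":
    # Requires a running Ollama server with the model pulled (`ollama pull mistral`).
    docs = ["Invoices are stored under /finance.", "Recipes live in /kitchen."]
    print(most_relevant_chunk("Where do I find an invoice?", docs))
```

The retrieved chunk is then pasted into the prompt sent to the LLM, which is what reduces hallucinations; Neo4j replaces the in-memory list when you have many documents.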
This framework is in Python, but if you are not comfortable with Python, there are many other languages that include this kind of piece. Okay, so mainly what we have: if someone doesn't like Docker, no problem, you can still deploy this without it, but it is oriented to services. You have Ollama, which is the one providing the services for the LLM, and it can be used with a GPU or not. We are going to run this without a GPU, just using the regular CPU on my computer. This is slower; I recommend you to use a GPU, but you can do it this way. And we are pulling all the models that we need, so we can use more than one model. With that, we can enrich the information for the operation with the Neo4j database, and we can develop an API with this framework. Okay, so these are the pieces. You have the project docker/genai-stack. Mainly, these pieces form a sample, and this sample is oriented to prompting: you ask questions and it replies. So we are going to do something a bit different from that, but the first sample you can try is this one. Okay, so these are all the LLMs that Ollama is able to manage today. As this is growing every day, there are likely more, but this was the list last week. Okay, so this is what you need to understand: obviously, the larger the better, but you need to take your resources into account. These ones are very small: 4 gigabytes of RAM and 2 gigabytes of storage, so you can run them on a laptop. And then, if you want to use something that is better, it is also larger in resources. And you can even use LLMs that require, I don't know, many different computers at once, okay? So today we are going to use the small kind of LLMs. It is also relevant to look at the license; the previous talk was just discussing licenses. This is also something relevant if you want to build something commercial or something open source or whatever: you need to take care of the license.
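The prompting sample boils down to a single HTTP call: Ollama exposes a `/api/generate` endpoint that takes the model name and the prompt as JSON. A minimal Python sketch, assuming an Ollama server is running locally with the `mistral` model already pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, prompt: str) -> dict:
    """JSON body for /api/generate; stream=False returns one complete answer."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the model's reply."""
    body = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Requires `ollama pull mistral` and a running Ollama server.
    print(ask("mistral", "In one sentence, what is a content service platform?"))
```

Everything built later in the talk is layered on calls of this shape, with LangChain handling the plumbing.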
Also, you can see that there is an odd license there, because you have this Llama 2 community license agreement: some people say it is open source, some other people say it is not. So it's something different; if you don't see an Apache license or something you can recognize, better to check the conditions. So you have a lot of them to choose from. Today we are going to work on the demo with Mistral 7B, from a French company producing this kind of LLMs, which has more or less the same performance as GPT-3.5, so it's good enough. As for what is open source: the LLM is free to download and to use, but the training data is not free, and likely it contains some copyrighted material. We don't know, because it's not free. So on the Nextcloud ethical AI rating we have, sorry, yellow. I thought it was orange, but it's yellow. Okay, it's more or less fine; we are only missing one level. That was for text. For pictures, we need an LLM with a visual encoder on it, so for that part we are going to use LLaVA. And LLaVA really meets all the different requirements, so we are using a green-rated LLM for that other sample. Okay, perfect. So the whole demo is running on my computer while I'm giving the presentation. I have everything running inside: it's 32 gigabytes of RAM and an ARM64 architecture, so it's not x86; it's a MacBook Pro from two years ago, something like that. Okay. If we review what we had before this GenAI momentum, we also had data extraction, text recognition, text classification, content analysis. Is anyone using content analysis for a real use case? Okay, not only me then. So we had all these things, right, some kind of automation. But now with GenAI we also have more powerful classification. We could classify in the past, but now we can classify better. We can also translate; when I say translate, we are going to see the demo later. Obviously we can translate.
But we can also interact with the LLM in one language and get the response in another language, right? That is the difference. We can also summarize a text; this is the most common use case. And we can describe a picture. Prompting: obviously we can use prompting, we can read that. So we have some new features that we can use on our documents, and we are going to see some of them implemented. Okay. So what is this project about? The link is at some point in the slides; if not, I will give it to you. In this project, what is created is an API, using all this infrastructure, in order to provide different services. We are using some LLM embeddings, so we are trying to avoid hallucinations by giving some additional information from the document to the database. We are working with a document, right? We are not going into search, we are not going into other applications of GenAI; we are focused on features for a document. So we are adding all that information so we can get a better response, more suitable to the document we are dealing with. And for that we are using Mistral. And if we are talking about a picture, then we can use the other LLM, which was LLaVA, in order, for instance, to describe or classify the picture. We can also choose the LLM: if you want to use some other LLM than Mistral for text, you can do that, and you can choose some other LLM with a visual encoder enabled, like LLaVA or others on the list. And we can also choose the language; we are going to see that later. We can drop a document in Japanese and get the summary in English, or the other way around. And you can also choose some numbers, like the summary size and so on. So these are parameters. Okay, so this is the API, with pretty simple invocations. But let's see it live; as always, that is better. Can you see it better? Okay.
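The exact routes and parameter names of the demo API are not shown in the transcript, so the path (`summary`), the parameters (`language`, `model`) and the port below are purely illustrative assumptions; this only sketches how one of the document operations might be invoked over HTTP:

```python
from urllib.parse import urlencode

def build_operation_url(base_url: str, operation: str, params: dict) -> str:
    # Compose a request URL for one document operation
    # (summarize, classify, or prompt) with its parameters.
    return f"{base_url}/{operation}?{urlencode(params)}"

# Hypothetical call: summarize in English using the Mistral model.
url = build_operation_url("http://localhost:8506", "summary",
                          {"language": "English", "model": "mistral"})
```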
So, for instance, I'm going to work with, let me find it, this document. I could be using an English document, but that would be easier for the AI, so we are using this one. And I'm also going to use this picture, for your reference. Okay, perfect. So for this document, we are going to ask for a summary: give me a summary of this document that is in Japanese. So with that, if I'm able to... okay. This is running on my computer; I have this GenAI stack running in this Docker deployment, and it's getting the request. And with that, I'm getting the answer. So the text, this is a problem with a kindergarten in Japan, blah, blah, blah. Okay, that's fine. I'm giving it something in Japanese and getting the summary in English. The second one, come on, not this one... I did it. Okay, the second one is to classify: classify a document by picking a term from a list of terms. So I want it to classify this document as Japanese, Spanish or Vietnamese. Again, it's an easy example, but you can choose whatever list of values. So if I say classify this document into one of these three categories, the term is Japanese, because the document is in Japanese. This is also relevant for classification. And finally, we can also run a prompt on the document: what is the name of the zone in this Japanese document? The name of the zone is Musoku. Okay. So, three different features that we can use on a document. You can build more; again, it's a Python program with these three specific features, but you can grow it to include something else. And if we move to the pictures: that was for text, but for pictures, we can describe this picture. We could also extract some things before, this is a person and so on, but describing is the new thing that GenAI provides for us.
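The classify operation described here is essentially constrained generation: ask the model to pick exactly one term from a closed list, then validate the reply against that list. A minimal sketch; the prompt wording and helper names are mine, not the project's:

```python
def classification_prompt(text: str, terms: list[str]) -> str:
    # Ask the LLM to answer with exactly one term from a closed list.
    return ("Classify the following document. Reply with exactly one term "
            "from this list and nothing else: " + ", ".join(terms)
            + "\n\n" + text)

def validate_term(reply: str, terms: list[str]):
    # Accept the model's answer only if it matches one of the allowed terms;
    # return None for off-list replies so the caller can retry or flag them.
    cleaned = reply.strip().rstrip(".").lower()
    for term in terms:
        if term.lower() == cleaned:
            return term
    return None
```

Validating the reply matters because even a well-prompted model occasionally answers outside the list.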
This is a bit slower, but in the end, there it is: some man posing for the camera. He's wearing a green beanie, glasses, a black hoodie, and the lanyard says "Air Fraked". Well, no, it's Alfresco, but more or less. Okay, the picture was not big enough, but it's fine. It's something that is useful, and it's not consuming external resources, because it's running on my machine. So it's fair enough. Okay. Once we have all these features, and we have this Python program, let me show you a bit. So this is the project, aborroy/alfresco-genai, and you have the GenAI stack; mainly it's a Python program with all these endpoints: describe, classify, prompt, and summary. It's no more than that. Okay. If we go back to the original goal, it is to integrate this kind of operations with our product, which in our case is Alfresco. Alfresco can also be deployed in Docker or whatever you want, and we have two different APIs. The first one is the classic REST API, and the second one is a messaging API; synchronous and asynchronous. So if you have existing content in the repository, say a folder with 100 pictures, and you want to describe them, you can use the REST API to get the document, apply the operation, and update the document. And that's fine, because you can make a batch with that; you have all the operations available. And if you want to do that more dynamically, when people drop a document, then perform the action, you have the messaging API, the asynchronous API. You can listen to the event: okay, there is a new picture, and this picture needs to be summarized; I'm going to summarize the picture, and it gets updated. Okay, so these are the two different patterns we can apply. What we are going to see now, again live, everything running on my laptop, just believe me, is something that allows us to classify a document.
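The asynchronous pattern boils down to an event handler: listen for node-created events and pick an operation based on the content type. The event field names and the routing rule below are assumptions for illustration, not the actual Alfresco message schema:

```python
def choose_operation(mimetype: str) -> str:
    # Pictures get described; everything else gets summarized
    # (an illustrative routing rule, not the project's).
    return "describe" if mimetype.startswith("image/") else "summarize"

def handle_event(event: dict):
    # React only to node-created events, returning (node id, operation).
    # Field names ("type", "nodeId", "mimetype") are hypothetical.
    if event.get("type") != "node-created":
        return None
    return event["nodeId"], choose_operation(event.get("mimetype", "text/plain"))
```

The synchronous REST pattern would instead iterate over the existing nodes of a folder and apply the same `choose_operation` decision in a batch loop.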
So we are going to upload a document. We are creating this rule; the rule is the same, just so you can see the similarity with what came before. We have a list of languages: Japanese, Vietnamese, English, whatever. And we are creating a rule to move the document to the right folder: you drop a document, and the document is moved to the right folder. Okay, so let's do that. Let's open Alfresco. There is a folder at some point, and this folder has a rule that classifies the documents I drop on it. So if I, for instance, go to classify, no, to classified things, we are going to try with a Vietnamese one; we have to be a bit creative. Okay. So at this point, Alfresco is listening for this new document and classifying it. It's selecting a term from the list of terms, and the document has been updated, so it has been classified. If I refresh, what I find is that the document is in the Vietnamese folder. And you can do that with invoices, with whatever you want. And we can track that it was Mistral, the LLM, that created this classification. Okay, pretty easy, right? You can integrate all the other operations in the same way to get some automation. Okay, so I guess I was running out of time, but no problem, so we have more time for questions. So again, this is a simple framework. You can deploy it on premise, you can choose your LLM, you have an initial REST API for operations; pull requests are welcome. And then you need to integrate it with your product, with your organization, or whatever. There is also an interesting hackathon with more use cases: I presented some use cases, but you have more of them in this hackathon. The slides are available on the FOSDEM site. Okay. And also, I'm using Ollama, but there are many other alternatives; you don't need to choose Ollama. You have GPT4All, LocalAI...
LocalAI is the solution used by Nextcloud; there is Second State; Hugging Face is probably the most known. But again, this is an initial framework: take it as it is and try some things with GenAI. Okay, that was all. Thanks. Thank you very much, Angel. Are there any questions? I'm going to take them in order. Thank you, Angel. It seems to me all these operations are on one picture or one document; are you also considering asking a question across all my documents? No. This sample is only for a single document or a single picture. But that is as easy as: you have the Neo4j database, and you can include as much information as you want for a single document or a single query. What I'm doing in the source code is removing the previous information, so the context is only for a single document. But you can modify that to add more than one document to one query. In the sample it's only one document or one picture. While summarizing the Japanese PDF, why did you need to provide your picture for context? Sorry? You showed the summarization of the Japanese PDF, and then you provided the picture for context. No, no, the picture was for the last operation. The first three operations, summarize, classify and prompting, were related to the document in Japanese. I could use some other document; I just love that document, because I've been using it for testing for 15 years, something like that. It's like my precious document. And the picture was there for the last one; it was the description of that picture, which is more or less like yours, then. Thank you. Similar to the previous question, but for a single document: what about summarization of very large documents? Yeah. The problem is that, again, I'm running on my laptop, so I cannot use a very large document, but I was trying to summarize, for instance, books.
Do you know the Gutenberg project? On the Gutenberg project you have all the classics, Alice in Wonderland and so on. I was trying to do that with that kind of documents, and it's able to do it; it takes a while, like minutes on my machine. Again, if instead of using the regular CPU you use a GPU, the timing is, I don't know, 100 times faster, something like that. I don't know; I need to make serious tests with that. But with the right infrastructure, I guess the performance is enough. It's not instantaneous, right, but you can work with it. Thank you very much. Any other questions? Yes. Hi, a follow-up on the previous question. Was it the insertion into the vector database that took a lot of time, or the actual query to the LLM? Because the insertion into the vector database has to be done once, whereas the query can be done multiple times if you have already vectorized the document, right? Yeah. So again, I was not trying to deliver a session on how to develop AI; it was just to create a framework. You have the AI track that can answer that better than me. But yeah, obviously, you can reuse the database. I'm only using the database for the context of a single document, so you can create categories, you can add more than one document, you can also add the links to the response, and so on. Sorry, maybe I didn't understand you. Maybe you misunderstood my question: when you added the Alice in Wonderland book, was it the vectorization that took time, or the query to the LLM? No, no, it was the vectorization, the vectorization of the chunks of the document. Okay. Sorry, that was the only question. I'm not an expert, but I know a bit. Any other question? Okay. Thanks. Okay, one more question, the last one. I'll be around, so if someone wants to catch me...
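The vectorization step the answer refers to means splitting the document into chunks, embedding each chunk, and inserting the vectors into Neo4j; embedding is the slow part. A toy fixed-size chunker with overlap; the sizes are arbitrary, not the project's settings:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Split text into fixed-size chunks that overlap, so context is not
    # cut abruptly at chunk boundaries before embedding.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Because chunking and embedding happen once per document, re-running a query against an already vectorized book is much cheaper than the initial insertion.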
Can you say a bit more about the biggest use cases you see, and whether there are any open source setups of this out there for us to look at? In my opinion, the main use case of this is searching. But that is a different world with different beasts. For searching, AI is really quite relevant. So again, this is just to create a framework, and then it's up to your imagination. Thank you very much, Angel. Thanks. Thank you.
Cristal: a new Wiki UI to rule them all
Okay, so thank you everyone for joining. We'll soon start the talk about Cristal, a new wiki UI to rule them all, with Ludovic and Manuel. So I'm going to give you the mic. Okay, we're good. So hello everybody, welcome to this talk. We're going to talk a bit about the new project we have at XWiki. We're going to present the team first, who we are, then the product vision, the vision we have for this product. Since it's a new project, it's not ready; it's not something that's usable today. It's something we're going to build with a lot of energy. Then we're going to show the design proposals we have for the UI, which is, from our belief, very important, then the technical architecture, which is another part we believe is very important, and then the current status and the roadmap. We call this project Cristal because we want it to be both beautiful on one side, but also, like a chemical crystal structure, to have a very nice and very well-done architecture. So first, who are we? I'm Ludovic Dubost, the CEO and founder of XWiki; we're going to talk about XWiki just after. Manuel is the tech lead of Cristal; he's going to talk to you about the architecture. The project also has Vincent Massol, our CTO, and Thiago, our designer on this project, and we have the support of the whole XWiki team. So XWiki SAS is a company that was established in 2004. For 20 years we've been working on wikis, so it's quite a long time. We have made friends in this endeavour of wikis. We believe wikis are very important, hence our tagline: knowledge is power. We are a self-funded company, and we have reached 4 million euros in revenue. We're also building the CryptPad software. We had 50% growth in 2023, and we have 60 employees in France, Romania, but also Germany and elsewhere. As I said, we have two software products, XWiki and CryptPad. We engage in digital sovereignty; we believe open source is really important for that.
And we have a business model to really try to fund the software; we believe it's very important to find a way to fund open source software. Open source software cannot be done without... The microphone is not working. Is it okay now? Okay, so I have to be careful, it turns around. So, the product vision. First, there have been lots of technological shifts in the latest years. For example, JavaScript is getting more and more mature: better development methodologies are coming to JavaScript. We come from the Java world, so we're very keen on great development methodology, and for a long time JavaScript was going a bit in every direction. Now we see that it's getting more organized, with development tools and better frameworks for developing JavaScript applications. Standards have evolved: you have web components and JavaScript modules that are working much better. You also have technologies such as JSON-LD or Solid that bring new capacities. There are also new paradigms: real time is becoming something that any application should have. We believe offline is important too, and the technologies allow it. We also see a convergence in the field of wikis, not only between wikis, whose features are getting similar, there's a better understanding of what the features of a wiki are, but also between wikis and, for example, drives. They're getting closer, so there are questions about how they could be similar applications. For example, we have always had attachments in wikis; maybe you could consider attachments or documents as wiki documents. So there is convergence in this area. We believe in a modular future. Jitsi's founder, Emil, mentioned last year at the end of his 20-years-of-Jitsi talk that, in open source, building in layers is going to be an approach that matters more and more, and it enables tremendous innovation.
So if you look at Jitsi, you have a Jitsi library that provides the video conferencing module, and you have the Jitsi application. I really believe in this: open source can really reach all of its power if you can reuse anything, and for that you need lots of modularity. We had reached a lot of modularity in XWiki, with everything being components in XWiki and Java, but now applications are way more client-side, and so you need the same level of modularity on the client side. We also need integrations between open source tools. There was a talk about openDesk, in which we are partners, where we need to bring open source applications together so that the whole suite of open source applications can replace Microsoft or proprietary applications. And we need to be able to integrate tools much more tightly; for this you need, again, modularity. We also got an opportunity to fund that work. We have won, with other companies as part of consortiums, actually three projects, and two of these projects include the work on Cristal, building this new UI. So we had this opportunity to fund it; we're able to get money for that, and so we have this big opportunity. We also have the opportunity to collaborate with partners. The partners of this project, unfortunately not open source, would also be users of that Cristal module for their applications, storing data in their own systems. We'll come to that when we talk about the vision for the product. So what's the vision of the product? It's actually one UI, one wiki UI, a modern one, that brings all the features you have in wikis today and that can support multiple back ends.
So you would have an application that is web, desktop, and mobile. This application would be extensible, very modular, but it would have a common data model behind it, supporting offline and real time, and it would be able to connect to different back ends. Of course it would be able to connect to XWiki: we built XWiki, and we want it to connect to XWiki and support all the features XWiki has, even the most advanced ones. But we also want it to do a basic wiki based on a file system stored locally on your computer. We also want it to work as a nice wiki with a Nextcloud back end, using WebDAV or Git. And we also want it to support a wiki storing data in an end-to-end encrypted system such as CryptPad, which we also build ourselves at XWiki. And this application, as a whole, where you can activate and deactivate modules, decide that you don't want certain features, change modules, replace modules, would also be embeddable. That means you could put it in a Nextcloud server and serve it from the Nextcloud server, you could put it in the XWiki server and serve it from the XWiki server to access XWiki data, or you could put it in any other application. That's the vision of the product, Cristal. The key concept is that we want a slick UI with a modern editor and slash commands, and multiple back ends, as I mentioned. Slick UI means it needs to be as good as what Notion does today in the world of wikis in terms of UI, or the Notion competitors we see coming up. We believe the Notion competitors are nice because they support a lot of nice UI features, but they don't support the modularity that Cristal will have. It's going to be offline and real time, it's going to have accessibility by default, support web components, and also be sustainable.
There was a very nice talk earlier about sustainability of software, about measuring the consumption of software. We want to try to do that with Cristal too: we want Cristal to be a UI built in a way that consumes less. It's going to be available as web and desktop, later mobile. It's going to be extensible and configurable, and it's going to have a strong editor. I'm not going to go into details of what a strong editor is; it's going to support Markdown, but also the XWiki syntax, with a state-of-the-art UI. Lately in XWiki we implemented slash commands; it's going to have slash commands in Cristal too. It's also going to support structured data. That's one of the big advantages of XWiki that our customers and users have loved compared to other wikis: we have a whole system around structured applications and structured data, and we're going to support that in Cristal. Some use cases: we want it to be a UI for simple storage, Markdown, so it should work as a simple wiki. The idea is that it can be a local note-taking app you use offline with local storage, and that would be really interesting. It's going to be a modernization for XWiki, because XWiki has a UI that's quite old now; we have done a lot of things with this UI, but we want Cristal to be the modernization of the wiki UI in XWiki. It's going to be embeddable, as I mentioned, and we want it to be a wiki view on all your wikis, so that as an individual user you would have multiple wikis in your Cristal UI. You could even create a wiki of wikis: you could create your own tree of pages and navigate different pages in different wikis that you have in back ends and locally, of course. It can be an end-to-end encrypted wiki for CryptPad, which is a feature we would love to have. So we can summarize it as: a new wiki UI to rule them all. For the design proposal, I hand over to Manuel. Is it working?
We can only share the first results of the work by Thiago, the UX engineer we hired a few months ago. Since we have some experience with XWiki, we are not starting from a blank slate; we are using the experience we already have to design a cleaner and more modern UI for a wiki. That's one example, but we have documentation online where you can find other wireframes. Of course, everything is community-based, so you can come to the forum where we are openly discussing design ideas and contributions. One important aspect we want to work on is that, since we want Cristal to be embeddable, it cannot always come with its own style: it needs to look like the application where it's integrated. What we want is that, as a developer, when you design a part of Cristal, you design it with abstract UI components, and then by configuration you can say, I want to use Shoelace for the actual rendering, without much code, or, for this application, I want to use Vuetify; the point is to make it easy for developers to switch from one design system to another. It can even be convenient for, for instance, the French government, which has its own design system: if you want to build a knowledge base for the French government, it should be possible, by extension, to define a new concrete design system and use it for their own needs. We can imagine other use cases: Nextcloud has its own set of components, and if you want to have Cristal inside Nextcloud, you want it to look like Nextcloud and be seamlessly integrated in the ecosystem. So, a few notes on our technical view for the future. Starting Cristal was a very good opportunity to try new things, so I've spent a few months studying a lot of libraries we can use for Cristal. That's a snapshot of the things we have settled on for now. I went to the JavaScript room this morning, and now I have dozens of new technologies to check.
We have this page where all the choices we made are listed, and we maintain it over time. In terms of architecture, it starts easily with two main components, the web one and the Electron one, but there is a lot of work to do in the future, because the most challenging part is the integration of Cristal with XWiki: it's a 20-year-old project, so, as you can expect, there are a lot of features to be compatible with. Rich editing is very challenging: we need to choose a new technology for the editor that is compatible with offline editing and real-time editing. So that's a lot of work ahead, but we have plenty for our next roadmap. The key aspect we have in mind is that we want to preserve what we already have in XWiki and deem important: accessibility and sustainability. That comes with artifact size, of course, measuring performance, making Cristal locally usable, modular with inversion of control, based on standards as much as we can, for instance JSON-LD and web components, and keeping documentation for users and developers. To give a broad idea of the artifacts we want to publish: the abstract design system library, for others to develop design systems on top of Cristal; a set of connectors to different sources, as we said; a JavaScript parser for the syntax, to have offline editing with a rich experience; a software development kit to be able to develop extensions; and a set of components, web components in particular, because that's independent from any particular framework, which I believe is better for the long-term future of the project. For end users, we have the Electron application for desktop note-taking, and a replacement for the XWiki front end. So, I'll hand back. And now the tricky part: what's the status, and where are we today? The first thing is that we have a prototype of the extensible architecture using IoC, inversion of control.
That's actually a very important part of the way we've designed the application. People coming from the Java world understand what components in Java are and what inversion of control is, and this is something that is not used that much in the JavaScript world. It's used by frameworks: Vue.js or Angular are frameworks that do inversion of control, but when it comes to JavaScript libraries, it's not used that much. The key point, which is really important for extensibility and modularity, is that if you want to be able to replace one piece of the system because you want to change the way it behaves, you need to be able to replace any module for which you have defined an API, and inversion of control is a key method for doing that. In the prototype we did, we've been able to dynamically load, by configuration, a module coming from the Internet. You say in the configuration that you want this module instead of the other one, and from a static build, built as a standard Cristal delivery, you can add an extension that replaces one of the modules of the system. And this is key. We have designed the basic architecture of plugins, skins, and user interface extensions. In XWiki there is a great feature called skins and UIX. A skin is a way to replace the UI; a UIX is a way to add an item somewhere in the UI. If you want to add a feature to the product through an extension, you need extension points; UIX in XWiki is the way we do it. We have replicated these mechanisms in the Cristal prototype, so that you can add things through an extension, and we'll also replicate the fact that you can replace the skin. So, in addition to what Manuel explained about the abstract design system, which allows reimplementing the basic views and components we're using in the whole application, we can also replace pieces of the user interface.
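The inversion-of-control idea described here — any module with a defined API can be swapped by configuration or by an extension — can be reduced to a tiny registry. Cristal itself is JavaScript/TypeScript; this Python sketch is language-agnostic and only illustrates the mechanism, not Cristal's actual container:

```python
class Container:
    """Minimal IoC container: components are bound to an interface name,
    and a later binding (e.g. from a loaded extension) replaces the default."""

    def __init__(self):
        self._bindings = {}

    def bind(self, interface: str, factory):
        # Last binding wins, which is what lets an extension override a module.
        self._bindings[interface] = factory

    def resolve(self, interface: str):
        return self._bindings[interface]()

container = Container()
container.bind("storage", lambda: "default-xwiki-backend")
# An extension loaded by configuration overrides the default module:
container.bind("storage", lambda: "filesystem-backend")
```

Callers only ever ask the container for "storage"; they never name a concrete implementation, which is why the swap is invisible to the rest of the system.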
We have implemented XWiki and Markdown renderers. One difficulty was to bring a JavaScript renderer for the XWiki syntax: if we want to be compatible with XWiki, we want Markdown to be a first-class citizen in Cristal, because that's the standard today, but we also need to support our customers that are using the XWiki syntax with XWiki. We've also done prototypes of client-side macros, rendering a macro in Vue.js, so new macros. We've made our choices of design system libraries; the first ones we want to spend time on are Shoelace and Vuetify. One thing that was on the previous slide that Manuel didn't mention: Shoelace won one of our performance tests, actually twice as fast as Vuetify, and Shoelace is a web component library. We were quite impressed by that. Vuetify is a pure Vue library, while Shoelace is a cross-framework library of components, supporting React or Angular, etc. Really interesting work. We have done design work: we have a prototype UI for the basic view, we have a first test of the editor UI with Markdown and Tiptap, and we have the project infrastructure. You can check the code at the link I gave: cristal on GitHub, under xwiki-contrib. Basically, what we want to achieve in 2024 is a first version for basic wikis: you can browse, you can actually take notes in Markdown with the Electron application, you can access Git on the other side, and you can access a basic XWiki, without all the advanced features, maybe about 50% of XWiki's current features. During 2025, we will reach 75% of XWiki's features, including structured data. We want to bundle it in XWiki by 2026. We also want a plugin repository; we'll probably have that earlier, but we want to start having more plugins, more plugin development, and a CryptPad release. We probably want it as the default UI for XWiki by then, if we have done our work properly. That's it. You can look at our website, cristal.xwiki.org.
There is also very interesting information there for anybody building an advanced JavaScript application. We're not necessarily the biggest experts in JavaScript, we come from the Java world, as I said, but we have done a lot of studies of what the good technologies are, because we have a lot of experience in choosing libraries right. We're really trying to make comparison tables; we have tables about libraries and technologies, so don't hesitate to look at them. XWiki is also hiring: if you find this project interesting, you can join. If you're interested in what XWiki is about, I have a conference at 9am tomorrow, if you like to wake up early, in the K building. And we also have a party; you can scan the QR code if you want to join our party tonight. There's no room left? You can still try, it doesn't matter, there's a risk. Thank you. Questions? APPLAUSE Any questions? Do you have an example of an extension you're imagining or planning? First, any macros are extensions. If you want to add macros to your wiki, those are going to be extensions. If you look at XWiki, we have 650 extensions; we have at least 50 high-quality extensions that we don't bundle with XWiki. Lots of them are macros. Macros can be extensions, but an extension can also just add a feature: structured data would be an extension. We would not bundle it in the basic Cristal if you are not using XWiki as a back end, because the back end wouldn't support it; with XWiki, everything is supported. For us, anything will be an extension. The difference is that some will be bundled and some won't, but the storage system is an extension: access to Tiki can be an extension, access to GitHub is one, access to Git, access to the file system, they are all extensions. Thank you. Another question? No? No question? OK. Thank you very much. One last second: do you have a specific library for JSON-LD that you want to use? Can I repeat the question?
Is there a specific library for JSON-LD that we want to use? First, when we look at storage, there are two ways to do the abstract storage. One way is to hope that the server application will support JSON-LD by default. We'll actually do that for XWiki, to try to make XWiki give you JSON-LD by default. We believe that will be better, because we'll do conversions between XWiki's structured data and JSON-LD. That will be very interesting. In the Java world, we have found a Java JSON-LD library that is widely available. In the JavaScript world, at this point, we didn't feel we needed a library. That's just JSON that we can manipulate. At this point, we haven't seen the need for a library because we're just storing the JSON-LD data in the offline storage right away. Sorry, I forgot to say: the second way is to do the conversion to JSON-LD on the client. That means the storage module will use the standard API of the backend and then transform things to JSON-LD to give to the other Cristal modules, which understand JSON-LD. The conversion would be on the client, and then you store the result of that conversion offline so that you can do anything in the application. We didn't see the need yet, but we're not there yet. We did some tests of how XWiki data converted to JSON-LD would display when there is structured data in a page. We've been able to replicate things we do in XWiki on the client side in a similar way. We're not there yet. For now, we're focusing on the editing experience, which is the most important part for the beginning. Thank you. Thank you very much. Another question? Sorry, we'll take it outside.
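The second option described in this answer, converting the backend's standard API response to JSON-LD on the client, might look roughly like this. The field names and the schema.org mapping are illustrative assumptions, not the project's actual code.

```javascript
// Sketch: the storage module fetches a page through the backend's normal
// API, then converts the plain JSON into a JSON-LD document before other
// modules see it. All property names here are invented for the example.
function toJsonLd(page) {
  return {
    "@context": "https://schema.org",
    "@type": "Article",
    name: page.title,
    text: page.content,
    dateModified: page.modified,
  };
}

// A made-up backend response, standing in for e.g. a REST API result.
const apiResponse = { title: "Home", content: "Welcome", modified: "2024-02-03" };
const jsonLd = toJsonLd(apiResponse);
```

Storing `jsonLd` in the offline store, rather than the raw API shape, is what lets every other module work against one common data model regardless of backend.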
Pushing Tiki to its limits
Hello, so Tiki provides a very powerful and flexible database abstraction layer. Through a concrete example which extended over three years, we have learned a lot. As we start a similar project, we have time to reflect on lessons learned and pitfalls to avoid, so why not share everything with you. So first I'll describe the context, what the project was about, how we did it, what the challenges were, and what we learned, as a summary. So I'm Jean-Marc Libs. I discovered free software last century. I've been in the Tiki community since 2006, and I live in Strasbourg. I'm alone in front of you, but I don't want you to believe that I did all that alone. It was a team project. It was headed by EvoluData, and a lot of people helped. Some of them are in the room. The customers were the peak team from the Institut national de santé publique du Québec. The end users are medical testing laboratories. So that's the website. As you can see, everything is in French, but I'll translate as much as I can, and I translated before I did the screenshots. It's a way of cheating. This is the team; it's quality control, actually. And what they do is that every year they produce biomedical samples. They ship them to registered labs, so peak ships, and the labs have to register. They have to register because not all the labs do the same analyses. It depends on the machines they own. They have to be certified for all the analyses they can do, and so they have to choose them. Then they do the tests, and they send results, and peak analyses the results and sends reports and recommendations. And this is what they call one campaign, and there are many campaigns which are grouped together in a program, et cetera. That's one of the processes. They used to do that using faxes. So at first you think, hey, how hard is it to be better than fax? Actually, faxes are hugely flexible. So for example, different medical disciplines did things in different ways for totally valid reasons.
So we had to adapt, but they were also clever people, so they also used the project to kind of streamline and improve their processes. So we met in the middle. Everybody improved. And of course there are other processes, but I don't have time to explain everything; that's just an example. So yes, every year they also have to draft, review, approve, and publish the programs that people can register to afterwards, and manage those registrations. So in general, that's the website. If you don't have an account, and if you're not involved in a process, there's not much in it for you, even if you understand French. What's in it is this: this is, for example, the example of what I mentioned, the management of the programs. So they have all the interface they need. They can edit it. They can view it. They can go and edit the campaigns which are linked to the subprogram. They can go to other pages. There's a lot of it. Every table, as you can guess, is actually data linked to this program, but in other trackers, and sometimes in tables of other tables. So it's not simple, but this isn't too hard. That's the process where they approve. They discuss it. That's just comments, and then they validate it when they agree together. This is actually the same subprogram, but that's the end user view. So we have that flexibility, and also that's where they actually click when they want to register, as I said they would. But there is lots of variety. Here is another program where you have plenty more, combined. You can't click on them because it's not the time of the year when they register. So as I said, it's rather complex. How we did it: Tiki, in case you don't know, has plenty of features, and you have to choose the ones you want for each project. So basically, we use the wiki pages in Tiki in order to embed widgets, which we call plugins, in the wiki pages, and that's where the logic is. Well, you can also use them for documentation. We have file galleries.
We don't use them a lot, but there are some documents to share. Trackers is the huge thing. Trackers is the Tiki name for the database abstraction layer, because it started as a bug tracker which grew and grew and grew, and now it's a full-fledged database abstraction, but it's hard to rename things afterwards. In fact, each tracker item still has a status, open, pending, et cetera, and we use it. The categories are useful because that's what we use for the permission system. The scheduler, I'll get back to it; it will be simpler. For the performance-related features, the main one is that, well, when you have a lot of data, the important thing is how you search and index it, and the default is MySQL full text, but you really need to install Elasticsearch for that. We really had to, because there are too many limitations, especially in the number of fields that MySQL can do. And for the rest, basically, we had to raise everything, and it's easy to do because we do it within Tiki; it's just configurable. So all the time we doubled some memory limits, et cetera. So, trackers. You can basically think about trackers as tables. Each tracker is a table. We have 86 of them so far and still growing. This is the tracker admin view, which end users don't see, but the customers love it because they feel empowered. They can see what's going on. They can edit stuff. We have activated inline editing. So when you see that, you can click on any of those little widgets and edit what's there, correct a typo, filter on what you want to see, sort on every column, et cetera. So that allows you to do a lot of things without bothering to set up a whole workflow, and it's really useful. So I said that trackers are like tables, and tables have fields. So there are plenty of kinds of fields, and you can just add them, et cetera.
Among the useful ones, the auto-increment one is really practical because it allows you to access and display the item ID of each tracker item. Item link is super powerful. The item link, well, if you're familiar with SQL, think about foreign keys. The item link links to another tracker. So when you edit the item, you have a selection of items from the other tracker, and you can link two trackers. You can link tracker items from one tracker to tracker items of another tracker. Once you manage to link those items, it is super useful because it does the rest for you. For example, as we said, these are the campaigns. Each campaign is linked to one subprogram, and the subprogram has a year. So the campaign just gets the year from the associated subprogram. You don't have to do double entry, et cetera. So you get those data. It's all indexed together, and when you display the campaign, you have all these values from other trackers. So when you start to link trackers, as you can guess, it starts to look like, you know, a database schematic. That's a schematic I did, by the way, in a wiki page, in a Tiki page, with a draw widget. I needed it for a workflow because otherwise I couldn't figure out what to do. And so I linked all the item links and the item lists, and I put colors, because the colors are about the fact that when you link tracker items and you delete or change the status of a tracker item, you may want the related tracker item to be also deleted, or its status to change, or not. That's configurable, and that's why I wanted to keep track of it. Yeah, and that's still not all 86 trackers. So, how we dealt with source management. We had three, not four, environments. We set up a dedicated private GitLab repo. We had our branches, and we stuck dev and test on those branches, so every commit would instantly update the site.
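Going back to the item-link fields described a moment ago: the effect of a campaign inheriting the year of its linked subprogram can be modelled as a small denormalisation step at index time. This is a plain JavaScript sketch with invented tracker and field names, not Tiki's implementation.

```javascript
// Two "trackers" modelled as plain data. The subprogram is the linked-to
// tracker; the campaign holds a foreign-key-like reference to it.
const subprograms = new Map([
  [101, { id: 101, name: "Biochemistry", year: 2024 }],
]);

const campaigns = [
  { id: 1, name: "Campaign A", subprogramId: 101 },
];

// Resolve the link when indexing/displaying, so the year is never entered
// twice: it always comes from the associated subprogram.
function indexCampaign(campaign) {
  const subprogram = subprograms.get(campaign.subprogramId);
  return { ...campaign, year: subprogram ? subprogram.year : null };
}
```

This is the "it's all indexed together" behaviour: the displayed campaign carries values that physically live in another tracker.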
We get from one to the other by merging, and production is not tracking a branch. When the staging environment we called test is approved, we create a tag with the date, and we run that in production. And so that means that, well, we have auditability: our versioning system tells us what we were running at what time, how it evolved, what we had in production at a given time, if we want to recreate production at a former date when you hit a bug and you try to figure out, is that a regression, or is it something we missed last year and it was already there? We were very careful with all our commits. We made sure we do not edit the Tiki files, as much as we can. We add our templates in our theme or in the custom translations. That means that when we do a merge and we want to get the novelties from the Tiki community code and the security improvements, we do not get merge conflicts. The database management is just the opposite flow, because the reference is in production. That's where we have the real data. Some of this data has been entered by end users. Some of it is those wiki pages we edit; I'll show you later why we have code in our wiki pages. The nice thing is that we can track that. We synchronize test and dev from that. Then we do experiments. Then we get that validated. If it's okay, that's the approved edit. Then we synchronize. Tiki takes care of keeping a history of changes in the wiki pages and in the tracker items. There's an activity log, and that's how we get our auditability for that part. I just said that all our environments are running the same database. You may get how this is an issue. What we do is that one single file here is not versioned. This one is specific to each Tiki, because this is the one which has the database credentials, and it also has a link to a configuration file, which can be versioned, because we have a section per environment in the configuration file.
That means that in the same configuration file, each environment uses a different section. In this section, we can override any Tiki preference. This has two very big advantages. The first one is that all the security preferences and others can be set in that file and cannot be accidentally modified through the Tiki admin panel. The other is that we can have different things in different sections. That allows us, for example, to ensure that only the production server can send email notifications. You do not want your end users to get notifications from a test server or a dev server that they're not supposed to know about. What else? Yes, you can change the browser's title, you can change the theme options, and end up having your browser tabs like this, with different colors when you are working in production or in staging or in dev. That avoids big mistakes when you are editing a site: you want to be sure that you're not editing prod when you want to do stuff in dev. So there's still the part about how you do that. Tiki has a no-code, low-code approach, but at some stage you just have to accept that the project is really complex and go beyond that. The great thing is that there are options for doing really complicated stuff. These are basically the list widgets, which we call plugins. The list widget is super useful because that's what allows you to display stuff which comes from anywhere in Tiki, but here we are only interested in the tracker items. ListExecute is very similar to List, but that's not for displaying. That's for listing stuff and doing things on a whole bunch of tracker items at the same time, like deleting them or changing their status. Custom search is also closely related, and this is for allowing people to do searches, to filter, to let end users have control in this case. So that's a list widget example. You are not going to understand how it works, we don't have the time. We ourselves have that documentation page. We spent a lot of time.
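The per-environment configuration sections described at the start of this passage can be illustrated with a simple merge of shared defaults and an environment-specific override section. The keys and values here are made up for the example; Tiki's real preference names differ.

```javascript
// One versioned configuration object with a section per environment.
// Only prod is allowed to send notification emails; each environment
// gets its own tab title and theme color to avoid editing the wrong site.
const config = {
  defaults: { sendEmails: false, browserTitle: "Wiki", themeColor: "grey" },
  sections: {
    dev:  { browserTitle: "Wiki DEV",  themeColor: "green" },
    test: { browserTitle: "Wiki TEST", themeColor: "orange" },
    prod: { sendEmails: true, themeColor: "red" },
  },
};

// The environment's section overrides the shared defaults.
function preferencesFor(env) {
  return { ...config.defaults, ...(config.sections[env] || {}) };
}
```

Because the overrides live in a versioned file rather than the admin panel, they cannot be accidentally changed through the UI, which is the first advantage mentioned above.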
It's plenty of info. Everything is there, there are examples and all that. We spent a lot of time on it. Basically, the general idea is that this is something we can put in a wiki page. There is a section which says, with filters, what we are going to display. There is a section about how we want to output it. There are predefined templates, but if we want full control, we just give a Smarty TPL file, and then you can code whatever you like. You can even change the formatting before it gets to the template. And if your filter doesn't match anything, there is an alternate section. So that allows you to do all the pages you saw before. You have to realize that when I say you can do whatever you like in the template, one of the things you can do in the template is call another list plugin. The syntax is slightly different from the wiki page, and that allows you to collect information from trackers which are linked to other trackers, et cetera. And you can go on and on and on if you like. There are no limits at this point. So that's basically what we used nearly all the time, for all the pages, for all the workflows. The scheduler is also really useful, because sometimes some processes are just too complex; there are too many special cases and all that. We had especially the scoring system. We just wrote a script which was directly doing the calculation and updating the values in the database. And the scheduler is our way of ensuring that things can run whenever you like, for example by night, because luckily neither our customers nor the end users really wanted to work outside of working hours. So we can run everything we like during the night, especially nightly scripts for calculating scores, or index rebuilds, whatever. So what were the challenges? One of the challenges we had was that page, because that page was awesome, because we had lots of information.
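The three parts of the list plugin described above, a filter section, an output template, and an alternate for empty results, can be modelled roughly like this in plain JavaScript. This is only the idea, not Tiki's actual plugin syntax.

```javascript
// Minimal model of a list widget: filter what to show, render each match
// through an output template, fall back to "alternate" when nothing matches.
function listWidget(items, { filter, output, alternate }) {
  const matches = items.filter(filter);
  if (matches.length === 0) return alternate;
  return matches.map(output).join("\n");
}

// Invented tracker items for the example.
const items = [
  { type: "campaign", name: "A", year: 2024 },
  { type: "campaign", name: "B", year: 2023 },
];

const html = listWidget(items, {
  filter: (i) => i.year === 2024,
  output: (i) => `<li>${i.name} (${i.year})</li>`,
  alternate: "<p>No campaigns found.</p>",
});
```

The "templates can call another list plugin" remark corresponds to an `output` function that itself calls `listWidget` on a linked data set, which is how the nested joins in the talk were built.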
It doesn't show here, but actually those columns have related information which is in different trackers. So that's one of the cases where you have those templates which call another list plugin, which calls another list plugin, which calls another list plugin. So obviously the first year everything was great and you had everything here. We were using Table sorter. You can sort on any column, you can filter in these places, and you can move the pagination around. It was all client side, meaning all the data was in the page, and after the third year it starts to get you Cloudflare timeouts. So we have rewritten the templates in order to optimize, to do some caching ourselves in the code. And then we had to raise the memory limit, because, you know, trade-offs. But that's not a solution which is going to last forever. So it's solved for this year, and they want to have five years of data, I understand. We'll see that. So basically this will need to be rewritten using custom search, and just paginate. Or here they have the download button, because they want all that information in CSV so that they can do more data mining. So we will rewrite that, but let them download subsets of the data, and that should solve it. I'll talk about the CSV extracts here. That was another issue we had. Every link here generates a CSV file which, again, gets data from plenty of trackers. So for the big labs we did have some timeouts, and we were about to do the same thing, you know, rewrite and optimize the TPL, et cetera. But luckily we had another idea, which was to talk with the customers, who explained that those data hardly ever change. So the solution is not to calculate them when you click on the button. We are just going to use some caching and have a nightly mechanism, or, you know, generate the caches at the right time and just link to a file. Yeah.
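The caching fix just described, generating the expensive CSV once (for example nightly) instead of on every click, boils down to something like this. All names are invented; the real system works on Tiki trackers, not in-memory arrays.

```javascript
// Cache of generated CSV exports, keyed by lab. In the real setup this
// would be files on disk written by a nightly scheduler job.
const csvCache = new Map();

// Stand-in for the expensive aggregation across many linked trackers.
function buildCsv(labId, rows) {
  return rows.map((r) => `${labId};${r.test};${r.result}`).join("\n");
}

// Clicking the download link serves the cached copy; only a nightly job
// (or an explicit invalidation) would rebuild it.
function getCsv(labId, rows) {
  if (!csvCache.has(labId)) {
    csvCache.set(labId, buildCsv(labId, rows));
  }
  return csvCache.get(labId);
}

const rows = [{ test: "glucose", result: "within range" }];
const csv = getCsv("lab42", rows);
```

The design point is the one the customers supplied: since the data hardly ever changes, correctness survives a cache that is only refreshed on a schedule.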
So, mainly, our lessons and improvements, given that, as I said, we are going to have a similar project which is about to start. We wanted to see what we learned. Well, essentially it worked. The customers were happy, but we can still improve stuff. So what we are going to improve is to use more sophisticated Tiki permission mechanisms, which are called templated groups. That's for the permissions about what people are allowed to see depending on the groups they are in. We just used a simple way, and then we had to add another layer of security in the Smarty templates. We want to avoid that in the new project. Make sure all the layers of data are present in the design. Well, that's always hard, because it's always hard to realize that there is a missing table or tracker. It makes a lot of extra work to discover that too late. Then again, I'm totally convinced that it would be even worse if we were working in real SQL. The other lesson is, well, Table sorter is not a tool for data mining huge data sets. That's the summary of it. So you have to get your customers to accept that sometimes they have to use pagination and not have everything available; there are technical limits. Same thing for generating huge CSVs. And also, we have taken advantage of this: we are going to improve the list plugin, which will be expanded with sublist sections, which basically will allow us to do joins without having to do that in TPL files. And that's about it. Thank you, Jean-Marc.
How to get rid of Confluence: Comparing Open Source Knowledge Management Systems
So, hello, I'm Markus Feilner. Some of you may know me. I've been around the Linux and open source world since 1994. I started really early with Linux and had three operating systems at the time on my computer. And since the early 2000s, I've been an open source journalist, working for Linux Magazine and also the Heise iX tech magazine. And super, thank you. So, wonderful. Perfect. So, I've done a lot of things. I was also a team leader at SUSE, as documentation team lead. And yeah, having done lots of things... We don't have that much time, so I'd better be fast now. But within this talk, I have a lot of links and hints for you on where to go to get much more information. Because all of you that are here are probably only here because you heard the term Confluence and you like it a lot, like everybody does. And we started six minutes late, but we have 15 minutes left, I guess. Good. Okay. You ping me five minutes before we're done. Thanks. So, this is a presentation that is kind of typical for me. I'm going to rush through a lot of topics, but there are a lot of links inside, and you can go and find a lot of things. If you're not a German native speaker, you will find some articles that I wrote which you'll have to translate. But the best thing, something we did in December or November last year, is a large table of lots of open source alternatives to Atlassian. And I'm not going to dive deep into the things that happen with Atlassian and the 15th of February coming up. I'm sure you're all aware of that, that the support is running out. So, that's a different thing. I'm going to talk a little bit about knowledge management and about a concept that we found at SUSE in the documentation team that I called Agile Recursive Documentation. And generally the problem in knowledge management, and that is what Atlassian actually is about, is that we all sort of mis-underestimate the problem. I'm doing this Bush reference on purpose.
But it's like an iceberg, and this is probably the iceberg that hit the Titanic, an original photo that I found. And I've been using it in knowledge management presentations because, like icebergs, there is implicit and explicit knowledge in companies. You have a lot of knowledge that is documented, that is fine, that everybody knows, but there is a lot of, Rumsfeld reference, unknown unknowns. And about 80% of knowledge in companies today is assumed to be implicit knowledge. So, knowledge that is there, but nobody knows that it's there, and people just do it. And just announcing that you're doing knowledge management now will not help against that and will not mitigate it in any way, because you have to take the people with you. Oh, I forgot something. The implicit knowledge also refers to, for example, people that go on a longer holiday, or that go into old age pension, and that are gone. And I can tell a lot of stories about what happens when the people are gone and you have to find out what they actually did, and how they did it. Some people that I know had a Perl programmer in the company, and the Perl programmer had named all of the processes and scripts of his whole big setup after characters from The Simpsons. So, they found a process that was called APU, and they were like, what does it do? And then they found out, oh, it forks a lot. And only once they knew this could they figure out what the whole Perl thing does. There are many stories like that. But in the end, you have to inspire and motivate your people to follow you. So, you need solutions that work, with processes and software that work, and that the people in your company like. With as many of you as are here, there will be lots and lots of different solutions in the end. Because documentation and knowledge management is teamwork.
That's one article I wrote for the Linux Magazine together with my former colleagues from SUSE. It's always teamwork, and you always have to work together, and you have lots and lots of different people in the company. When I was a consultant, I once heard that we are the pathfinders, the mountain guides. We are the ones that find the trails, and then we tell others how to walk them. And as we have different people in the mountains, you have the locals, you have us, the pathfinders, you have the tourists, and the locals and other beings. So, you have to make sure that everybody understands what you're talking about when you give them a description of something. And for that, we have the engineering part, or the scientific background. And every one of these is so huge and large a topic that you can do university studies on some of them: knowledge management itself, organizing knowledge, process management, quality assurance, and then all of that basically combined into things like knowledge process quality management, that is KPQM, yes. And at the end, there has to be a presentation layer. That's what the people see. That's basically the editor with which we work. But in the background, you have to do lots of ordering, indexing with metadata. Anybody here who knows the term RDF? Still, the semantic web and all of that stuff. Yes, that's the background. And then you have taxonomies, terms, terminologies, registers, tables of contents, notations, catalogs. It's a huge scientific realm, and you can read books on each of those. But for a company, it's important that you do the needful, not everything. And the representation, showing it to the readers, the customers, is actually the mapping of the information. How do you do this, with models, glossaries, how-tos, encyclopedias, documentation? And you see, the type, the form of how you present things is already coming in with glossaries.
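The metadata and RDF indexing mentioned above can be illustrated with a toy subject–predicate–object store, the basic shape behind semantic-web style knowledge organisation. This is a deliberately minimal sketch, not a real RDF library, and all identifiers are invented.

```javascript
// Knowledge entries indexed as subject–predicate–object triples,
// the core data model of RDF.
const triples = [
  ["howto-backup", "hasTopic", "backups"],
  ["howto-backup", "hasAudience", "admins"],
  ["glossary-vm", "hasTopic", "virtualization"],
];

// Query: which documents carry a given predicate/object pair?
function query(predicate, object) {
  return triples
    .filter(([, p, o]) => p === predicate && o === object)
    .map(([subject]) => subject);
}
```

Taxonomies, registers, and catalogs are, in this view, just conventions for which predicates and objects you allow, which is why the talk treats them as layers on top of the same indexing problem.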
And well, of course, what I'm going to tell you is the way, the right way. I found this yesterday in Brussels here, and I really like it. Maybe some of you remember Magritte: this is not a pipe. And somebody painted this on his garage door: this is the way. And I still have to dive into the apple; I think the apple is also a Magritte thing. But the cloud, I really like that. So, of course, what I'm presenting to you is the way to do it. No, it's just a suggestion. Because if you don't have time for all that scientific research and all of that, and that is the usual case, then you're not alone. It's usual in companies: we don't have time to do the documentation, we don't have time for all that. And then we were at SUSE in the situation that we had five people in the doc team, and we were told to grow to 10 people fast. And the people in the documentation team said, we don't have time to teach the new ones. So, what we did was create this agile recursive onboarding. That means we have one new guy in the team, and this new guy will be, for example, the mentor for the next new guy or girl. And we created a mentor who is in charge of teaching the new guy. I described this, for example, in an article in Linux Magazine. And the reason why we call it agile is only because we're using agile tools and methods, like a Kanban board. So, this is what the new guy saw. Not exactly this one, but it's structured like this. This is an easy tool in Nextcloud that you can use, for example, for the start. This is the task that the new guy had. Or these are the tasks that the mentor had to do before the first day. These are the tasks that the new guy would have to do on the first day. This is the first week, and so on. You know, Kanban, very easy. And out of this came the first documentation of the documentation job, the description of what he's doing.
And this is an individual board for a new team member. And this new person was from the start involved in making the team better, making the documentation and everything better, and so on. And it was recursive because the next one would again start at the same point, but with the new improvements from the last one. And that is exactly what you can do with documentation and knowledge management in a company. You can start documenting things and have people be a part of it. That is the most important thing: take them with you. And there are other things that are really cool, but usually only larger companies can apply them. There are things that Stack Overflow and Reddit, those companies, do really well. You can have your customers, your readers, do the signposting, a triage by user pain, so the important documentation items come first. And what nobody needs will disappear from the list, and topics that are interesting will go up in the lists and the documentation. So if you're writing on something nobody ever reads, maybe you could have invested your time better. But I'm jumping over to the tools, because that is what we are talking about. And yeah, decision making is also very clear. You have the important ones, the regular ones, the not important ones, and thereby you decide what to document, on this scale. But it's probably not you at FOSDEM who decides that. It's usually the management, because there's cost and risk involved. So this is the stuff that we need to document, because if this guy is run over by a car, we won't know how to do this process, and this will cost a lot of money, and customers will be angry at what we're delivering. The team, the knowledge or documentation team, has the expertise, but the management knows about money and risk. So that is why they have to actually define what to document. And the tool, oh, super, good. The tool that you're using should be the last decision that you take.
Otherwise you end up in technology-driven development or design. I don't know if you've heard that before. That is, for example, happening with AI a lot. We want to do something with AI, or we want to do something in Rust, or in this new framework that we bought, or in Go. Let's do something in Go, or whatsoever. It's not uncommon that a company buys some development framework and then they think, okay, what can we do with it? It should be the other way around. As in project management, you should see what you want, what you need, all the needs, and then the risk and the money involved, and then in the end look at which product matches your requirements best. Well, with Atlassian, that's usually not the case, also because it's just simply there, and for a long time it has been there without competition. And now in the last years, Atlassian did these moves to the cloud, this increase of license costs, and, well, of course, they have a good product, a viable product, and it's highly integrated with ticketing, and you remember Trello. And I like to call them something like the Microsoft of knowledge, because everybody in the development world uses it and it's hard to get around it. And most people are not very happy about it, especially since I learned that the price increase that comes now, or came now, for small companies can be up to 1,000% of what they have to pay, because the small bundles start at 500 users or something like that. And the usual increase is, I think, something like 10 times more than before. So, not that bad. And there's also an article on that, I don't have a screenshot in here, but they're forcing the users into the cloud. They say, no, we don't, because they can buy a data center license. That's the one that's like 100 times more expensive than before. But they reacted also to that. But there are other issues too.
They are an Australian company and thus part of the Five Eyes, which is the five countries that work very closely together in terms of the NSA stuff. So there are GDPR issues. They even told their customers: in 2018, when we were working with Trello at SUSE, a mail came in that told us we shouldn't put business secrets into Trello, because that might not comply with data protection rules in Europe, already back then. We're like, what the fuck are you doing? A knowledge management tool, and you're telling us not to put business knowledge in there? And then recently there have been more issues. There have also been security issues, but that's okay, every software has that. But then there is also the fact that they are more focused on a global market than, for example, on a German or European market. So we had the situation, at a company I was working with, that we had severe issues with umlauts, the German umlaut Ä, and we opened a ticket, and the answer was, oh yeah, most of our customers are using English, we are very sorry about that. We're like, okay, good, but that's just basics. And so last year, and that's actually the core part of this talk, together with Tim Schürmann I did two articles in IT-Administrator where we took a deep dive into the open source alternatives to Atlassian. Each of them is five pages, plus a large chart that I have here; yeah, that's two pages. That wasn't in the printed version, that's only online, and this is the link to it. It's in the presentation that will be on the FOSDEM website. Yeah, and, nice, take the photos, tell me when you're done, then I go back. Good. So, we came to the conclusion that there are a lot of alternatives, and a lot of them are also facing a boom.
Some of them say: we have so many calls right now from people that want to get rid of Atlassian that we can't handle them all. One of these companies is run by friends of mine from Regensburg, where I come from in Bavaria, and they say most of their new customers are customers that don't want to use Atlassian anymore, they don't want to do the move, and now they have this deadline coming up on the 15th of February, when support is ending for their old product. So yeah, many customers are turning away from Atlassian; the priorities have shifted. As you see in this table, I'm going to show you a picture of each of those, and hopefully, if my brain is good enough, I'll say a few words about each of them. Five minutes, great. So in this list we have, and don't worry, I'll tell you more, this is just name-dropping for now: BlueSpice, BookStack, DokuWiki, Foswiki, MediaWiki, OpenKM, Outline, PmWiki, Wiki.js and XWiki. Those are the ones we compared in this list; I'm sure there are more out there, but we tried to address those that are open source and have enterprise support. (This is the wrong laptop.) And here is the first one: the biggest knowledge management software is of course Wikipedia, but Wikipedia, with the MediaWiki software, has one big flaw. Well, it scales obviously very well, it can run, I think, the seventh-largest website in Germany, but it has only one use case, and that is not the enterprise. Clearly, its use case is making Wikipedia run and work, forever and really well. I mean, you all know Wikipedia, and you all know this is already the new editor. I don't know if you've been to Wikipedia recently; last year or so they integrated this editor, so you don't have to work in wiki markup anymore, you can really type inside the text. And from them there is an enterprise distribution called BlueSpice; that's the guys from Regensburg that I've been
talking about. Disclaimer: they're friends of mine and I blog for their website, so you can make your own image of them. They have a lot of things that enterprises need, for example something like privacy administration, so they add usability and enterprise features on top. And they're open source; they're based on MediaWiki, and they have several editions, up to cloud, farm and SaaS versions and so on. Then we have XWiki; there was just a talk, so I hope I'm not saying anything that's wrong. XWiki is also very old, and they are in my opinion really interesting, because they build a lot of innovative features that go way beyond wikis, like their CryptPad, which is sort of an end-to-end, browser-based collaborative office space that is, if I understand it right, serverless. And then there is DokuWiki. DokuWiki is something that is often found in the scientific or educational realm. Okay, I've been working with BlueSpice, I've been working with XWiki, I've been working with DokuWiki, and DokuWiki I found at a company I was consulting for that comes from the space world, the German aeronautics and space field. They have a lot of people that are experts, and that's where, I'd say, the expert systems start. I think both XWiki and BlueSpice are something you don't need much expert knowledge to work with, but with DokuWiki it starts. DokuWiki is actually really cool, though, because it has some features that others don't have; for example, there's a shortcut, and then the page that you're working on becomes a presentation. Just one thing, but that's really cool. And then we have TWiki; I think it was TWiki or XWiki you just heard about in the previous talk. TWiki was the old project, and Foswiki and Q.wiki are forks
from it. They have a lot of extensions and a lot of features too, but I haven't really seen them that much in companies, and they are very, how do you say, you have to see if they work for you. For me they are not a valid Atlassian replacement, because you need to be an expert to use them. And there are other ones, there are a lot more; I found these four, and I think I'm good on time. There is, for example, BookStack, which is another very interesting project that works with books and shelves. BookStack has this imagery where knowledge is stored in books, on bookshelves, and in chapters inside books; they always use those metaphors from books. So they have pages, chapters, books and shelves, and this is the page for the access rules.
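BookStack's shelf/book/chapter/page metaphor described above is essentially a containment hierarchy; a minimal Python sketch of that model (only the entity names follow BookStack's terminology, the rest is illustrative, not BookStack's actual data model):

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    title: str

@dataclass
class Chapter:
    title: str
    pages: list = field(default_factory=list)

@dataclass
class Book:
    title: str
    chapters: list = field(default_factory=list)
    pages: list = field(default_factory=list)  # BookStack also allows pages directly in a book

@dataclass
class Shelf:
    title: str
    books: list = field(default_factory=list)

# Build a tiny knowledge base using the book metaphor
shelf = Shelf("Operations")
book = Book("Onboarding")
book.chapters.append(Chapter("First week", pages=[Page("Accounts"), Page("Tooling")]))
shelf.books.append(book)

def page_count(shelf: Shelf) -> int:
    """Count all pages reachable from a shelf, whether nested in chapters or not."""
    return sum(
        len(b.pages) + sum(len(c.pages) for c in b.chapters)
        for b in shelf.books
    )

print(page_count(shelf))  # 2
```

The point of the metaphor is that every piece of knowledge has an obvious physical place, which is what makes BookStack approachable for non-expert users.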
The Challenges of Creating a FOSS Fact-Checking Platform for the Brazilian Community
Okay, so thank you very much for staying for the last talk of the day, and the last talk of the year for the collaboration devroom. So thanks a lot; I see people that stayed a lot during the day, so thank you. The last presentation is by Matheus, on the challenges of creating a FOSS fact-checking platform for the Brazilian community. Thank you. Thank you. Yeah, can you hear me okay? Yeah. Thanks, I appreciate everyone being here; I thought there would be fewer people, so I really appreciate that. It's my first time at FOSDEM as well. Let me introduce myself: I am a senior product manager at the Wikimedia Foundation working on MediaWiki, but this project is a different hat that I wear. I'm a volunteer at an NGO in Brazil, and we are trying to combat fake news and build something that is open in software, in data and in knowledge. So what I want to talk about here is the mission of this project, and I'm going to go over the challenges of trying to do something against misinformation in Brazil. Brazil is very fertile soil for misinformation and disinformation, especially recently. For me, it all started from seeing how my own family was sharing and spreading misinformation. I'm a co-founder; I'm not the only volunteer on this project, but we started imagining a society where everyone can freely access and engage with true and reliable information, with autonomy. I kind of stole a little bit of the Wikimedia Foundation mission here, because we share similar values. So the mission of this project is to encourage educommunication. I'm not going to talk much about that term, because the other co-founder, who is a journalist, is the one that coined, or uses, it; basically it means that we can only achieve our mission if we actually educate people, and we need to communicate that.
So the platform, and everything the product entails, all focuses on a specific pilot. The idea is that, with the values of accessibility, credibility and autonomy, we're making autonomous individuals in Brazil able to access information, or at least question information, without losing credibility. When we started, there was this study from Kaspersky: more than 70% of Brazilians with internet access have believed fake news, and 62% of Brazilians failed to recognize false news. That study, at the time, was what motivated us to keep going. The challenges were immense, because they forced us to tweak planning, change and pivot a lot during the foundational years, which is what I call this timeline. I'm even excluding here all of the technical exploration that I did since 2018. In 2020, when I thought, okay, we have the technology and maybe we're ready to proceed, we signed up for the Mozilla Open Lab. We participated, received product mentorship, and prepared the ideation of what came to be AletheiaFact. That exploration with Mozilla was very interesting, because we came in as dreamers: we wanted to do something multilingual, for everyone, we wanted to fix the world, because the motto of the program we participated in was fixing the internet. And then we learned that it wouldn't be like that. So we would have a focus group; we would only focus on Brazil. We would look into people that are already engaging with fact-checking, or at least reading about fact-checking somehow: active readers, independent fact-checkers, even professional fact-checkers. And then we would even look into specific demographics, from the age of 18 to 29, which represents something like 16% of Brazil.
And we set a goal: if we get 0.1% of that, that's 35,000 potential independent fact-checkers, which would be an increase of 7,000% over the number of professional fact-checkers in Brazil. I'm going to talk about this a little bit. So we did this exploration at Mozilla Open Lab, then we started working on the infrastructure, launched it and tried to experiment more. We participated in the TTO, the Truth and Trust Online conference, and there we introduced the concept of the democratization of fact-checking in Brazil, which is something we are looking forward to. In the same year, we started a residency at Projeto Comprova, which is a group of news outlets and independent organizations that combat fake news in Brazil. By participating in that, we actually engaged with our personas, the professional fact-checkers. From that point, we started exploring a platform focused on professional fact-checkers only: we wanted to speed up their process, make sure the process was optimized, so they could actually chase fake news and combat it. With that in mind, we thought we were ready to formalize, and then we started understanding what that would entail. With these learnings, we defined that, with process transparency and didactic representation, we could enable a fact-checking manual, an operational guideline, and then we could replicate that and create the autonomy of the individuals that we wanted. So the methodology should be accessible and understandable, and that requirement forced us to align with the Creative Commons license for our data; everything we create, from courses to workshops, is also open and available. In the last year, and I'll talk about that too, we had multiple workshops and partnerships with universities in Brazil, all free and based on the knowledge-sharing proposal.
And the platform, the product that I mainly worked on, would be just a facilitator, a place people could use and engage with. Our main goal, we were here in 2021, was to reach the Brazilian elections with something that could be used for good. So we formalized, and then we participated in the GlobalFact 9 conference, which is organized by the IFCN, the International Fact-Checking Network. When we got there, we went to validate some of the use cases that we had built, and it was like a bucket of ice water over us, because we understood that there is a different dynamic happening in the fact-checking community. One thing is that they are very worried about being open, mostly because that can be weaponized by bad actors and create more problems for them. So the software they write is not always open source; it's not always open. Licenses: because they are mostly tied to news outlets, they don't always follow the Creative Commons license. So everything we had built to create a shared space, a public digital space for people, was not going to work with professional fact-checkers. We understood that, we pivoted, but we kept the same model and the same values, and we went forward and launched the platform with a few people, like 15 volunteers. On the platform, we created a process where we would listen to the debates during the Brazilian elections and do live fact-checking. And it was good. This is just a screenshot of part of the functionality: here, if something is highlighted, it means it was fact-checked by someone. The experiment went very well, and it was very small as well; if you look into our views, you're going to see that we only had about a thousand. But the impressions from the people, and what we were able to achieve, gave us good data to proceed.
But because this is a very small project, a very small organization, and we were dreaming big for a presidential election, trying to have an impact, of course there was a stretch there. What we learned is: there is a use case, we can do this, but we need to begin small. So we took a step back and looked into what we can really do with our resourcing. From that, we decided we would have three product pillars: fact-checker productivity, access to credible information, and reducing obstacles to participation. Talking about the last one first: because we were very small, it doesn't work like our reference, Wikipedia. You cannot just go in anonymously and do a fact-check, you cannot just create an account and start, because we don't have a governance model, or we didn't at the time. We had a lot of obstacles; we put obstacles in on purpose, so we could test with only a few people and understand what we needed to do. And now we need to remove those obstacles, because we are becoming more confident that this can be used by everyone, and we are going to have a procedure, a code of conduct and a governance model that will help. Access to credible information is the one we are not focusing on too much right now, because we believe that, from the model we have plus the productivity work, we are going to create the credible information; but access to it is a little bit different. We need to make sure that the audiences we serve are actually able to access this: they can access the platform, and we also provide good SEO using the ClaimReview schema, so content is searchable without people having to go to the platform.
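The ClaimReview markup mentioned above is a schema.org type that fact-checking pages embed (typically as JSON-LD) so that search engines can surface the verdict alongside results. A hedged Python sketch of producing minimal markup; only the field names come from the public schema.org/ClaimReview definition, while the helper function and all concrete values are made up for illustration:

```python
import json

def claim_review_jsonld(claim: str, rating: int, rating_name: str,
                        review_url: str, org_name: str) -> str:
    """Build minimal schema.org ClaimReview JSON-LD for a fact-check page."""
    doc = {
        "@context": "https://schema.org",
        "@type": "ClaimReview",
        "url": review_url,
        "claimReviewed": claim,
        "author": {"@type": "Organization", "name": org_name},
        "reviewRating": {
            "@type": "Rating",
            "ratingValue": rating,   # e.g. 1 (false) .. 5 (true)
            "bestRating": 5,
            "worstRating": 1,
            "alternateName": rating_name,  # human-readable verdict
        },
    }
    return json.dumps(doc, ensure_ascii=False, indent=2)

# Hypothetical example values, not real platform data
markup = claim_review_jsonld(
    claim="Example claim circulating on social media",
    rating=1,
    rating_name="False",
    review_url="https://example.org/reviews/123",
    org_name="Example Fact-Checking Initiative",
)
print(markup)
```

A page would ship this string inside a `<script type="application/ld+json">` tag, which is how crawlers find the verdict without users visiting the platform.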
However, in Brazil there is an inequality, a disparity of resources, that includes internet access, in the sense that people can have data plans that only work with WhatsApp but cannot access the rest of the internet; they can access Instagram, but they cannot access Wikipedia. So the access level is a totally different game, and we are still studying how to approach it; but even though it is not our focus, we should not lose sight of it, and that's why it's a pillar. The productivity pillar came from an opportunity we had: in 2023 we started a partnership with São Paulo State University, which I actually came from, I did my bachelor's there, and we had four interns. This is the other co-founder, our CEO, I like to call her my president. These four students were our focus group, in the sense that they are very fresh: first-year journalism students, they know nothing about life, more or less, maybe, I don't know, maybe they can learn from TikTok now, but the idea is that they should have no prior knowledge at all and just be willing to work on it. We had a very smooth process, and it was very refreshing, because they finished four months of internship with the same level of productivity as news outlets when delivering fact-checking material. Of course, these are different levels of comparison, but anyway. As I mentioned, the platform was launched a while ago; 2022 is when we did the experiments with the debates, and we see that, at least on the platform, we are increasing engagement with the functionality: we still have pretty much the same unique visitors, but people are using the platform more, because now we actually have productivity tooling for the team. These are just a few screenshots of the platform; the code is on GitHub and you can access the link at the end of the presentation.
And this is an example of a fact-checking report that is available after the fact. As I mentioned, we focus a lot on the productivity of the fact-checkers, so we started putting in place tools tied to a specific, flexible workflow. Technically speaking, we are using state machines here to control everything, and we adjusted the processes, adding different steps depending on what we learned with the team. The idea is to have visibility into productivity, and also to collect data and see: is this actually improving? Because if we actually reach the goal I mentioned before, 35,000 people checking facts in Brazil, it's going to be a very different thing to administrate and keep running smoothly. So yeah, these are, I think, the learnings. A few things I'd like to mention from the experiment: in Brazil the open source community is very spread out, not so well organized, so the whole period of creating and testing the software only captured six volunteers actively working on it. That may look pretty good, but I forced most of my friends to go there: hey, come on, you know how to QA things, can you help me QA this? You are a DevOps engineer, can you create a pipeline for me? So this was kind of a best effort from a community. But after all of this, a lot of entry-level engineers started looking for something to work on, and because we had partnerships with universities, we started having people just coming in and trying to learn with the software, which was a very good experience and something I'd like to explore moving forward. It requires more management on the technical side, being able to actually provide good feedback to them, but we now have two or three active volunteers that are 20 years old, just learned how to code, and now
actually provide good development for the platform. Of course there is skill to consider, but yeah, there was this challenge. As for the time frame: when I look at this project, five years is a lot for the stage we are at, but there were multiple factors. It's a product for a very specific area, a problem that no one has solved yet, and the only way I see moving forward is doubling down on the educational effort toward the actual goal, which is being able to provide credible information and stop the spread of misinformation. There is no other way around it: there is no other software, there is no AI, there is nothing that can help other than humans understanding what they are reading, and having something that is accessible. I also put multi-generational here, because in Brazil the disparity in misinformation spread is based on age as well: the older you are, the bigger the chance of being a victim of misinformation. We are also going to be looking at generative AI, and I write here, with all the truth in my heart, that I have no clue what we are going to do about it. The reason I put it here is not that we need to use generative AI, but that we need to defend against it, which is, I think, a very different perspective. Of course, maybe we need to make peace with the devil and use the same tools to fight back at the same level, but the concerns are different, because we are already seeing, in multiple elections around the world, the usage of deepfakes and generative AI to manipulate public speech. This is going to be very difficult, but because we are losing the battle, we need to consider it. And that's it. The code is open source; it works specifically for an audience in Brazil, it has been tailored for that, but since the beginning we were concerned about serving multiple audiences, so it allows internationalization. About the stack:
we chose Node.js and React because of the ability to find more people to join the effort, but of course we are now considering whether it makes sense to keep the same stack or rewrite some of it, because if we want to be lean and optimize for some use cases, we might also consider performance and other things that the platform doesn't provide right now. And I forgot to mention something very important: everything that we do is integrated with Wikidata, and we have efforts to integrate with the whole public data infrastructure. The idea is that we only keep what is needed for fact-checking; more information about personalities and so on should live on Wikipedia, all of that should be included and integrated with the ecosystem. In the end, we encourage other people to build their own communities and be part of the movement: fork it, change it, toss it out, test the same things that we did. Because this is something very important, and it's going to change a lot in the next few years, I believe we should double down on the effort as much as we can. That's it; it was supposed to be a GIF, but it's a PDF. So thank you, thank you very much for your attention. Thank you, Matheus. Any questions? Hello, so I was looking at the website and I see the personalities and the declarations, and I see the reviewers, but then how do you define, or who puts in the new declarations for the fact-checkers to actually check?
Yeah, so one of the things we learned about the fact-checking procedure is that monitoring has a specific operational guideline. One example of what a fact-checker might do: they receive a piece of information, and they look into how it is spreading on Twitter/X, how it is spreading on Facebook, how it is spreading on WhatsApp. Is this a big effort, does it even make sense to check it? Because there is a thing called strategic silence: if someone checks a claim, that makes it more public, and it spreads more. So the decision on what to put in there is for the volunteers that have the capability to operate the platform. Right now, in order to operate the platform, you need to go through the training, understand how our code of conduct works, sign that you understand it and will vouch for it; once you understand the whole process, you are able to operate the platform and you are responsible for monitoring. These volunteers are the ones that select what is put in there, but we also receive suggestions from the community in Brazil, from people that follow us, and we take those into consideration when deciding whether to put something on the platform. Because it's a small group, it's going to be small data for now, but the idea is to streamline this process and grow, and monitoring will probably need to evolve based on that. Does that make sense? Thank you. It's the last one: from the UK, we have a few fact-checking organizations there; do you connect, or are you intending to connect, with similar organizations around the world? Yeah, when I talked about being part of GlobalFact 9, which is run by the International Fact-Checking Network, we learned a lot and met many of them. I think Full Fact is one of the biggest, and one that we got in touch with, and now we are in the process of joining this network. There are a lot of criteria;
if we make it, we are going to be the first open project that actually enters, so we are having some trouble matching some of the criteria they ask for. But we have some connections with India, we have some connections with the Latin American network as well, and recently we joined the network that covers only Brazil, so all the news outlets in Brazil that do fact-checking are part of it, and we are connecting with them as well. Yeah, we need that. Okay, thank you. Other questions? No? Okay, thank you very much. And that was the collaboration devroom for 2024, so thank you for staying until the end.
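The talk above mentioned that the review workflow is controlled by state machines: each fact-check moves through a fixed set of states, and only the transitions the workflow allows are possible. A minimal Python sketch of that idea; the state and action names here are hypothetical, not the platform's actual workflow:

```python
# Hypothetical states and transitions for a fact-check item.
# Each state maps allowed actions to the resulting state.
TRANSITIONS = {
    "submitted":    {"assign": "under_review"},
    "under_review": {"approve": "reviewed", "reject": "submitted"},
    "reviewed":     {"publish": "published"},
    "published":    {},  # terminal state: no further transitions
}

class FactCheck:
    """A single claim moving through the review workflow."""

    def __init__(self):
        self.state = "submitted"

    def trigger(self, action: str) -> str:
        """Apply an action; refuse anything the current state doesn't allow."""
        allowed = TRANSITIONS[self.state]
        if action not in allowed:
            raise ValueError(f"cannot {action!r} from state {self.state!r}")
        self.state = allowed[action]
        return self.state

fc = FactCheck()
fc.trigger("assign")
fc.trigger("approve")
print(fc.trigger("publish"))  # published
```

Keeping the transition table as data, rather than scattering `if` checks through the code, is what makes the workflow easy to adjust as the team learns, which matches what the talk described.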
How do you change the governance model of an established open source project?
All right, awesome. Thanks everyone, thanks for joining us today. So my name is Ruth Cheesley, and I'm going to be talking a bit about how we went through the process of changing the governance model in the Mautic project. If you haven't come across Mautic before, it's an open source marketing automation platform; we've been around for about 10 years. I'm not going to talk much about what Mautic does, but we've got a stand in building H, so if you want to come and chat, some of the community will be over there. So yeah, I'm the project lead for Mautic, and I'm also co-founder of the Women in Open Source community. You can connect with me on LinkedIn by zapping that QR code, but the slides should also be on the FOSDEM website afterwards, so if you need to check something, you can find all the links I mention there. So let's start off by talking about what we actually mean by governance. In open source, for me, governance can be something as simple as a few paragraphs on a piece of paper, or, in a bigger project, it can be a lot more complicated; but ultimately it's about how power structures operate within your project, how decisions are made, how interaction happens within the project, and ultimately about steering the human collaboration and software evolution in your project. So where did we come from as a project? Well, we were originally what I call a corporate-backed open source project, and what I mean by that is one company backing the project, with all of the governance built around that one company. We were founded in 2014, under GPLv3, and in 2016 the founder created a SaaS company providing the software to enterprises. In 2018 we had our first ever community-led release; it was the first time someone led a release who wasn't an employee of the SaaS company.
In 2019 the SaaS product was acquired by a company and rolled into their marketing suite, along with the brand, the trademark and everything to do with the project and community. And then in 2020, soon after that, we created a governance model to make clear what the company involvement actually was, what the community involvement was, and how we made decisions collaboratively. This is what that first model looked like. You can see at the top here: the pale blue roles must be held by a member of the company, the dark blue ones by a member of the community, and the gray ones here can be anyone, company or community. So there was quite a lot of corporate involvement in there, mainly because the company wanted to steer and support the project. This was developed in collaboration with the community, but very much designed by the company to make sure they still had a say in the project. The key decision-making structures we had here were, first, the company: the company owned the trademarks and gave the community the ability to use them, they employed the project lead, which was me at the time, and they chose the company representatives on the council. The project lead was hired by the company, and the job was to steer the project in the right direction, to organize the community, to remove any roadblocks, but also to be the bridge between the company and the community. And then we also had a community council, which I showed there: four people from the company, four people from the community, dealing with issues that cut across the project, things that weren't just to do with one particular team, that were slightly more complex or maybe needed a bit more thought before they were enacted. But for all intents and purposes, those community representatives were the team leads when we first started; we didn't have enough people active.
We just kind of said: if you show up, then you can be the team lead, really. So in April last year, the company informed us that they weren't actually able to support us at the same level as before, and so things needed to change, basically. Because of that, we needed to find a way forwards that didn't involve being backed by just one company. The first thing we needed to decide was: what is the fiscal structure going to look like for the project? How are we actually going to organize ourselves? How are we going to manage the governance? Things like that. The way we made this decision was, initially, going away and doing an awful lot of research: looking at what other open source projects are doing, which projects have changed their governance models over time, and how that worked out for them, and bringing it all together into some proposals that I would take to the council. At this point it was only me who knew what was happening with the company. Some of the options were joining a foundation or an umbrella organization that could support us. What was important here was that we could still be autonomous, that we still had the ability to decide how we did things, what tools we used, and so forth; there were pros and cons to that approach. Another option in front of us: we were at that point using Open Collective to manage finances, so if we ran an event, we had somewhere for the money to go, but we were only using it for finances. So there was the option of expanding what we used them for, to provide some of the services the company had given us, like holding trademarks, holding our assets, employing the project lead, providing legal support. And then there was also creating our own nonprofit organization.
Creating something ourselves, maybe a 501(c) in the US or a nonprofit CIC in the UK, that would deliver all of those things I just talked about for our open source project. Some of the resources I found useful in this process are up here. Governing Open is a really great starting point if you're having to think about governance: it has lots of resources and links that can get you going. There's also a really great one from the Python organization, PEP 8002, which examines the governance models of lots of different open source projects: how they've changed over time, what went well, what went wrong, what was difficult. That was a great source; they're not all the same kind of project as us, but they were encountering similar kinds of problems. And FOSS Governance: if you need any kind of document, whether it's a code of conduct, a privacy policy or a governance model, there are absolutely loads of awesome resources there, and you can also upload and share your own resources; it's a to-do for me to actually upload our new governance model there. Also, don't underestimate, if you're going through something like this, the power of the network. There were just so many people who took my calls when I said I needed to talk this through to get some ideas, who gave me good contacts, who pointed me towards specific things that would help in this process. So if any of you are those people I spoke to, thank you so much, because it really did help. Once I'd come up with those three options, and, as project lead, the pros and cons I saw for them, I shared it with our council and then later with our team leads: the council first, and then the team leads and assistant team leads as well.
So there were ten of us at this point tossing around the ideas of what we were going to do and what we thought was going to work for the project. The challenge, of course, with anything in open source is reaching a consensus. People had views on what was going to be best for now and what was going to be best for the long term. But ultimately we were able to come to a consensus together. And that consensus was that we wanted to actually become an independent open source project, to use Open Source Collective more, and to refactor our governance model accordingly. So that news was shared in April; you can read the independence blog post there. And actually it was one of those moments where you hit publish and you're not quite sure what the response is going to be, because you all believe in it, but you're really hoping everyone else is going to too. And it was a really positive response. So some of the things that we learned from this: language really matters. We're a massive international community, and we invited people who we trusted from our main communities to translate that important announcement, so that people in the local communities could understand what it actually meant, in their local language. And they really valued the fact that we'd taken the time to do that. So for major communications, that was really helpful. We also had a lot of people who either did not care at all, which I couldn't really understand, but some people don't care about governance, they just want to use your product; some people at the other end of the spectrum who really cared a lot and were extremely passionate; and then some people in the middle. So I guess I'd say the lesson learned is that you've got to be prepared for all of them, not just the positive, but also the negative criticism that comes with that. And also being available. So at this stage it was really helpful to have opportunities for that.
We had webinars with a translator for our Brazilian community and for our German-speaking community, where people could actually hear what the changes were and what they meant for them, and then they had the chance to ask questions. It was also really helpful to have open-door office hours where people could literally just drop into a Zoom call and talk with me or with the team leads directly about whatever they wanted to talk about. Okay, so one of the things we had to think about when we were actually creating this governance model, once we'd decided what the structure was going to look like, was: do we actually need a hierarchy at all? Someone in the community was saying, actually, I think we should have a completely flat organizational structure; I don't think we need to have leadership and councils and things like that. We did a lot of research on that. We couldn't actually find any larger open source projects that had that structure, and we didn't think it was going to be practical for us over the long term to not have some kind of hierarchical organizational structure. So we did investigate it, but we decided, yeah, we do still want to have structure. But we decided that some of the structure we already had was actually working all right. The teams and the working groups were working all right. The council was working okay, but it wasn't democratically elected; it was chosen. And so we wanted to change that so that it was actually chosen by the community. We also didn't have a step in between the council and the teams where the community got to discuss and debate changes, which would then go to the council to be introduced. So that's what we introduced with the general assembly, which is a gathering of members who can debate and decide, and then things go to the council to be enacted. So that was the structure that we came up with for the project.
But the next step was: if we vote in a council, how do we make sure they don't all disappear at the same time? Because we were going to be electing everyone at a single moment in time. And for this, we took inspiration from the Python Software Foundation. So we did an election, we had people voting, and then we ranked the candidates. The top three people got three-year terms, the next two people got two-year terms, and the next two people got one-year terms. That worked really well; the community found it really positive. We did have two people right on the border who got the same number of votes, so we just had a conversation: who wants to do three years, who wants to do two years? But that seemed like a really good way of making sure we have fresh blood coming into the council as well. And then, who actually manages the project lead? Because they were employed by the company, and now they're employed by the community. So who manages that? Ultimately, we decided that would be the council, so the project lead would be reporting into the council, basically. Some of the things we also had to think about: how do we make decisions? Because although we'd obviously made decisions before, it wasn't really explicitly clear how long we give for different types of decisions and what methods we use. This was also a subject we did lots of research on. We needed to find a way to do voting, to make the voting fair, and to make it a system where we could easily roll out a vote for anything, basically. So we ended up using an open source tool called Decidim, which we've implemented at community.mautic.org, which gives you a voting system. It also lets you run events and meetings with transparent notes using Etherpad. And that's actually worked really well. So that's the tooling that we implemented to do the practical voting. And then once you have voting, it's like, well, who gets to vote?
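The staggered-terms scheme just described (rank candidates by votes, then hand out three-, two- and one-year terms in order) can be sketched roughly like this. The candidate names and vote counts below are invented for illustration, and the function is a hypothetical helper, not anything the project actually runs:

```python
# Sketch of the staggered council terms described above: rank candidates by
# votes, then assign 3-, 2- and 1-year terms in order. All data is made up.

def assign_terms(results, term_plan=(3, 3, 3, 2, 2, 1, 1)):
    """results: list of (candidate, votes). Returns {candidate: term_years}."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    return {name: years for (name, _), years in zip(ranked, term_plan)}

election = [("Ana", 41), ("Ben", 55), ("Cho", 47), ("Dee", 30),
            ("Eli", 28), ("Fay", 22), ("Gus", 19)]
print(assign_terms(election))
# Ben, Cho and Ana get 3-year terms; Dee and Eli 2 years; Fay and Gus 1 year.
```

Note that a tie at a term boundary, like the one the speaker mentions, has to be resolved out of band, as they did with a conversation.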
And this again was quite a contentious subject. What we decided was that there would be different ways of being eligible to vote. One is financial: you put in some money and you get to vote, $100 a year, or you can use the Big Mac index, which proportionately reduces the amount based on the comparative cost of a Big Mac. You can Google it; it's by The Economist. We already use it in other places in the project and people find it helpful, so we just used the same system we were already using. There's contribution-based: approximately five hours a month, consistently over three months, and then you can apply to be a contribution-based member. There's corporate, where we have tiers from $1,200 a year up to $30,000 a year. And an honorary membership for people who've made extraordinary contributions to the project. So those are the membership types that we decided on. Once you've got the types and what have you, people then started saying, but I do more contribution than him and I want to have more say. So, here be dragons. This is a really difficult thing to get your head around; it can get very complex very quickly, and it can be exploited very easily. So we just decided: one member, one vote. Whether that's an individual human member or a corporate, they get one vote. And that works because they have one account on our community portal and they're one member in our membership list, and the membership list is who has the ability to vote. So that kind of simplified it. People wanted to get really complicated, but we had to start somewhere. And then, how are decisions made? For this one we decided, well, trivial decisions, we don't want to wrap red tape around them. If it's trivial, it's not going to impact many people and it's reversible, just make the decision: talk about it amongst yourselves, make the decision. If it's non-trivial, like how many tracks should we run at a conference or who should we invite as a speaker?
Or if there's a code situation where there are a few different options, but they don't have major impact whichever you take, and it can be reversed, then we say that's a 36-hour time box, taking into account holidays and things like that, but generally 36 hours. And if it's a significant decision, which impacts several teams or the whole project, or has financial impact, or isn't easy to reverse without significant consequences, it's at least a two-week time box. And those decisions happen on the portal, so that everybody who's on the portal sees things happening: they see the discussions, they can be involved in the decision-making process. And then ultimately we try to get to a point where we come to a consensus. We default to lazy consensus, so if nobody has given an opinion and the time box elapses, the decision is made. If they have, we try to find a way to bring their feedback in so everyone feels like they're on board, or they can at least disagree and commit, which is the best you can do. So how did we come to the final version of the governance model? Discussions happen very, very fast. We had a channel on Slack for the governance discussions; I could go in there in the morning and there'd be like 250 more messages in a thread, and you're just like, how on earth can I keep up with this? If you come in completely fresh, it's really hard. So we tried to summarise this in a Google doc, and each day someone would take it on to write up who had given what views and what the discussions were. So it made it easier for someone coming in to actually get an overview of where we were at. When we got to a point where there was a first draft, I posted it up on the forums. I explained that this is a first draft of a governance model; anyone else is welcome to submit another one, but this is the one that we've been working on.
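The decision rules described above (trivial decisions just get made, non-trivial ones get a 36-hour time box, significant ones at least two weeks, with lazy consensus as the default) could be sketched like this. The function, labels, and return strings are all invented for illustration; they are not part of any actual Mautic tooling:

```python
# Illustrative sketch of the decision time boxes and lazy consensus described
# above. Class names and the outcome() API are assumptions for illustration.

from datetime import timedelta

TIME_BOXES = {
    "trivial": timedelta(0),             # low impact, reversible: just decide
    "non-trivial": timedelta(hours=36),  # e.g. how many conference tracks
    "significant": timedelta(weeks=2),   # cross-team, financial, hard to undo
}

def outcome(kind, objections, elapsed):
    box = TIME_BOXES[kind]
    if elapsed < box:
        return "open"                        # still inside the time box
    if not objections:
        return "passed (lazy consensus)"     # nobody objected before it closed
    return "needs discussion"                # fold feedback in, seek consensus

print(outcome("non-trivial", [], timedelta(hours=40)))
# prints "passed (lazy consensus)"
```

The useful property is that every decision has a known end date, which is exactly the "long live the time box" lesson the speaker draws later.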
And the important bit is, as you can see down here, we chunked each section of the governance model, which was quite lengthy, into separate forum topics. So you could go and discuss the membership section, or you could go and discuss the council section, and provide your feedback there. And then based on the feedback and suggestions, we could update that initial thread and people could see where we were at. And then we collated all of that back. So this was time boxed. We did actually have to extend it by two weeks, because people said there was too much to discuss and too many decisions to make in two weeks, so we extended it to four weeks. The positive thing about having it on the forums is that our community are predominantly marketers, so on Slack they won't be following it. But when they go to say, my Mautic instance is broken, or I can't send this email, they're going to the forums. So they're coming past this post in the forums, and we actually got more people involved who wouldn't normally be involved in these discussions. Then we posted the final version for basically two weeks for people to review the whole thing, and if there were still things they were worried about, they could respond to that thread. And I highlighted all the bits that had changed from the first draft and why they had changed. Some had changed from the forum, some had changed from a panel that we did at our conference, but it was easy for people to check. So at this stage: long live the time box. I think it was Angie "webchick" Byron who told me, time box everything, when I first started as community manager. And that's so true: give people a fixed window and say, we will make a decision at the end of this time box. Delegating the research as well. If somebody's really interested in something, ask them to go and research it and bring it back, and then you haven't got to do it yourself.
So we've had some people who are super passionate about decision making, and they went and did all of the research on that. I am the worst person for complicating things, so: keep it simple. With governance, it can easily get really complicated, but we kept on asking, what's the core of what we're trying to achieve with this? And how can we get rid of some of the fluff that doesn't need to be there? And also this one: go to where people are. In as many places as you can, talk about this governance stuff that you're trying to do: social media, sending emails, talking at conferences, talking in person. We actually had some code of conduct infringements during this, because people got so emotive about something that they really believed in. Caring deeply doesn't mean you don't have to obey the code of conduct. And I think modelling the behaviour you want to see is really important. So when someone was disagreeing with something, one of the most useful things I learned to say was: you know what, I'm about six out of ten that we keep this, because x, y, z. Or: I'm two out of ten on this; I think it's kind of nice, but I'm not too worried. And then people have the language to understand and communicate how passionately they feel about this thing and why, so you can get into dialogue. And yeah, draft early, iterate often, be ready to chuck it in the bin, but get something on paper, because otherwise it just turns into this big nebulous discussion that never actually becomes anything. And that can be very frustrating. So where are we at now? It's been a longer process than I would have hoped for, mainly because of the community engagement. It takes time to get people to engage, to get people to give you thoughts, and then to go through that process. But actually we've done all right. We published the final draft at the end of July. We launched our membership model, where people could become a member, in August.
In October, the community portal came out of beta; it had been in beta for about a month, with a couple of teams using it. And then in December we had our extraordinary general meeting, where we inaugurated the council who had been voted in through the nominations process, and we formally adopted the new governance model. So far we've had about 150-ish people join the portal. We've had 44 financially contributing members (actually it's more like 48 now) and 14 practically contributing members who have joined through the practical contribution route. We've also got people who've paid even though they're eligible practically; if they want to pay too, great. We had the voting on the portal, which was really successful. And we also run all of our meetings on the portal: team meetings, working group meetings, everything. People can join on the portal, they get the link, the notes are taken there, so people can see the notes from the meeting when it finishes. And it's been really good, actually; it's really become a central place for all things community. So going forward for us as an open source project, what's next is financial stability. This is the biggest thing we're working on right now, because we don't have the backing of a big corporate anymore; we need to do this all ourselves. So we're exploring lots of different revenue streams: membership, but also a trial system where people can try the software for two weeks, and if they wish to continue, they go into a contract with a provider, but we get a 40% revenue share for the first year and then 30% for the second and so forth; it decays down. So we're trying to be creative in exploring ways that we can offer value and also bring in money. We're very much focusing on product adoption. Our adoption curve looks like this, which is great to see, but we need to keep it going. It is a competitive sector in the proprietary world.
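The decaying revenue share just mentioned (40% of contract revenue in year one, 30% in year two, "and so forth, it decays down") might look like this. The talk only states the first two years, so the 10-point step and the zero floor below are assumptions for illustration:

```python
# Rough sketch of the decaying revenue share described above. Only the 40%
# and 30% figures come from the talk; the step size and floor are assumed.

def revenue_share(year, start=0.40, step=0.10, floor=0.0):
    """Project's share of a provider contract's revenue in a given year."""
    return max(round(start - step * (year - 1), 2), floor)

for year in range(1, 6):
    print(year, revenue_share(year))
# year 1 -> 0.4, year 2 -> 0.3, then 0.2, 0.1, 0.0
```

The design point is that the project captures most of the value it creates early (when it sourced the customer) while the provider keeps more as the relationship matures.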
There's not much competition in the open source world, but we're still moving forwards. And also, on the product development process: we're 10 years in, but we're dealing with an immense amount of technical debt. So it's also about making the product more stable and introducing many more features. And then finally, what we're really trying to move towards is being transparent by default. We do do that quite well, and we have done since 2019, but basically every time a leadership role expires, it's voted on through the portal. Every time we have to make a decision, let's take that debate to the portal instead of having it in Slack, on GitHub, wherever; have it on the portal and then it's centralized. And also making use of voting: any time we need to actually have a vote on something, we now have a system we can do it through. So that's me done. I think I'm just in time. Hooray! Yeah. Thank you. We have a stand, as I said, in the H block, so if you want to know anything about Mautic, come and chat. Questions? Any questions? I'll come back up. Oh, Lord. You're going to get your steps in today. So, thank you for your talk. I would like to ask you: how do you manage liability, legally? And who decides the salaries, like the salary levels and stuff like that? So, one of the biggest expenses we've had in this whole process is legal. Open Source Collective, who are our fiscal host, have legal experts who are specialists in open source, so we used their services to get the right contracts for transferring the trademarks and all of that stuff. And they also review all of the contracts that we sign, because they have to be signed by the fiscal host, not by us. In terms of, what was the other question? Salaries. Salaries, okay. Yeah, how do we deal with salaries? A thorny subject, because I'm paid by the community and I set the salary.
And I did that like three years ago, not knowing this was going to happen. What we did at that point was lots of research into what open source projects paid as an hourly rate, and we also compared that with what we could actually afford to pay. It was when we were migrating from Symfony 4 to 5; we had big projects that needed a contractor because we couldn't find people in the community. And we just set an hourly rate, and it's very low compared to big projects: it's $40 an hour, but that's what everyone gets paid in the project. We want to use a sliding scale at some point; there's a proposal being put to the council soon to investigate that. But with that comes a warning, because I live in the UK and that would probably end up costing the project a lot more. So do we really want to do that? But that's how we've done it. So yeah, anyone else? Hello, thank you for your presentation. I was wondering, what is the emotional impact of going through a process like that? And do you have any tips or tricks for how to navigate it? The emotional impact? Yeah, because I'm guessing you will have had some difficult talks. Yeah. Because I think you care about having fair governance. I think you need to have your own house in order if you're going through this kind of thing, in the sense that you need to know yourself well, because it does get emotive, especially if you are the founder or if you are deeply involved. In terms of dealing with other people, in dialogue with other people, a lot of it is that people are very passionate. So it's trying to understand what it is that they're getting emotional about and why they're passionate about it, and how we can find a way for that, if it's not constructive, to come into some constructive form, taking the bits that are really helpful. But yeah, just trying to be mindful of your own stuff and not projecting it onto other people when they come to you with ideas you don't agree with.
I don't know if that helps; it's kind of a non-answer, but yeah, sorry. We've got time for one more question. I'm fascinated by the voting system that you have. Projects have problems with people coming in and leaving quickly. You said one person, one vote; how do you make sure they stick around? Can you speak more to the voting process? Because projects always have a problem with that type of system. Yeah, so part of what we've done is that for voting you need to be a member, and that is linked to a benefit to the project, because you need to either pay or contribute, so the project is benefiting. Do you mean in terms of how you get people to care enough to vote? Yeah, I mean, people put money into it, but sometimes they say they don't really care, and you need their voice to say yes or no. Yeah, so people have said they've joined but then they don't really care enough to vote. A lot of it is to do with one-to-one engagement, or not one-to-one but one-to-small-group engagement, making sure that people are aware why it's important to vote on that thing, and you've got to accept that some people won't care. But I think it's about using that emotive language and trying to explain: this is your opportunity to have a say in this thing. We actually had probably 20 people become individual members because they wanted to vote for their favourite candidates in our election, for example. Another thing we're going to do is a template contest, where you have to be a member to upload a template, for example. So we're trying to do things like that to get them into the system and understanding how it works, so it's very easy to use. So thank you very much; we really appreciate that. And if you've got any further questions, and I know you gentlemen do, they'll have to be outside in the hallway afterwards, as we turn over the room. So thank you very much.
Please Make It Make Sense: Product Management Methods to Make Your Project's Purpose Clear
I think, are we good? Okay, so next up we have some product management content from Loria. Hi. Yeah, so I've been a little bit of an AV disaster, so I'm going to have to look at my slides because I can't see them here. But here's the title of my talk today. My goal is to help you get more structure around your open source projects, hopefully save time, and ideally do less. Okay, so about me: I'm an American living in Germany since 2015, and I mention this because I came to Germany with a very live-to-work mindset and now I have a very work-to-live mindset. And you're going to see that mindset shift in my talk, in the messaging I share with you. Among my many open source activities has been contributing to Kubernetes, particularly SIG Release, and also, more recently, the OpenSSF Security Scorecard project. I have this link here, which I thought I'd highlight because you can find a lot of management and leadership guidance there. It's a collection of resources: blog posts, videos, templates, things like this, including some things I'll show you today. I've worked in lots of places. I'm not working now; my company shut down at the turn of the year. So if you like what I have to say and think I could be helpful to your organization, let's talk, and there's my LinkedIn in the meantime. I'll cover basically two branches in this talk. First is some observations from my time in open source; I'll sprinkle some helpful hints and examples along the way. And then I'll focus on some tried-and-true traditional product management methods that work in a company setting. You've probably encountered them in your day jobs, but they also work in open source with a little bit of creativity. So, some of those observations. I see contributors taking on so much work: just lots of issues, many times even multiple leadership roles, and it just seems like a surefire way to burn them out. Because they're so overstretched, they don't have a lot of time to do research and gather data.
Also, that's a skill set that not everybody has, and not everybody needs to have. But the end result is often that a lot of development is based on assumptions instead of data. Another thing I've noticed is that what exists today in a project isn't well defined or documented or mutually understood by the project team. This represents a pitfall, because you maybe don't have a shared understanding of what your project is and does and should be. And lastly, there's oftentimes a vague strategy, or even none at all. I would say that the most acute manifestation of this issue is that the boundary between what goes in a project and what stays out is often lacking. This can lead to a lot of work being done, and that work just keeps expanding. So if you take away anything from me today, it would be this message: I really encourage and invite you to do less if you can. I know your manager may not want you to do less; there are always very specific conditions around that relationship, speaking from experience. So I'm happy to talk to any of you after the talk if you'd like a sounding board on ways you can manage your manager's expectations around what you can do in open source with your limited time and availability. But if you are the pressure source, telling yourself to do all of the things, then I invite you to first ask yourself: does anybody even want this? I mean, maybe they do, but if you're the only person, or you don't have a very clear sense of how many people might find value in your project, maybe stop and collect more data before you move on. Also, keep your personal backlog light. I know some people really enjoy working with big backlogs, but they take on so much work that they end up becoming the blocker for other people to make progress. And you don't really want to do that, right? You don't want to impede your fellow project contributors' efforts because you're the decision maker on 10 different things.
So that leads to delegating. Delegating not just to reduce your workload, but also to empower others to gain the skills that you have. And I know that's rather time consuming, but oftentimes what I've seen in open source is that a little bit of upfront onboarding and knowledge exchange saves everybody time in the later stages, because you have multiple people who can work on something at once. And the last tip is something I've used over the years, because I would just take on work too. I love it, like, let's be busy. And then I would find that the work I took on actually involved a lot more than I bargained for. So I highly encourage you to unpack a task before you say yes to doing it, because you may find that it's going to take you a significant amount of time. Here's an example of that. This is a project board that I created with collaborators from SIG Release in Kubernetes. The initial idea was to rewrite a tool from scratch. And when I heard that, I thought, you know, we may not want to do that, because that sounds really, really intensive. So what we did, over a couple of sessions, was figure out what we really didn't know about this particular tool that we were talking about rewriting. And what we had was a lot of questions: what is it, what does it do, what do users want? You may not see all this text, but just the TLDR for you: there are a lot of spikes on decision making and documentation, like proposals to write to get community feedback, before even setting out to write code. So this is what I mentioned earlier, the assumptions that we often take into our development plans. We had a lot of assumptions that we just had to rewrite this tool because it was just too broken and, you know, we'd just do it over. That's often not the case. And I just want to point out that I didn't come up with the idea of assumption-driven development.
I found a term that someone else created, and in my search to find out exactly who, I came upon this blog post, which I found really interesting. It's a developer who basically described his own failure trajectory, because he was operating with assumption-driven development. What he did was decide to just take on a lot of work on his own. He didn't talk to anybody around him. He also didn't understand what he was working with, like the tooling, all of the different tooling relationships, and the knock-on effects of making changes. And he kind of went in like, I'm going to do this in, say, a day, and it's going to be done. And that also didn't turn out to be true; there was a lot more work involved than he had expected and planned for. So I thought it was a really great summary, from the developer's perspective, of why assumption-driven development is often not the best method to use. I'm going to keep giving the talk, and you can ask questions after. Thanks. So basically, what I'm suggesting here as a way to conquer assumptions is oftentimes just listening to your environment. And that starts with the people around you. There's this thing called active listening, and I found a nice resource from the Center for Creative Leadership, which gives you some behaviors that you can adapt, or adopt rather, to start listening more actively to your colleagues or co-collaborators and others you work with. They say, first of all, pay attention. We take this as a given, but in our world of smartphones and lots of distractions and multitasking, we often don't really fully pay attention to each other. And one way we fail at this is that we sometimes can't wait for the person to finish what they're saying before we go, oh, I want to get my point out. And then we end up missing the latter half of the sentence, because we're too focused on our own sentence and what we want to say.
So active listening means that you don't do that. You actually let somebody finish, and then you ask. You can also do things like clarify what the person is telling you by asking them questions: I think I heard you say this, is that correct? Or, can you tell me more about what you're trying to say? And then together it starts to become a collaboration, because you're inviting them to also clarify their ideas for themselves. And you're getting higher quality information, because you're taking it in, and you're also engaging with it in a team context to work out new ideas. In addition to listening to your colleagues and the people around you, you should also listen to your code. I mentioned a few slides ago this idea to rewrite a tool from scratch. But if you don't really listen to your own code from the beginning, you may end up doing a lot of work that you could have avoided by just optimizing and selectively choosing what to work on. Having artifacts like docs and diagrams will help you to better reason about the work you truly should do, optimize, find the points where you can make things better, and plan accordingly. So here's another example from SIG Release where we applied this principle. We had this tool, right? And we were going to rewrite it. But I said, first of all, let's actually document the flow that the user follows to use this tool: achieve a job, go from point A to point B. And so an engineer in SIG Release did this, and then we gathered as a group around his workflow and talked through every step, figuring out what was really hard, what was taking a lot of time, what wasn't working. And as you can see from the results, the first line there is the overall flow. And then I blew up this section toward the end, where you see a lot of anger, and then there's this little clock, which means it was really time consuming.
And you could then see, in the full landscape of this project's flow, where the pain points truly were. We were also able to use these post-its to document exactly where the code that was executing these steps lived. And what we walked away with was a much more focused plan for what we needed to do. We could start there, and then decide, after collecting a lot of information about these weaker points, what we should do next. Maybe we rewrite parts of this instead of the whole thing. When you have a workflow like that in place, it really puts you in better control of your project. Now, if you have no project yet, that's fine too. What we're going to cover next are some tools that you can apply as you start working on a new project, but you can also introduce these even if you have something that's several years old. It doesn't matter; it's never too late to understand your work and then organize yourself to do the highest value work in the future. So I'm going to cover: having a strategy, with a doc template; doing user research and surveys, including an example of a survey, which is the NPS; making a roadmap, with a template you can use; and then prioritizing and refining your backlog, with some methods and tools you can apply for those activities. So here's a strategy doc template that I just worked through with the Security Scorecard team to actually fill out. And I know these little lines here are small and you can't see them; I'll get to that on the next slide. But it basically uses the concept of the five Ws that journalists typically use to write a news story, where the reader needs to know the facts of the story right away, and then if the reader wants more detailed information they can read on. So it answers who, what, when, where, and why, as well as how.
The goal here is that you have an asynchronous tool that you can use so you don't have to have a meeting around this, although I advise it, because you'll find that more information comes out when you actually discuss your strategy. But you can at least start with a template like this, and people can then contribute their comments and ideas to it. This is Miro, by the way. When you actually have this template filled out and you've gone through it with your team, then you can dump it into a doc, refine it a bit more, and then publish it in your repository for the public to look at. And then of course you can continuously revise as your project develops and you discover new information. So those small questions in that template are here, basically. Not all of them, but some key questions that are quite useful for getting a sense of where you're going with your work. So who are the users, as well as the contributors and the maintainers? But really, who are the users? Who are the people deriving value from your project today? And who do you want to derive value in the future? Like, who should derive value in the future? What does your project do today? On the flip side, what does it not do? I mentioned earlier that boundary about what goes in a project and what stays out. When you can clearly explain what a project is, what it is not, and what it shouldn't be, then you can get a clearer sense of where that boundary lies. You can also think here about what the UX is like and what quality concerns and constraints you have. It's really just, what is your project, essentially? When is your project useful? So what are the conditions that trigger a user actively coming to you so you can solve their problem? Another way to look at when is, how long does a particular stage of your project's workflow take to be completed? Where does your project fit in the ecosystem? So I'm not going to go over the ins and outs of doing a competitor analysis here.
There's lots of templates online that you can look at to do one. But I highly recommend it, because when you take a look at other projects in the space that are solving a similar problem, you can then assess the resources behind those projects. Maybe there are even products, so maybe there's a company doing what you want to do. So they have a lot of money and they can work quickly, and then you can consider what you actually have in your time budget to pursue. You can also see what those projects' and products' strengths and weaknesses are, and then use that information to distinguish and differentiate what you want to provide. Maybe it's a niche that you want to really get a handle on and provide a really clear, good solution for that no one else is providing. Maybe it's just that your project is community-based and the other projects and products out there are for money, and so you're going to be able to serve the community whereas those alternatives will not. So thinking about where your project fits in that landscape is really quite helpful. That leads into why your project exists in the first place. What value does it deliver? That puts you in the seat of the user who is actually trying to use your project and solve the problems they face. Another question I like to ask around why is the cost of delay. So if we don't develop this project now, or if we don't iterate on it and provide these features or functionality, what bad things happen? What bad things happen to our goals? What bad things happen for users who continue facing this problem without any solution? What happens to innovation in general? There's really a lot of interesting conversations you can have around cost of delay. Then finally, how does it work now? This question is also a really nice hook for you to think about the future and where you want to be in 12 or 24 months with it. How do you want to build this to provide different features?
Maybe redesign the architecture to be simpler. How do you want it to be? "How" is a good frame for that. As I pointed out earlier, we're going to cover some more tools and methods. The next one is user research and surveys. Having as much data as you possibly can really pulls you out of your own biases, and out of what the developer in the assumption-driven development blog post was describing: I only listened to me, and it didn't work. If you're listening to your prospective users, your current users, other project leaders, you start to get all these different perspectives that can ultimately help you develop the right, most valuable thing, and not develop a lot of other things that are going to take up a lot of effort but maybe won't have such a payoff for you or for anyone else. Surveys should be kept quick and easy. I tend to use Google Forms. I mean, I know it's not open source, but it works. I don't ask people to write a lot, because you don't want to read it all. You probably don't have time to read lots and lots of survey responses. The survey respondents also probably don't have a lot of time to fill out lots of forms. Using checkboxes, multiple choice, rating options from zero to five or whatever you want to set as your endpoints, you have numeric data that you can quickly turn into charts like this one, which was from a Google survey, and it's just easy to make a chart out of the results. Another thing I like to remind people of is: please abide by GDPR. Be careful about how you're collecting the data of the people who are filling out your survey. Make sure you offer them a chance to give consent for the usage of their data before they move on. Another great way to collect user data is through discussions. Like on GitHub, you can post a question and see people respond to it. That can be a little more time consuming, because you're going to have to read through all of those answers.
But it can be quite useful too, because you get broader context. If you're in a hurry and you just say, hey, community, I want to know if you want us to do this thing or not, you can send out an issue and have them give it a plus one or not. You can use emoji reactions, like thumbs-up votes. There's other tools out there that product managers use all the time, like Aha!, that offer this kind of voting functionality for feature ideas. And finally, interviews, which really can be quite time consuming. But if you have the time to do them, even just a few, you can learn so much about your own project. You can sit and watch somebody try to use it and see where they get stuck, see what's confusing to them, and collect all of that data and think of ways to optimize and improve. Oh, and one really important point about the results: a lot of times when people fill out surveys, it's numbers, so it seems all scientific. But it often isn't, because our users may be giving their feedback from a limited set of data points themselves, because they may not be aware of all the alternatives, all the directions that your project can take. They may not have a full understanding of the functionality, because they don't have time, or maybe you didn't explain it well. So always be aware that when somebody tells you what they want, they may not actually want that thing. That may be the best guess they have that would solve their problem, but actually, in the broader context of other types of users, it wouldn't solve the problem in the best way. So just keep in mind that data can also be a little bit of a trap if not used carefully. I want to give this example of a survey that you can run very quickly. If you don't have time to set up a form yourself with lots of questions, you can still do an NPS survey. This is used by lots of companies, but it's quite useful in our context because it just consists of two questions.
Basically: would you recommend my project, in this case, to a friend or colleague? And then: can you please explain why you gave that score? So the numbers are very easy. You just have to put them in some kind of NPS calculator, so I gave you a link to one. It's also the image source. You basically put in all that data and then you come up with your NPS. And then there's different analyses online for what is a good score; usually it's 20. When you're 50 to 80, you're doing really well. So that's from the way that the score is calculated. It's a pretty low overhead way to collect feedback: are we on the right track or not? The next type of tool I want to show you is explained with this roadmap template, which you can adapt to your own needs if you'd like. It covers some of the who, what, when, where, why questions that I covered with the strategy doc template. But the roadmap is more short term. What would you like to do in your next, say, three to six months? It's taking a slice of your strategy and getting you more focused around what you want to develop now. My strong recommendation is to keep it to a page or less so that people can actually remember it. Keep the number of deliverables and goals low, like one to three max, using a metric to justify why each is necessary. If you don't have a metric, like a baseline to say we're doing this deliverable because X number of users want it, then you can also think about the metric that you want to apply to be able to measure the success of your feature. I always like to include risks, what is known and what is unknown, in a roadmap, just so that with the unknowns you can plan that they might take away time from future development. So it might be a bit of a distraction, but you at least are aware of it and you're going to work it out in the future as you go. And then technical goals. And this is to make sure that quality, observability, and testing don't fall by the wayside.
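Coming back to the NPS survey for a moment: the score itself is simple arithmetic. The talk doesn't spell out the formula, so the sketch below just follows the standard NPS convention (a 0 to 10 scale, where 9 and 10 count as promoters and 0 through 6 as detractors, and NPS is the percentage of promoters minus the percentage of detractors):

```python
def nps(scores):
    """Compute a Net Promoter Score from 0-10 survey responses.

    Standard NPS buckets: 9-10 are promoters, 7-8 are passives,
    0-6 are detractors. The score is %promoters - %detractors,
    so it ranges from -100 to +100.
    """
    if not scores:
        raise ValueError("need at least one response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores))

# Example: ten survey responses -> 6 promoters, 2 detractors -> NPS 40
print(nps([10, 9, 9, 8, 7, 10, 6, 5, 9, 10]))
```

On this scale, the rule of thumb from the talk applies directly: around 20 is decent, and 50 to 80 means you're doing really well.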
I see this happening in a lot of projects, and products as well, where all the stuff that actually makes the thing run gets pushed to the end, and then the engineering team is stuck with a very patchy, problematic system that they really want to fix, but nobody has a lot of time for them to do so. The last couple of slides are just covering prioritization. So this is a matrix that I like to use because it allows teams to take a stack of issues and then plot them on this matrix. The matrix asks them to assess tasks and ideas based on the amount of effort, along with the value that they expect to provide for the user once they do the thing. And this allows the team to see, if they have a lot of things that are high value but also high effort, then they either need to focus on one of those, because they're not going to do 10 high impact, high effort items at once, or break them down into smaller bits so that they can then go into the "do it now" quadrant, which is really where your quick wins and your low-hanging fruit should go. It's really important to plan for those quick wins early on, so that you can collect momentum and the team doesn't feel like they're in some long slog where they're never going to see the results of their work. If you have a quick turnaround for the impact provided, that's nice, because they can celebrate those wins early and keep going. There's also this nice box, my favorite box, the "don't do it" box, because that's where you just close the issue and forget. Here's this matrix in action. This is also from Security Scorecard recently. We haven't done this exercise yet, but I'm really hoping we do it soon. This is basically all the bugs in the backlog, put into specific buckets. Some of them weren't bugs, so that was really just categorizing what's a bug and what isn't.
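The effort/value matrix described above can be sketched as a tiny classifier. The quadrant labels and the 1-to-5 rating threshold here are illustrative assumptions, not something prescribed by the slides:

```python
# A minimal sketch of the 2x2 effort/value prioritization matrix.
# Ratings are assumed to be 1-5; "threshold" splits low from high.

def quadrant(value, effort, threshold=3):
    """Place a task on the effort/value matrix given its ratings."""
    high_value = value >= threshold
    high_effort = effort >= threshold
    if high_value and not high_effort:
        return "do it now"      # quick wins / low-hanging fruit
    if high_value and high_effort:
        return "plan it"        # big bets: pick one, or break it down
    if not high_value and not high_effort:
        return "maybe later"    # cheap but low-value fill-in work
    return "don't do it"        # close the issue and forget

print(quadrant(value=5, effort=1))  # do it now
print(quadrant(value=5, effort=4))  # plan it
print(quadrant(value=1, effort=5))  # don't do it
```

Breaking a "plan it" item into smaller pieces is, in these terms, just lowering its effort rating until it lands in "do it now".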
Then the goal here is the team will plot the bugs on this graph, and then we might find out that some of the bugs were already solved, maybe some of them aren't relevant now. But it's really to kick stuff out of the backlog and then just focus on what is really important, what's really valuable, what are people really being hurt by right now that we should fix right away. That's basically the steps for how you would apply such a matrix. I also encourage using a scoring model. There's a lot of different scoring models you can find on Google or Ecosia (my favorite search engine, personally, is Ecosia). You can go in there and see what scoring models can do to help you assess things like reach, impact, excitement, and effort, and have a weighted scoring option so you can stack-rank your backlog items and then do the top items first, because you've decided through data and analysis that they're the most valuable ones. This is another template for your strategy. I just found this on Miro. It's by Lou Coleman, and basically, if you're rolling out an MVP for a new project for the first time, your center of focus is obviously the tree trunk, so making the purpose of that really strong and solid, and then over time you have more time to build on your tree trunk. This format allows you to plot your plans on different bands. So maybe the future band might be something that's high impact and high effort, but it's just going to take a lot of time, so you don't project that you're going to have it done right away. I just thought it was a nice visual. I like trees too. The last slide is probably something that's very familiar to you. It's a standard kanban project board, but this really helps with asynchronous collaboration, because if you're running your board really well, you'll only have high value work in it, and then your contributors don't have to have a meeting to figure out what to do.
They just pull off from the board, knowing that you've clearly vetted your work through the tools that I've shown you, so they know that what they're going to deliver is ready to go and it's going to make a difference. In my experience, people are really motivated by purpose. They don't want to just do something for busy work. They actually want to know they're making a change. So with your really nicely refined backlog, you can help your contributors along by giving them valuable work to do. I suggest making a triage working group, or having some mechanism in your team, but just make sure that issues are triaged regularly so they don't pile up. That's a really good way to get non-code contributors involved as well: making valuable, high purpose work. Hopefully I have helped clear your path and helped you clarify your purpose. This is a nice trail in Amsterdam. It's quiet and friendly and inviting, so hopefully your open source development can achieve some similar aesthetics. And that's it, and those are the links to the resources that I shared earlier. Now, a question. So this goes back to the assumption-driven development that made me wonder, especially since you pointed out to scope the work first so you know what you're getting yourself into. But if I do that, if I had done that, then I would have never started any effort at any time, because I would have been too intimidated had I known what I would have gotten myself into. So what do I do to still get stuff done? I think it depends on a number of factors. If you have a lot of time to build something out and really focus on it...
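The weighted scoring idea mentioned earlier for stack-ranking a backlog can be sketched like this. The factor names follow what was said in the talk (reach, impact, excitement, effort), but the weights, the 1-to-5 scales, and the example items are purely illustrative assumptions:

```python
# A rough sketch of a weighted scoring model for stack-ranking backlog
# items. Weights express how much the team cares about each factor;
# effort divides the weighted benefit, so cheap high-value items rise
# to the top. All numbers here are made up for illustration.

WEIGHTS = {"reach": 2.0, "impact": 3.0, "excitement": 1.0}

def score(item):
    """Weighted benefit divided by effort (RICE-style ratio)."""
    benefit = sum(WEIGHTS[f] * item[f] for f in WEIGHTS)
    return benefit / item["effort"]

backlog = [
    {"name": "fix flaky CI",      "reach": 5, "impact": 4, "excitement": 2, "effort": 2},
    {"name": "rewrite tool",      "reach": 4, "impact": 5, "excitement": 4, "effort": 8},
    {"name": "improve docs flow", "reach": 3, "impact": 3, "excitement": 1, "effort": 1},
]

# Stack-rank: highest score first, so the team does the top items first.
for item in sorted(backlog, key=score, reverse=True):
    print(f"{item['name']}: {score(item):.1f}")
```

Note how the big rewrite, despite having the highest impact rating, sinks to the bottom once its effort is priced in, which is exactly the "break it down or don't do it all at once" point from the matrix discussion.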
Kickstarting an Open Source Culture: A Guide for Mentors
Welcome folks, we're on the final leg of FOSDEM, so hopefully we keep you awake. So I'm delighted to be chatting here today with a good friend of mine, Phil, around something that we're both quite passionate about. I suppose it comes from our experiences out in open source, you know, when we first got involved and how we got through it as we went along, and then, you know, just working with the community, collaborating with folks, and then realizing how we can bring that into our companies and bring that culture in to help things get done better, for want of a better word. Okay, so I'm Martin and I'm a developer over at IBM, and for the last eight to ten years I've been in the cloud native space, and I've lately started getting involved in AI as well. Yeah, and I'm Phil, also a long time, we're both old guys now, a long time working as software developers, but, you know, done a lot in the, again, cloud native space. Even though Martin and I worked at IBM together, open source is really where we connected. Now I'm at AWS, still focused on open source, and we thought we'd start with really just kind of how we got our start. Again, we started our careers a very long time ago; we were not involved in open source for a good long part of our careers doing software development. For me, it was around 2014 that I had joined a new group in IBM that was focused more on open source, doing some work in OpenStack and Cloud Foundry, and this new thing called Docker came out, and I was asked to go check out this new technology, see if we could get involved. I became a contributor, and in essence I got hooked. I loved open source. I'd been a long time Linux user, but this was really my first experience contributing, making pull requests, reviewing code, helping others in the community, and that's led to the last 10 years of working in the OCI and CNCF and the containerd project, where I'm a maintainer as well.
So yeah, similar to Phil, I was working on a cloud orchestration product built on top of OpenStack around 2013, so we were downstream, we were building on it, and then I got an opportunity to say, okay, can we extend Horizon, which was the dashboard at the time for OpenStack (it probably still is). And I remember the first conference I went to was over in Atlanta. I think someone fell off the bus, to be honest, because my manager came to me on a Friday evening of a long weekend to say, do you want to go to Atlanta to a conference the following weekend, do you want to head over? So I went, and I was just blown away, and I think it was the whole collaboration of folks and all that. And then as I went into the community, I started contributing into Neutron, which was networking, and if you've ever worked with networking folks, they really are into the black arts, and they really take networking seriously. And I always felt like I was going to get found out here, because I don't care about IPv4 or IPv6, but it was a great experience, and they really made me feel welcome. I remember there were meetups at the time over in Rochester, Minnesota, and on the work I'd done, a lot of the maintainers came up to me and said, we really liked what you've done, we really liked the fact that you took it on the chin when responses came back to you, you didn't get upset, you just moved on, you made the changes and went again. And then fast forward a few years, on to getting involved in Kubernetes and in the Helm community, and being really welcomed in there, being part of the Helm 3 release going out, and then becoming a maintainer in the community, and getting to actually talk at San Diego, which was a fabulous experience. So yeah, it's been great, you know. This is where his fancy clicker does the work. Yeah, the clicker's being unresponsive, sorry. Use the buttons.
So why do companies need to cultivate a culture of open source? And I suppose the key one here, and it gets lost a lot of the time, I know we bang on about it in the community the whole time, is that nearly every company today producing software is probably built on top of an open source stack. So if you're consuming it, you're really involved in communities, sorry, you're involved in using communities, but you need to look at how you feed back into the community as well, because if you're using the stacks, and something goes wrong in the stack and you haven't been helping in those communities, then you don't really have a leg to stand on. The other part of that is, when you're building on these stacks, you're building on the shoulders of people who put in hundreds and hundreds of hours. So you're getting a real lot of value here to build your product on top of, where you can concentrate on your product, where you can drive it forward, and you may not have all the people that the community has to do the good work for you like you're getting from the community. As you can see up here, there is so much open source out there, and that's coming from Linux over the years, definitely from the Linux community, because prior to that, in the 90s, it was a bit more niche, the amount of people that were involved in open source. But definitely the Linux Foundation community, up along through OpenStack, has really opened the door for people to contribute into communities, and it's created a real momentum and shift. And if anything came out of Log4j, it was that we realized that open source software is in every product out there, and we need to be aware of that. The final one then, and this is very important for your customers, is most customers don't want vendor lock-in anymore, and OpenTelemetry is a great example of that.
It took a long time to come up with; there have been multiple standards in the telemetry space, but OpenTelemetry has been probably the fourth standard, where the different vendors have bought in and decided to work together. And a lot of clients know they want to be able to write their telemetry generation, their generation of data, once and use whatever backend they want. They don't want to be coming back again, having to change code and so forth, to do observability and maintainability. They want to be able to have that portability, and then use the particular backend they want after that. Yeah, so coming into the community dev room, it feels maybe like we're preaching to the choir. Many of you here fully agree with the why, you know, why do we do open source, why do companies need to do open source. But I thought one extra data point on top of what Martin was just talking about is a report that came out just a year and a half ago that had this amazing stat that 82% of companies are looking for a vendor, and not just like Martin said, and like we all know, everyone's consuming open source, but 82% said they'd like to select a vendor who's actually participating upstream in an open source community. And then there were a bunch of responses about why: you know, because they're familiar with open source processes, or they're helping sustain a community of something that I'm depending on. And we definitely have experienced that. You know, working on containerd myself, that was used in IBM's Kubernetes engine. It's used in several AWS container compute offerings. And AWS and IBM want people who are active in that community so that we can fix problems, so that, like this last response from the survey, 46% said, I'm choosing a vendor that contributes because I know when I hit a problem, I can depend on that vendor because they have people in the open source community.
And I think Martin, you had an example of that. Yeah, I have a little example I can touch on, because I didn't want to go near the stats that Phil was throwing out there, because was it 46% of people didn't want it or they did want it? Sorry, I was a bit confused. No, on a serious note, the final point there is very telling. About a year and a half ago, I was working with a partner and they were getting involved with us at the time. And they were really, really technical. They knew their stuff, and they were a dream to work with, being a technical person, where they told me exactly what they were looking for. But one evening anyway, they were using the Operator SDK from the Operator Framework, which is in the CNCF, and they found a bug. The engineer was on North American time, so he'd gone home to bed. He raised the issue, and I came in the next morning, and it was one of those lovely mornings, I hadn't even had the coffee, and I was like, oh my God. But I thought it was brilliant that they put it out there. So anyway, I worked away, and it took me maybe two days to narrow down the bug and get the fix in. But for that partner, the fact that I was able to jump out there and make a fix mattered. It wasn't a big issue. The big issue was just finding where the thing was, as always. Once you find it, usually the solution isn't so bad. And then just working with the community, getting it in. And I think they really appreciated that fact. And you know, most of our customers and clients and so forth out there, they're very, very technical. They know their stuff. So they're not going to be hoodwinked. Yep. So yeah, Martin, you're going to take this one: we've talked a little about companies, but why do employees care about involvement in open source? And this is a lovely thing, and it's from my experience.
And as I'll talk about in a while, we have a jumpstart program to help people get involved in open source. It's an amazing kind of, you know, I hate to use the word organic and just throw it around, but it's a great way for somebody to get opportunities that they may not get within their own company. Because sometimes, you know, in companies or in teams, things are rigid in certain ways, or maybe it's a bit like the public service, for want of a better word, where it's, I've been there 10 years, so I'm entitled to do this or whatever. But for me, just the ability to get opportunities, either to speak at conferences, to meet people on different topics, to suddenly be involved in conversations that you thought were for somebody who is way more experienced than you, is just amazing. It also gives people the ability to work on, you know, you may work on certain technology in your company, but then all of a sudden you're exposed to these technologies that are out here. And Phil said it there: when we first went out into the communities, you know, now everyone's on GitHub, there's no problem with that, but when you first go to GitHub, you're playing around with it, or it used to be IRC, but it's open to Slack now, whatever. It's a big challenge when you first start out there and you're trying to engage and so forth. But it really gives you an opportunity to learn how to collaborate with people and work with people, because it's not always about the technology. It's not always about contributions. It's the collaboration as well. Because at the end of the day, in your own company, you know, Bob or Mary beside you, they're paid to work with you. When you're out in the communities, people will only work with you if you're a decent person to work with. So you get those opportunities.
And the funny thing I'd say is, just the friendships you make. As Phil said there, we worked in the same company, but we met each other at OpenStack. And over the years, we never worked together internally, but we'd meet each other at, you know, KubeCon somewhere, or some other conference, or like FOSDEM here again, and we get a chance to talk together. So I think that's lovely. Yep. And we usually take an old man selfie together. You make me do that. Yeah, he just wants it because he's still got hair and I don't. All right, so we've talked a bit about the why. Just a few points: what does it mean that a company has an open source culture, some kind of way that they're doing things to encourage open source involvement? One is just the simple fact that you're contributing back in some way. You know, you have employees who have a pathway. And I know there's probably a bunch of amazing OSPO leaders here, or people who have been active in this room, making policies and capabilities, making it possible for people to do that in a clear way. You may create open source projects. I've had the pleasure at AWS to be involved in creating two new open source projects that we've shared. We've gotten other people to collaborate, and we're continuing to build those. And then, you know, there's the whole aspect of not just that you're allowing it, but that there's some kind of encouragement. There's some way that employees who do open source don't feel like they're sort of stuck on a different track than everyone else. Like, a promotion is harder because I'm mostly doing open source, and I'm not, you know, providing for the bottom line. And really that connects to there being some value, some incentive, so that employees think choosing to work on open source is just as valuable as being on a product team or working on a service.
And then, you know, I think one of the cool things I've seen, both at IBM, you mentioned the partner story, and we have a group at AWS focused around actually collaborating with other vendors and customers and partners, trying to not just do things between ourselves, but say, hey, join us in this community, and let's work on this together. And so, you know, there's probably a lot more, but really these are some of the keys that you would look for when you ask what it even means to have an open source culture. Just going back on that last point to finish there: generally, partners are really, really on the ball technically, and they've really got their ear to the ground with their customers. They want to give the customers exactly what they want. And for them, open source is always that easier path, and it's the way they want to do it. So, you know, it is to your benefit to be able to engage like that. So how can you do this? I've been very lucky in that a number of years ago, two great colleagues, Matt Rizowski and Anne Graham, came up with the idea for a jumpstart program. And the idea was that early professionals would get a chance to do a course for about nine weeks, where there'd be an intro for the first two weeks, where we'd tell them about open source, how to contribute to open source and how to use open source, and then they'd pick a particular project. And the goal being to get to push a PR out there. Now, if you've been in open source a while, you'll go, sure, a PR. But you've forgotten about the very first time you tried to get that PR merged. Especially if it took a while, you were probably looking at your GitHub, you were probably on your phone going, come on, review it, get it in there, you know.
And it is still like that. We've all had that experience of wishing it would get in there, and then you get to a stage where you're like, sure, if they leave it in, they leave it in, if they don't, I'm okay. But, you know, it's just giving people the confidence. And as I say, we started with early professionals and now we've gone to experienced folks, because we realized they want the chance, especially, I don't know, if you're as old as myself there, you know what I mean? You may have got caught in a rut at work, or you might not have got opportunities, and I've seen people that come and see this and go, I wish I'd seen this years ago. You know what I mean? They see the potential, they see the opportunities, they see, you know what, I can take off here, and it gives people that go. So the biggest thing is informing your company: tell them about open source and the benefits of it. The next part then is introducing the tools and practices into your company, because things don't work in open source if the practices, the way people work, and the tools they're using are clunky or awkward. Because you have to remember here, it's people all over the world and all from different companies, all right? You have to find a common ground and a common way of working. And in a lot of companies now you can hear inner source coming up all over the place, and you know, sometimes you'd swear inner source was something that just fell from the sky, whereas to be honest, all you're doing is taking the first word of open source and changing it. You know what I mean? So the value has been seen here by companies, and it's the collaboration, I think, more than anything. And you know, if your teams have been struggling, or they've been finding it hard to get stuff out the door, when they really start buying into this, they realize, look, you know, a rising tide lifts all boats.
It's not about the individual, it's about the greater good. So I think that's important. The last two here: educating folks, okay? Like I said about the Jumpstart, when we do it we have weekly stand-ups for a half an hour or an hour, and I say to folks, look, this is not like school. If you don't have the stuff done or you haven't made progress, please attend anyway, and we'll help you get unblocked. And I always tell that story, because someone will come in and have a PR pushed literally in the first week, and someone else is struggling because, you know, their kids are sick or they're gone on holidays or work has been really busy, so you give them the opportunity all the same. And I always use the story of the hare and the tortoise, okay? Everyone gets there in their own time. And the last bit then, and Phil touched on it: you really need to have a path in your company so that when people contribute to open source, it's recognized, because they're doing serious work out there. It's not someone out there having parties, even though people do go to parties; I saw John Willicky up there at the OpenSSF party last night. But on a serious note, you need to be able to recognize it and say to people, you're doing really good work here, well done. Yeah, and just one thing to add to that. You know, Martin talked about the Jumpstart program at IBM. We have an open source talent task force at AWS that just kicked off in the last year. We have an amazing OSPO and Nithya Ruff, many of you know her. And we're just trying to think about how to actually include HR in these discussions: what does it mean to have open source maintainers on your staff? How do you treat them differently than other parts of the company? How do you incentivize them in the same way that maybe other employees are incentivized? And then, yeah, just a lot of the practical education parts.
Is there a way for open source newbies, so to speak, to get mentored? I do a lot of mentoring; we've built a small container runtime team at AWS, where I mentor some of the younger engineers. And with them, we've created an open source hour, actually, I think it's two or three hours now, where there's an open video call, and, you know, the guy that's three weeks into the job, he's just created his GitHub ID, and he's like, I don't even know what to do, but he can join this call. And there's others on the team who are like, here's an issue, go read the issue, let's help you figure out how to get your git set up and clone the repository. And so, you know, these are the practical nuts and bolts of how to get people involved, how to get them educated, how to get them incentivized. And again, I'm sure there's ways your companies are doing that, and I think this is an area where it'd be awesome to see more sharing of practices. You know, what are you doing in your company to incentivize and educate for open source? And just one little thing on that: one size doesn't fit all. Where I always laugh at that is I saw a label once, and I think what it said was one size fits most. But no, I was mentoring a person at work a couple of years ago, just one to one. And, you know, he noticed I was involved in Helm, and he said, right, I want to get involved in Helm. So the very first meeting we had, he said, right, I want to get into Helm or whatever. And I said to him, do you know what you'll do? Pick five projects in order of your preference and come back to me. And I'd say he was a bit stunned. He told me afterwards that at the time he thought I was the worst mentor he'd ever got: he comes in, and I tell him, come up with five things, goodbye, I'll meet you next week.
So off he went. And he came back with the five things. And lo and behold, Helm was not in the list of five things. He had interest in other stuff. All right. But I kind of knew that; I wanted that person to know what they wanted to get involved in. So they went away and got involved in Tekton. And they made a couple of contributions and they were doing a bit of work. But as the months went on, I didn't notice him jumping to become a committer, you know, becoming more of a serial contributor, reviewing more and more. And I eventually said to him about six, seven months in, look, what's the story? And he said to me, I was afraid to tell you, but I don't like Tekton. Now, that's nothing against Tekton, and if you're in Tekton, do not attack me on the way home this evening, all right? But he was honest. And we found out afterwards he was more interested in Knative. And once he got into Knative, because he wanted to do it, he flourished. And he's doing unbelievably well ever since. And every now and again, he meets me and says thanks for helping out. And you know what? That's what it's about: having someone to help him and listen to him, not tell him what to do. Yeah, Phil, over to you. Sorry. No problem. Yeah, we've got a couple minutes left, so we thought we'd connect back. You know, we talked about how we got involved in open source initially; kind of the, where are we now? For me, you know, I've been now 10 years in, spending almost the bulk of my time focused on open source: as a project maintainer, as a technical oversight board member in the OCI, a CNCF ambassador, and then taking all the things I've learned and trying to help others at AWS, similar to what I did when I was at IBM, being a subject matter expert, helping other teams figure out, hey, we have an open source project we'd like to launch, can you help us think through what that looks like?
So it's kind of an exciting point for me, to feel like I'm almost more focused on helping others now than on trying to get involved in open source myself. Yeah, a bit like Phil, I didn't put the specifics in, but in general it's, you know, left hand, right hand doing different things, as Phil filled in. But for me, I think it's been, you know, people believing in me, helping me in the communities, and now a chance to help other folks do it. That gives me great joy. When we do the Jumpstart and I come in on a Monday annoyed for whatever reason, because it's a Monday, maybe, I just get the joy of helping folks, and then also being able to help teams internally if they need a hand with open source. So to just finish up: there are no free dinners in life, as my dad used to say. And he's right. If you're going to consume something, give back, because it's the best way of driving things forward and knowing what's going on. What we've learned from working in open source, and for me definitely, is collaboration: the ability to work together, no matter where we're from or who we are, it doesn't matter. As long as you're a decent person and you're willing to work away, you will get things done. And that's what teamwork is about. All the best teams, especially sports teams (I'm going to land it with Lorna, don't worry), work the best when everybody is willing to do the job that they need to do; they don't have to be the heroes. And finally, it's a great place for people to grow in their careers and their life. And if you're a senior leader or someone in the community that's done really, really great stuff, please help other people, because that's what life is about. Great, Margot. Awesome, thank you. Q&A; I will run this mic back and forth. Any other questions? There'll be a few jelly babies in it for you. Yes.
What is the biggest community lesson you learned from OpenStack, and how have you seen that applied in open source projects that have gotten large since, like, for example, Kubernetes? Well, you're handing that to me. Well, you spent more time in OpenStack than I did. I feel like I didn't. I actually can't answer that question. No, no, no, no. I suppose, from my experience, OpenStack, I had really great experiences with it. I thought the collaboration was really, really good there. And I think that was brought forward into the cloud native communities afterwards, like Kubernetes, etc. So I think a lot of folks went and worked in the Kubernetes communities with new people that came in. But I think the key at all times was that people understood that collaboration, and being decent to each other, and that you're trying to work towards the bigger thing. We don't need heroes, in other words. Thank you. Anyone else? That can't be good: either we did really well or people are bored out of their minds. Yeah, very possible. Okay. Thanks very much, folks, and thank you.
Strategies for Building Healthy Open Source Communities
So, I'm going to talk today about strategies for building healthy open source communities. I wanted to start by just quickly thanking the Alfred P. Sloan Foundation; they fund the CHAOSS Data Science Initiative, which pays me. And also thanks to the Linux Foundation and the Ford Foundation, which also provide support for the project. I have been in the technology industry for well over 20 years, working mostly on open source projects with a focus on community strategy, metrics, and growing your contributor base. And I can tell you that it is really, really hard to build a strong open source community for a project. Most of us struggle with finding enough humans to sustain our projects. So, let's start by talking just a little bit about the problem and why it can be so hard to achieve sustainable communities for open source projects. Like I said, the problem is hard. I like to start my community talks with a quote from an alien life form on Star Trek: The Next Generation who described humans as ugly bags of mostly water. Now, I don't think we're ugly, so I think they got that part wrong. But we're super squishy, right? And not just in the physical sense. We can be unpredictable. We can be irrational, especially when we're stressed out, overworked, burnt out. And the reality is, we're not robots. We're not mindless automatons. We have feelings. We have bad days. We have other commitments, and we have personal challenges in our lives that are often completely invisible to other contributors. And they can get in the way of our contributions to open source projects. But you can't have an open source project without having human beings to maintain it. So you need to be able to encourage people to participate in ways that are sustainable over the long term, both for your project and also for those people. And it helps to be proactive and ask people to participate in specific ways, in ways that match the work you need to do within your project.
Now, many projects struggle to find people who will actively participate and continue to participate over the long term. If it was easy, you'd already have all the people you need to maintain your project, we wouldn't need this dev room, and none of you would be here watching this talk. But I think a common theme throughout all of the presentations in this dev room so far has really been that we're in a situation now where there are a lot of open source projects and not enough contributors and not enough resources to maintain those projects. So maintainers are burning out and they're in desperate need of help. And sometimes it can be really difficult to get people to contribute to your project. And unfortunately there's no magic, there's no one-size-fits-all solution. So throughout this talk I'll focus on some things you can do to increase the chances of building a community and growing contributors for your project. Now that we've talked about the problem and some of the challenges, I'll shift into talking about strategies for building healthy communities. After that I'll talk about taking a strategic, goals-based approach to metrics. And then finally I'll talk about some metrics you can use to measure project sustainability and grow your community, along with some resources and some final thoughts at the end. So, as promised, let's start by talking about developing and executing on a long-term strategy for building a healthy community, including motivation, project governance, new contributor onboarding, roadmaps, contributor ladders, which you might have heard about from some of the earlier talks, and leadership. Now, people's motivations for contributing to your project vary widely. Some people are contributing as part of their job, while others might contribute to gain experience or maybe to learn about a particular language or technology.
Regardless of why they showed up, there are some things you can do to motivate them to stick around. Clear communication, working in the open, and reducing friction are key to helping people stick around. And I'll talk more in the upcoming slides about the importance of explicit and clearly communicated project governance, along with onboarding docs and fostering a welcoming community. There are also other things you can do to motivate people to contribute. Having good first issue or help wanted labels is an excellent place to start, because these help those humans find something they can work on while they learn more about your project. Good first issue and help wanted labels are passive requests for help, so I also encourage people to be proactive and specific about ways that people can help. Asking someone specific to review a PR or answer a question or respond to an issue demonstrates that you recognize their unique expertise and that you want their help. Knowing that we're wanted and appreciated makes us squishy humans feel good, right? Which can be a strong motivator to contribute to an open source project, or to continue contributing over time. People can also be more motivated to contribute when all of the project work is done in the open, where they can participate as equals. When some of the work is done within the walls of a company, or maybe inside a close-knit group of maintainers, it can leave the rest of us feeling left out and demotivated. A lot of people like to hate on project governance: it's just extra paperwork, it's busy work, it's politicking, it gets in the way of doing the real work on the project. But this isn't true of good governance, which is really just about setting expectations and getting all of the various humans within your community collaborating together.
Ultimately, the focus of project governance is on people: the roles we play, our responsibilities, how we make decisions, and what we should expect from each other as part of participating in the community. The goal should be to make the processes for participation as obvious as possible, even for people who are brand new to the community. Having clear rules about how collaboration occurs, how decisions are made, and what types of contributions are in or out of scope helps community members make contributions that are likely to be accepted and embraced by the project. This helps avoid wasting people's time with contributions that maybe just aren't aligned with the project for whatever reason. A healthy project with clear governance makes the humans happy, and it sets your project up for future growth and long-term success. The good news is you don't have to start from scratch. The link we have here has some good templates with instructions that apply to most projects, if you want to quickly and easily build out some basic governance for your project. It's a lot more difficult to participate in a community if you don't know anything about the role you might play, the expectations, the key players, or any of the rules for participating. That explicit, documented project governance gives both new and existing contributors a clear path to guide them through your project. Spending a bit of time documenting that governance up front can save you a lot of time later, with fewer questions about how things work, and it gives you a document that you can point those other humans to if they have questions. When I start contributing to an open source project, I want to know how decisions are made, who makes those decisions, and where the discussions about those decisions happen, which helps me understand whether those decisions are made fairly and out in the open.
The bottom line is that if the processes for collaboration and decision making are not clearly documented as part of the project governance, this introduces uncertainty into the mix and uncertainty makes the humans nervous. It increases the barrier to contribution and it jeopardizes the health and viability of your project. Good documentation is how we scale the things that take up precious time for the already overworked human beings, like answering the same onboarding questions over and over and over and over. I see so many open source projects with contributing guides that don't actually provide any useful information for people who are contributing. At a minimum, a new contributor needs to understand how to spin up an environment where they can do their development, the expectations for testing, how to run tests, and any processes or other expectations that you have for pull requests and then instructions for any other requirements you might have. If this is all well documented, new contributors can get started with a minimal amount of help from the existing maintainers, which can save you a lot of time in the long run. When a project doesn't have good onboarding docs, those poor, squishy, burnt out maintainers can get frustrated by the amount of time they spend on new contributor questions, which can make it hard for contributors to feel welcome. It'll take a longer time for them to become productive. This is how the humans get discouraged and then just drift away from your project. This does not mean that you need to spend weeks and months writing the perfect onboarding documentation. At this point, anything is better than nothing. If you start with a few things that help people actually get started quickly, then new contributors can help make those onboarding documents better by adding more details and maybe some additional instructions for something that they found confusing or that they struggled with. 
Then after onboarding, people need to be able to find something to work on. Having public roadmaps is a great way to do your planning in the open, while helping people find something to work on that aligns with the direction of the project. If you were here yesterday for Lori Apples' talk, she talked a bit about roadmaps as well. Roadmaps provide some crucial functions within open source projects, including setting the direction of the project, prioritizing tasks, organizing the work, and attracting and retaining contributors, and also providing transparency into where the project is heading. I think a lot of people underestimate the impact that a well-defined and up-to-date roadmap can have when building community around a project. They can help guide everyone toward achieving common goals and having a shared vision about the future of a project to help contributors work on activities that are aligned with that vision. The document linked on the slide has loads of detailed information about building a roadmap for your open source project. One of the most important things to think about is how you'll maintain that roadmap over time and actually keep it up-to-date. It can help to use tools that are already part of your development or your community processes, like GitHub project boards, for example, if you use GitHub, so that people don't need to use yet another tool. If you have community or developer meetings, it can help to have someone walk through the roadmap every couple of weeks just to talk about the things that are blocked or need help. Maybe set aside some focus time once or twice a year to think about the future of the project, and then you can incorporate that back into the roadmap. Bonus points if you can find a really good project manager who can help with the process. Your project should also be designed to keep diversity, equity, and inclusion top of mind. 
Building a diverse community where all of these humans feel welcome and included doesn't just happen. It requires putting work and thought into it. But this time is well spent, right? Providing an environment where everyone, including people from marginalized populations, feels safe is the first step toward building a diverse community around your project. Ideally, having programs that give people opportunities for shadowing, mentoring, and sponsoring new potential leaders can help you grow a diverse set of people into new leaders for your project. Paris talked a bit about this. The Kubernetes Contributor Experience special interest group is a really great place to see some examples of how to implement programs for things like shadowing and mentoring. And projects that make a concerted effort to actually bring in new people from a variety of backgrounds, and have programs in place to move them into leadership positions, are more likely to benefit from increased innovation and just have a healthier contributor community. And by having a diverse and welcoming community, you have the advantage of gaining those humans who might not feel welcome in some other projects. Now, Paris and Bill both talked about contributor ladders. Defining the roles and responsibilities for contributors, reviewers, and maintainers can really help with recruiting new humans into these roles. It can help to think about this as a ladder, where contributors can climb up to become reviewers, and those reviewers can become maintainers. But what's important is to document it and make sure that people understand how they can climb that ladder and how they can gain more responsibilities within your project. A contributor ladder usually outlines the different contributor roles within the project, along with the responsibilities and privileges that come with them.
Having a contributor ladder helps set expectations for the roles, and it encourages people to think about how they might take on areas of increasing responsibility within the project. And as you get more of the humans moving into maintainer roles, you can reduce the load on the existing maintainers. And the good news is, again, there's a template that you can use to avoid building this from scratch. This one was based on Kubernetes, so it probably has more roles than you need, but you can simplify it, customize it, and make it work for whatever your project needs. Paris talked a little bit about emeritus roles as well, so I feel like I'm just dovetailing on all the things Paris said. But we humans like to think of ourselves as irreplaceable. We are not. We move on to other jobs. We burn out. We retire. And let's face it, unlike the robots, humans are mortal and we do not live forever. You should think about what you might want to do next and how you can prepare someone else to take over after you move on. I encourage projects to have an option for people to move into emeritus roles, which recognizes the hard work that they've put into a project and gives others a point of contact if they have any questions about what came before, while also allowing you to step away from the day-to-day responsibilities of the project. And I encourage you to think of stepping into an emeritus role as a successful way of handing off your duties to the next generation of maintainers for a project. Now, I've talked a lot about things you can improve. Metrics can help you decide where you need to improve your community and measure your progress after making improvements. But quite a few people seem to take what feels to me like a random approach to metrics, measuring the things that they see other people measuring, or gathering the metrics that are maybe easiest to collect. And maybe this even provides something useful.
But I encourage you to think about your goals and take a less random, more strategic approach by focusing on those goals. And when I say start with the goals, I don't actually mean start with your goals for the community. I actually think you need to take a few steps back and start at the very top: what's important for your organization, or what's important for your project as a whole? In a lot of cases this has been a company, in my case, but it could be an organization like a foundation, or it could just be the project instead of an organization. But you should start by looking at what that organization or project hopes to accomplish and what its goals and objectives are. And then you can take this down a level and figure out what your goals are as a community. And your roadmap can be one input into this whole process. And the most important part of putting together the strategies and plans for your open source contributions is then aligning them with the overall goals of your project. If your goals for the community don't support the overall goals for the project, you aren't likely to be successful. So it's worth the time to figure out what you want to do and how it supports the rest of the project, or how it supports what your organization is trying to achieve. Once you figure out what you want to do as a community and can tie it back to the bigger organization or the project, then you can start looking at using metrics to measure your progress. People often ask me, for example, for the projects with the best metrics, but I really just don't think that's a good approach. What you measure depends on your goals and what you're trying to achieve, which may be completely different for other projects. So I prefer to encourage people to start by defining their goals. And ultimately, you need to look at your strategies and plans and come up with criteria that will help you measure whether or not you are successful.
For example, if you want to improve the performance of a particular piece of open source software, measuring commits is not going to get you that. You actually need to have success criteria and measurements based on the type of performance you're trying to improve. Or if you want to gain influence within an open source project, maybe you work at a company, you might measure increases in contributions or the number of employees who are moving into positions of leadership. And as with any good strategies and plans, the outcomes and results should be measurable, so that you can tell whether your efforts are successful. And this is where your metrics come in handy. Once you decide on your success criteria, you need to make sure that you can get the data required to measure it, and maybe start measuring it now to get a good baseline of data. And there are loads of tools available to measure data about open source projects. Some of the commonly used tools can be found in the Linux Foundation's CHAOSS project, where I work. But there are also loads of other tools, and lots of big projects already have dashboards using either the CHAOSS tools or, as the CNCF does, DevStats. There are loads of tools available for doing this. Since this is a presentation about building community, I encourage you to focus on your goals while also thinking about where your time would be best spent on community activities. I've given a lot of suggestions so far in this presentation, and you should not try to do everything at once. So I recommend that you think strategically about where you should start while keeping your goals top of mind. If you know you've had people interested in contributing but they've given up when they couldn't get started, maybe you should start with onboarding docs. If you have a lot of casual contributors, maybe you focus on the contributor ladder and governance to help move some of those other humans up to take on more leadership positions.
An excellent way to free up time for maintainers is by getting help with the different types of contributions that take up valuable time and are actually required to make an open source project successful: things like documentation, marketing, community management, project management, and many more roles. For projects with complex code bases especially, it can sometimes be easier to onboard people into these roles first, to free up some time to onboard other contributors later. This also has the advantage of bringing people in to help with things that can have a big impact on growing your community, like roadmaps, governance, and other documentation. Time is precious, so it is important to identify the problem areas within your community where you can focus on the right things, while avoiding wasting time on areas that are already working well. However, metrics do need to be interpreted in light of your goals, how you operate as a community, and all of the other things happening within your project. There's no one-size-fits-all interpretation of metrics. So in this next section, I'll use some example graphs from some of our CHAOSS metrics and talk about what some trends might indicate and how to think about addressing potential issues. One key area to look at for your project is responsiveness. This is a backlog graph from the CHAOSS GrimoireLab tool. In this project, you can see that there are times where they've got a lot of PRs in the backlog that need to be merged or closed. Now, if these PRs are coming from several regular contributors who aren't maintainers, it might be a good idea to look at how you can promote some of those humans to become reviewers or maintainers to help out with the workload. But as with any metrics, you need to interpret them in light of your project.
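As a rough illustration of the backlog signal described above, here is a minimal sketch that counts open PRs older than some threshold. The record layout and field names (`opened_at`, `closed_at`) are hypothetical, not GrimoireLab's actual schema; real data would come from a tool like GrimoireLab or a forge API.

```python
from datetime import datetime, timedelta

def stale_open_prs(prs, now, max_age_days=30):
    """Count open PRs older than max_age_days. A growing number can mean
    reviewers are overloaded, or just that a release or vacation is near."""
    cutoff = now - timedelta(days=max_age_days)
    return sum(1 for pr in prs
               if pr["closed_at"] is None and pr["opened_at"] < cutoff)

# Hypothetical PR records for demonstration.
now = datetime(2024, 2, 1)
prs = [
    {"opened_at": datetime(2023, 12, 1), "closed_at": None},   # 62 days open
    {"opened_at": datetime(2024, 1, 25), "closed_at": None},   # 7 days open
    {"opened_at": datetime(2024, 1, 1),
     "closed_at": datetime(2024, 1, 11)},                      # already closed
]
print(stale_open_prs(prs, now))  # → 1
```

Tracking this count over time, rather than as a one-off snapshot, is what makes it useful for spotting trends.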
There are other things that can cause an increase in the backlog, like everyone preparing for a big release, or maybe a big conference, or just vacation season, that might not be resolved by moving more people into leadership. Again, these graphs come from GrimoireLab. Other responsiveness metrics focus on the amount of time it takes for maintainers to close issues and PRs. Looking at trends for these metrics is particularly important. In this example, you can see that it's taking a lot longer for maintainers to close issues or PRs. It might be a good idea to look at how you can promote some more humans to become reviewers or maintainers to help with the workload. Again, you need to interpret this in light of your project. There are other things that can cause an increase in time to close, like the project becoming more complex or larger, which can just increase the time required for things like testing and other activities that happen in the process of reviewing and closing PRs. It can also help to look at the types of contributors that you have. In this case, casual contributors are those drive-through contributors who make a small handful of contributions and then disappear, possibly forever. Regular contributors are the ones who make some contributions and then stick around and continue to make contributions over a period of time. Core contributors are usually the maintainers who are there for the long term. You can really learn a lot from this graph. If you have a very small number of casual and regular contributors, this can mean that people don't have the information needed to become productive and to contribute. In some cases, onboarding docs can help solve these issues. Another thing this graph can indicate is whether there may be some fundamental issues within the project that are driving the humans away from your project.
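The casual/regular/core split described above could be sketched as a simple bucketing over commit counts. The thresholds here are illustrative assumptions, not a CHAOSS-defined standard, and real analyses would also count non-code contributions.

```python
from collections import Counter

def classify_contributors(commit_authors, casual_max=3, core_min=50):
    """Bucket contributors by commit count: 'casual' drive-through
    contributors, 'core' long-term maintainers, 'regular' in between.
    Thresholds are illustrative only."""
    buckets = {"casual": set(), "regular": set(), "core": set()}
    for author, n in Counter(commit_authors).items():
        if n <= casual_max:
            buckets["casual"].add(author)
        elif n >= core_min:
            buckets["core"].add(author)
        else:
            buckets["regular"].add(author)
    return buckets

# Hypothetical author list, as you might extract from `git log`.
log = ["maya"] * 120 + ["raj"] * 12 + ["lee"] * 2
buckets = classify_contributors(log)
print(buckets)  # maya is core, raj is regular, lee is casual
```

Watching how people move between these buckets over successive quarters is often more informative than any single snapshot.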
If you see the total number of contributors declining, or the number of regular contributors declining, this can indicate some deeper issues, maybe toxic community members or an unwelcoming environment, and that probably needs to be resolved before you do anything else. Or it can mean there are other issues, with things like lack of responsiveness. This metric is often called the bus factor or lottery factor, based on the idea that if one person disappeared after winning the lottery, and that person was making all of the contributions, then the project would probably be in trouble if they left. This graph uses data from CHAOSS's Augur software. I recommend measuring this because there are a few things it can tell you. First of all, how big of an issue is your current contributor situation? If it's like this one, you really should focus on getting some additional contributors and maintainers. You also might find that there are people who are contributing more than you realized, which is the other reason this is a good metric. This can help you think about who you can encourage to contribute more, or maybe find someone who can move up the ladder into a leadership role. So you might look at some of those people who are a little bit lower down on the graph and see if you can promote them up into being a maintainer. The catch here, as with so many metrics, is that we don't want to just think about the people who are making commits. It's a start, but you should also be thinking about how to move people into leadership positions to be responsible for things that might not show up in GitHub, like documentation, community management, marketing, mentorship, and lots of other important roles. And metrics are not something that you look at once and never revisit. It's important to think about metrics gathering as an ongoing process of measuring, improving, and monitoring. So you think about your goals and what you want to achieve. You pick some metrics.
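One common way to compute the lottery factor described above is to find the smallest set of people who together account for at least half of all commits. This is a sketch under that assumption; tools like Augur compute it from real repository data, and the commit-count basis shares the caveat mentioned above about invisible non-code work.

```python
from collections import Counter

def lottery_factor(commit_authors, threshold=0.5):
    """Smallest number of people who together account for at least
    `threshold` of all commits. A value of 1 or 2 signals risk."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    covered, factor = 0, 0
    for _, n in counts.most_common():
        covered += n
        factor += 1
        if covered / total >= threshold:
            break
    return factor

# Hypothetical commit log where one person dominates.
authors = ["ana"] * 80 + ["ben"] * 15 + ["chi"] * 5
print(lottery_factor(authors))  # → 1 (ana alone has 80% of commits)
```

A factor of 1, as here, suggests the project would struggle badly if that one contributor stepped away.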
You make improvements, and then you monitor that over time. And before I wrap up the talk, here are just a few resources that you might find useful. There's some great stuff from the CNCF TAG Contributor Strategy, including how-tos and templates. The Open Source Way guidebook is just another one of my favorite community resources. And then the CHAOSS metrics. We also have a Slack channel; you're welcome to join us. Anyone can participate in the CHAOSS project. Maintaining an open source project is so much work, and there are so many maintainers who are overworked, exhausted, and burning out. The best way to address this challenge is by finding more humans and growing your contributor community. But it's hard work, right? And it takes time away from the day-to-day activities now, which can be super hard to justify if you feel like you're barely keeping up as it is. In the longer term, spending at least a little time on things that can help you recruit and keep new contributors will be worth it. And as I mentioned before, you don't need to do everything at once. Think about your goals. Use your metrics to help you figure out where your time would be best spent. So this is what I'm asking you to do. If you're a contributor to an open source project, carve out maybe an hour a week to improve your onboarding docs, your contributing guide, your project governance, your metrics, or just spend that time helping another human learn to do something new in the community. With that, thank you. And I think we have another two minutes for questions. Yes? Thanks for the presentation. It seems that some of the ideas that you presented, the contribution ladder... Sorry, can you speak up a little bit? Thanks. Thanks for the presentation. It looks like some of the ideas that you presented, like the contribution ladder, can maybe be at odds with a project that is really owned by a company or where there is a strong presence of the company.
Do you believe that there is a way to resolve this? Yes, I do think that there's a way to resolve that. I do think that the governance and the contributor ladders sometimes work a little bit differently when you're talking about projects that are owned by companies. I think that the best thing the company can do is to be honest about which roles are really open to people from the community and which ones might not be. And that might not be something that your company wants to be transparent about, but if you're really trying to build a community around it, I do think you have to be transparent about that. And I think that the people who stick around in your community will at least respect that transparency, even if maybe it's not the answer that they wanted to hear. So I think there's definitely room for that, but it will look a little bit different, and you will have to have that balance between the company and the community. Thank you, Dawn. That's all we have time for. Thanks. Thanks. Thank you.
Intel TDX Deep Dive
Perfect, I guess then let's start right away, a bit earlier, which potentially leaves a bit more time for questions. Hi, my name is Benny Fuhry. I work in the Confidential Computing Enabling Team at Intel. My main job is to... I can try, but I noticed when I stood there that it's not really loud, the speaker is not loud there. Okay, I will try to speak closer to the microphone. So yes, Intel Confidential Computing Enabling, right, we work together with academics, companies, partners, whoever wants to use our technology, we help them to do that, right, that's my job. Today, I will talk about Intel TDX. It's called Deep Dive, but I will start with an overview and then go deep in a few slides, right. So overview first, I don't want to speak too much about that, right, it was just done in the talk we just had. Without confidential computing, or if you don't use any protection mechanism, everything is in what we call the trust boundary, right, everything can access your confidential data. With our first technology, Intel SGX, which was just mentioned, only the application is protected; with Intel TDX, the topic of today, a whole virtual machine is protected. Everything of that was just mentioned, I just want to mention it again, because we have the options, right, use whatever you want. In general, you could say Intel SGX is the more secure technology and Intel TDX the more usable technology, right, but that's up for debate if you want. Yeah, today we will concentrate on Intel TDX only. Here, you see an overview, like this is what a regular system looks like, right, we have the platform with cores, caches and so on, the memory and a regular hypervisor, a virtual machine monitor here. And with normal VMs, this hypervisor starts the virtual machines, right, and this hypervisor also isolates the virtual machines from each other and isolates the virtual machines from the hypervisor itself.
In the main memory, everything is plain text, right, which means that every person and every program with the necessary privileges can access the data, right, it's plain text. This is different with Intel TDX. With Intel TDX, we introduce what we call the Intel TDX module. The hypervisor has to be adjusted as well; it now says here it's TDX-enlightened, because the hypervisor is still responsible for resource management, but instead of starting the virtual machines itself, it has to go to the TDX module and say here, please start your TDX-protected virtual machines for me. And this is what the TDX module does, right, it starts the protected virtual machines, which we call trust domains. Intel TDX stands for Intel Trust Domain Extensions, and Intel TDX protected virtual machines, we call them trust domains. Inside those trust domains or TDs, the guest OS running there has to be enlightened as well, right, it has to at least have some changes, because it now has to handle accesses to private memory and shared memory, it also has to handle exceptions, and it has to block certain calls that were possible before. But the applications inside the TD do not have to be adjusted, and that's the main advantage of Intel TDX or comparable technologies. The main memory belonging to the TD is encrypted with an ephemeral key that is dedicated and hardware managed, right. As you see on the slide, it says encrypted with key one and key two, because every trust domain is encrypted with a different key. Inside the CPU, the data belonging to the TD is plain text, right, that's what confidential computing does, inside the CPU, data is plain text, but the CPU takes care that only the trust domain to which the data belongs has access to the data.
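The per-TD key scheme can be sketched abstractly. This is a toy model only: a SHA-384 counter-mode keystream stands in for the real AES-XTS hardware, and the class below just illustrates the property that software handles key IDs while the ephemeral keys themselves stay inside the engine.

```python
import hashlib
import secrets

def keystream(key, n):
    # Toy keystream (stands in for AES-XTS): SHA-384 in counter mode.
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha384(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

class MemoryEncryptionEngine:
    """Toy model of the hardware key engine: software only ever sees key IDs."""
    def __init__(self):
        self._keys = {}
    def new_key_id(self):
        kid = len(self._keys)
        self._keys[kid] = secrets.token_bytes(48)  # ephemeral, per-TD, never exported
        return kid
    def write(self, kid, data):
        return bytes(a ^ b for a, b in zip(data, keystream(self._keys[kid], len(data))))
    read = write  # XOR with the same keystream is its own inverse

engine = MemoryEncryptionEngine()
td_a, td_b = engine.new_key_id(), engine.new_key_id()

ciphertext = engine.write(td_a, b"td-a secret")
assert ciphertext != b"td-a secret"                      # memory holds only ciphertext
assert engine.read(td_a, ciphertext) == b"td-a secret"   # the owning TD's key decrypts
assert engine.read(td_b, ciphertext) != b"td-a secret"   # another TD's key yields garbage
```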
Combining the main memory encryption and access control, Intel TDX enforces the isolation of the different TDs by using the Intel TDX module, and on top of that, attestation proves that this is the case, right. We will talk about attestation a bit later. This slide is about the Intel TDX enabling in Linux. It contains a lot of details, and I don't want to go into every detail, I only want to highlight three things. First, VM isolation requires enabling a lot of parts of the software stack, and Intel has done that, right, we have put the work in, and basically everything is open source. Even the pieces in gray, they are only gray because they are a reference implementation, but they are also open source. And most of the pieces are already upstreamed, but not everything, right, that's the current situation of Intel TDX, but this will change soon, hopefully. One last slide of the overview is the availability of Intel TDX. Intel TDX was introduced at the beginning of 2023 with the fourth generation of Intel Xeon Scalable processors, but back then, only at the four leading cloud service providers you see on the right. Everybody else buying these CPUs did not have Intel TDX enabled. Previews at these cloud service providers started already in Q1 2023, and general availability is supposed to come soon this year. Intel TDX became generally available with the fifth generation of Xeon Scalable processors, which was introduced at the end of last year in December, meaning if you now go to your favorite hardware vendor, you should be able to get such CPUs, or at least soon. Good. Now to selected technical details of the technology. First, the CPU state is kept confidential by managing it in CPU-protected memory, and that's the responsibility of the TDX module. For example, on a TD exit, the CPU state is saved by the TDX module in a protected memory region, and this memory region is encrypted.
And all memory confidentiality and integrity that's provided by Intel TDX is provided by what we call the TME-MK engine, the Total Memory Encryption, Multi-Key engine with integrity. And this is used to encrypt all the main memory belonging to a TD to prevent untrusted software from observing the TD's memory. It uses AES-XTS with 128-bit keys, and each TD, as mentioned before, has its own key. The memory integrity feature detects TD private memory corruption by software and direct memory access. The TDX module is responsible for managing the keys used to encrypt the different TDs, but the TDX module itself still does not have access to the keys. This is done by the TME-MK hardware that manages the keys; the TDX module only references key IDs. No piece of software has access at any point to the keys that are actually used for the main memory encryption. I will skip remote attestation for now, because I will explain details later. But a bit about IO compatibility. By default, no direct connection to external devices is possible, because those external devices are untrusted. Such IO can be emulated by software, but this has performance overhead. At the end of the talk, I will talk a bit more about these aspects and how the situation should change in the future. With Intel TDX, performance monitoring and debug facilities run inside the TD. This is a difference compared to Intel SGX, and it means you can debug your application handling sensitive data, because even during debugging, you are protected, you are inside the trust domain. Sure, the person that does the debugging now has access, but still the infrastructure provider doesn't see it. One final aspect here: the page table management happens inside the trust domain now, to address remapping attacks. This was also different with SGX, where it was the responsibility of the operating system, which was untrusted. A few more details about the TDX module and what we call the secure arbitration mode.
The TDX module is provided by Intel and the code is open source. Since only two weeks ago or something, it's on GitHub now. The SEAM loader verifies the signature of the Intel TDX module when the system boots and loads it into a special memory region, which we call the SEAMRR. Only software in the SEAMRR itself is able to access other memory in the SEAMRR, in effect hindering everything but the TDX module from accessing it. All other software access and DMA access to this memory is completely blocked. The confidentiality and integrity of the SEAMRR is again protected with AES-XTS with 128-bit keys. The Intel TDX module runs in what we call the secure arbitration mode, or SEAM for short; to be more precise, in the SEAM VMX root mode. With the introduction of Intel TDX, the ISA was extended by four instructions to enable the communication between the host, the hypervisor and the hardware. These four instructions are: SEAMCALL for interactions between the hypervisor and the TDX module, so start the TD, stop the TD, things like that; SEAMRET to return the execution control back to the hypervisor; TDCALL for a call from the TD to the TDX module; and SEAMOPS for calls from the TDX module to the hardware. Certain security-critical ISA instructions are denied in SEAM to provide the protection guarantees we want. Now to TDX remote attestation. TDX remote attestation, you all know that, you all have heard of that in SGX or in other technologies, uses quotes. Quotes are created by hardware, and the quotes are used to prove something. In this case, the TD can prove at least four different attributes with this quote. First, the booted image is exactly as expected. During the loading of the image, it's measured, so it's hashed, and this hash is stored in what we call the MRTD. This is part of the quote. Second, measurements created or extended during runtime.
Intel TDX has what we call runtime measurement registers, or RTMRs, and they can be extended at runtime. It's not done automatically; it's a "can", not a "must". It's a subtle topic if you're more interested in that, but that's what we have. Number three, the TD is executed on an Intel TDX enabled platform. It's obvious why that's important: nobody should just be able to simulate that it's Intel TDX hardware. Number four, the Intel TDX platform is fully patched. As you know, I assume, in the past there were problems with the different technologies, including Intel SGX, but then we provide a patch, and we have the ability to prove at what patch level your platform is. Then, as it says here in the next line, whoever is the relying party can look at the quote and then decide if it trusts the TD or not. Some might decide even an older patch level is fine, some say only the newest one is fine, some say the MRTD has to be a certain value, RTMRs have to be or don't have to be used; all of that's possible. A bit more about the process of remote attestation, which should look very, very familiar to people that have seen the SGX remote attestation. It all starts with a relying party triggering the trust domain: here, please prove to me the things I just mentioned. The TD will reach out to the TDX module, and the TDX module will reach out to the hardware. The hardware will then generate what we call a TD report, and this report contains the measurements I mentioned before, but it also has, for example, the security version number of the TDX module, the measurement of the TDX module and the measurements of the TD, and all other aspects that are in the trusted computing base, and it's signed by the hardware at this point. The TD report then is routed back to the TD, back to the hypervisor, and then to what we call the TD quoting enclave. And as the name enclave already suggests, it's an Intel SGX enclave, right? So we use Intel SGX for remote attestation of TDX.
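The extend operation on these runtime measurement registers works like a TPM PCR extend: the new register value is the SHA-384 hash of the old value concatenated with the new measurement (RTMRs are 48-byte SHA-384 values). A minimal sketch of the idea:

```python
import hashlib

def extend(rtmr: bytes, measurement: bytes) -> bytes:
    # New value = SHA-384(old value || measurement), like a TPM PCR extend.
    return hashlib.sha384(rtmr + measurement).digest()

rtmr = b"\x00" * 48  # RTMRs start out zeroed and are 48 bytes (SHA-384)
for blob in (b"kernel", b"initrd", b"cmdline"):  # hypothetical boot components
    rtmr = extend(rtmr, hashlib.sha384(blob).digest())

# Order matters: replaying the same events in the same order reproduces
# the register; any change or reordering yields a different value.
replay = b"\x00" * 48
for blob in (b"kernel", b"initrd", b"cmdline"):
    replay = extend(replay, hashlib.sha384(blob).digest())
assert replay == rtmr
```

This is why a relying party can verify an RTMR value simply by replaying the expected event log and comparing the result.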
The TD quoting enclave checks if the report was signed on the same platform, and if that's the case, it will sign it with the attestation key. I will come to what this means in a second and why this matters. But now we have a TD quote that's signed by the attestation key, and this TD quote is passed back to the relying party, who can do quote verification now. But the important question is what just happened, right? The TD quote was signed with an attestation key; what does that mean, why should we trust that? And a key piece I skipped before is that the TD quoting enclave has randomly generated the attestation key before the process even starts, right? Without Intel being involved at all, this happens on the platform. But that still doesn't help much. What also happens on start: the so-called provisioning certification enclave, that's also provided by Intel, will do local attestation with the TD quoting enclave. It will see, yes, okay, we both run on the same machine, it's the TD quoting enclave that I expect, and it just provided me an attestation key. And then it will use the provisioning certification key to sign a certificate. So then we have, as you see on the right side, an attestation key certificate that's signed by the PCK. But again, why does this matter, right? The important piece now is that Intel is able to create PCK certificates that are then rooted in an Intel CA, and this completes the trust chain, right? So the attestation key is generated on the hardware, but it links back to an Intel CA. And during quote verification, whoever does it, wherever this is done, can reach out to what we call the provisioning certification service to get all the collateral that's needed to check this chain. That's the process of remote attestation. And as said before, Intel TDX attestation uses Intel SGX. All the sets of collateral we had before, PCK certificates, distribution and caching services, they supported only Intel SGX in the past. Now they support both.
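The shape of that trust chain can be sketched in a few lines. This is a toy model: HMACs stand in for the real ECDSA certificate signatures, all key material is randomly generated here, and the report contents are placeholders; the point is only how each link is checked from the Intel CA down to the quote.

```python
import hashlib
import hmac
import secrets

def sign(key, msg):
    # HMAC stands in for the asymmetric signatures used in reality.
    return hmac.new(key, msg, hashlib.sha384).digest()

def verify(key, msg, sig):
    return hmac.compare_digest(sign(key, msg), sig)

# Toy keys: in reality these are asymmetric key pairs, and the CA key is Intel's.
ca_key  = secrets.token_bytes(32)   # roots the chain (Intel CA)
pck_key = secrets.token_bytes(32)   # per-platform key, certified by the CA
att_key = secrets.token_bytes(32)   # generated on-platform by the TD quoting enclave

pck_cert = sign(ca_key, pck_key)    # Intel CA certifies the PCK
ak_cert  = sign(pck_key, att_key)   # provisioning certification enclave certifies the AK

td_report = b"MRTD || RTMRs || TDX-module SVN"   # placeholder report contents
quote = (td_report, sign(att_key, td_report))

# The relying party walks the chain from the CA down to the quote.
report, sig = quote
trusted = (verify(ca_key, pck_key, pck_cert)
           and verify(pck_key, att_key, ak_cert)
           and verify(att_key, report, sig))
assert trusted
```

Breaking any single link (a tampered report, an uncertified attestation key) makes the final check fail, which is exactly what rooting the chain in the Intel CA buys you.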
This also means that it's required to enable SGX on the machine when you want to use TDX. Just quickly, a few words about how you can do the verification, right? There are basically four options. You can use a service by the cloud service provider. You can use a service by the vendor of your application. You can use the Intel Trust Authority, an independent software-as-a-service offering by Intel, to do the verification for you and simplify the process. Or you can build it on your own with the open source Intel libraries we provide. A few pieces of differentiation between the services: if you want to have a separation of responsibility between the infrastructure provider and the verification party, then you should not use the cloud service provider, obviously, but in all the other cases, it's fine. If you want to support both SGX and TDX, then it's up to the cloud service provider and the application vendor whether they support both variants; the others definitely support both. If you want to have consistency across your applications, across the environments you have, on-prem, hybrid, whatever, then obviously cloud service providers cannot be used, the application vendor potentially can, and the others will do it. From a development perspective, the effort is low in the first three cases and, I would say, medium in the last case. Now quickly, very quickly, two upcoming features of Intel TDX, so that we have at least a little bit of time for Q&A. First, TD migration. TD migration will allow to live migrate one TD from one platform to another. It uses a service TD called the migration TD to do that. All the data is obviously encrypted; I'm just skipping a few details now, everything is encrypted. Everything goes over step by step, then there's a short break: the TD on the left side goes down, the one on the right side comes up, which guarantees that a TD lives only once at a time. You should not have two different TDs with the same content.
And one last feature, Intel TDX Connect. I mentioned that before; it's a bit problematic at the moment to connect trust domains with a device. It is possible, but at the moment everything in the private memory of the trust domain is encrypted, and the trust domain can write to shared memory, but it can't directly write to the device. What it can do is put data in shared memory, and the device can take the data from shared memory, right? What we call a bounce buffer. So this is a bit slow. Still, it can be done securely, right? If a secure session key is established between device and trust domain, the data can be encrypted, put there, and read in the device, and it's encrypted. So even today, this solution is there. You can connect an Intel TDX trust domain to an NVIDIA GPU with their confidential computing technology and have it end-to-end secure. That's possible. But it's a bit slow, or it has a bit of overhead, because of this bounce buffer stuff. And this will change when Intel TDX Connect comes along. Because with Intel TDX Connect, the idea is that a trusted device is put in, let's say, the trust boundary of a trust domain. They're able to write into each other's memory directly after they trust each other, which will make the whole thing more efficient with lower overhead. And nothing I mentioned today is any secret, right? All of that is open here on this page. We have documentation. Knock yourself out. It's like thousands of pages you can read in the PDFs to get all the details you want. If not, feel free to reach out to me at any point after this talk, at any point later. If you have interest in, for example, bare metal access to machines, I'm also your guy, for whatever experiments, at a university, as an organization, whatever, right? Because at the cloud service providers, you normally don't get that, right? You get a trust domain. That's it.
Might be enough in many cases, but not in all. So reach out to me, and thank you for your attention. Can we repeat the questions? Yeah, so, yeah? I have to repeat the question. The question was, or I rephrase, correct me if I'm doing a bad job there: you said it's possible to run a legacy application in a trust domain. The question is, how is the integrity of such an application maintained, considering the fact that this application is legacy? Okay, yeah. So the question was, again, in my words: how is the process then protected, given that the application wasn't written for this environment? And the answer is, it depends, right? If you have an in-memory-only application, then you don't have to do anything, right? Because the main memory is encrypted and you're done. As soon as your application writes to disk, it's a different story, right? Because if you write plain text data to disk, then it's plain text and everybody will see it. One thing you can do is have your application encrypt the data before writing, but then it is a change to the application, right? Another variant is that you activate, for example, full disk encryption in your operating system. Then you have to manage the key, right? That's another question then, but that's what you can do. And it's exactly the same for network connections, right? If you, again, send plain text data out, yeah, plain text data is out. But if you use TLS, you can do it: you just put your TLS endpoint in the trust domain now and you're good. Yeah? Thank you for a very nice talk. So I had a question about the state of software support. Thank you very much. So I had a question related to the sort of status of the software support on the guest side, right?
So with some of these comparable technologies today, you still need some components in the middle on the guest side, like basically firmware inside the guest, or paravisor functionality that hides some of this communication with the underlying layer. So how is it with TDX today? Can you take a stock Linux kernel and run this? Or do you still need some components there which are not yet fully open source? So at the moment, as I said briefly before, not everything is upstreamed, right? I guess the basic enabling should be there by the middle of the year, so at the moment, it's not fully there. But what we have is what we call the TDX early preview. We collaborate with three operating system distribution vendors to provide specific distribution versions, and that's Canonical, Red Hat and SUSE. And all of this is online. You just go to GitHub, and, I just did it yesterday night, right, it's really like: you start up an Ubuntu 23.10, for example, you clone their repository, click install, done. You go to the BIOS and activate TDX. Then they have another script to create a guest image. It takes maybe 15 minutes to create, but that's mostly download and all of that stuff. You start your trust domain and you're done. So that's pretty easy already. Yeah, thank you for the talk. I have kind of an obvious question. Is there a latency cost within one trust domain for memory access, given that it's encrypted and so on? So performance, you mean, right? Okay. Yes, obviously there has to be, right? Encryption can't be for free. But how high the overhead is highly depends on your workload. If it's a processor-only workload, it's basically free. I don't have concrete numbers, but let's say one, two percent, right? So really, really low. If it's really disk-IO sensitive, it's a different question, right? Because of this bounce buffer and all of that stuff. Again, don't nail me on it, but let's say it might go to 10% or even more, right?
It's really, really dependent on your workload. I guess I have to stop now, but you can just come to me later, right?
SEV-Step: A Single-Stepping Framework for AMD-SEV
So, the next speaker is Luca Wilke from the University of Lübeck, and he will talk about some recent work he has been doing, actually, attack research. I'm very excited that the Dev Room from the start has had a consistent attack research line as well, which I think is very important for this new type of technology. So Luca, enlighten us. Yeah, thank you very much for the kind introduction. I will be talking about SEV-Step, which is a single-stepping framework for AMD SEV, and it's open source and available on GitHub, so feel free to check it out. And this was created as part of an academic paper, which is joint work with these great people down here. Okay, just a quick recap of where we are in the trusted execution environment landscape. So as the name suggests, SEV-Step is about AMD SEV. So we are in this confidential VM area here. However, single stepping is something that basically affects all TEEs that are out there right now, so keep that in mind. Okay, with that out of the way, we can jump right in and explore what single-stepping attacks actually are. So we start with a quite high-level picture. What we want to do here is take some kind of snapshot or observation of our protected application, and we use this for our attack. Now, if our TEE runs normally, then it runs basically at full speed, and if we take these snapshots, we don't have any synchronization with this TEE process, and thus the observation and the data that we get is very blurry. But now if we start to interrupt the enclave at certain points, then we have these synchronous points in time where we can start to take our snapshots. So it's not running in parallel anymore, but the enclave is paused when we take our snapshots. And thus we already get a little bit more information. And now if we take this to the maximum resolution and we are able to interrupt the enclave after every single instruction reliably, then we get a pretty clear picture of what's going on.
So I hope that already gave you a good intuition. And now we go into what single-stepping attacks have actually been used for, mostly in academia. And these are all examples that have been done with SGX, which really made this popular in academia because it made single stepping very accessible. So the first basic attack avenue here is something called interrupt latency, and there you basically measure how long it takes from when you started this attack to when you get this callback that the enclave has now been interrupted or exited. And it has been shown that this timing actually reveals something about the kind of instruction that's running in the enclave. And for some instructions, like divide instructions, you can even learn something about the operands. So dividing by certain numbers takes longer than dividing by other numbers, and thus you can really learn the kind of instruction and maybe even the operand with these attacks. Then the second major attack avenue here is called interrupt counting or instruction counting. And here the idea is that certain algorithms and applications have secret-dependent control flows; this is especially true for cryptographic algorithms. We have some secret key, and then we do some large integer multiplication or division, and the code executes a different number of instructions depending on the secret data. And now when I do these single-stepping attacks, I can simply count the number of steps that I take. And then if I know on which code page I'm currently in, I can learn something about the secret data just by observing the number of instructions. So in this tiny example here with a conditional jump, in one case we skip over this move here, and in the other we don't. So here we get two instructions executed, here three. And by knowing the code that's currently running, we can infer the value of the secret bit here. Then the third really popular attack avenue is not directly single stepping, but closely related.
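The counting idea can even be simulated in plain Python: CPython's tracing hooks can count the bytecode operations a secret-dependent branch executes, which stands in for counting single-stepped instructions. This illustrates the principle only; it is not SGX or SEV code.

```python
import sys

def secret_dependent(bit):
    # The taken path executes extra operations, like the skipped mov
    # in the conditional-jump example from the talk.
    if bit:
        x = 1
        x += 1
    else:
        x = 1
    return x

def count_opcodes(fn, *args):
    """Count bytecode operations executed inside fn: the analogue of
    counting instructions via single stepping."""
    counter = 0
    def tracer(frame, event, arg):
        nonlocal counter
        frame.f_trace_opcodes = True  # request per-opcode trace events
        if event == "opcode":
            counter += 1
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return counter

steps_if_one = count_opcodes(secret_dependent, 1)
steps_if_zero = count_opcodes(secret_dependent, 0)
# The step count alone reveals the secret bit:
assert steps_if_one > steps_if_zero
```

Just as in the real attack, the observer never reads the secret directly; the execution length alone leaks it.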
It's called zero stepping. And here the idea is that we interrupt the enclave even more frequently, so before it is able to actually execute a single instruction. So it doesn't make any progress on an architectural level, but on a micro-architectural level, its first instruction already starts to execute, then gets aborted and rolled back. So in the micro-architectural state, there's actually already stuff going on, and these attacks are able to measure this. And what we can do then is basically take an infinite number of measurements while only running the enclave once. And this allows you to measure really, really tiny effects. And then the third column here is kind of the miscellaneous catch-all column. So as you can imagine, just by increasing this temporal resolution, you can improve basically any side-channel attack. So it has been used in many of these MDS attacks here, for example. Okay, so now that we know what single stepping basically is and why it's really dangerous, we come to the main question of this talk. Can an SEV VM be single-stepped? And if so, how? So let's take a look at the basic setup here. This is like a very boiled-down version of the control loop that's going on in the hypervisor, where we enter the VM here. Then we execute some instructions, and then at some point we exit. So for single stepping, the obvious question is: when do we exit the VM here? This is what we want to control in our attack. And there are multiple reasons why this can happen. So we can configure certain instructions to be intercepted. And we can also use page faults, by removing access rights in these nested page tables. However, none of these two methods gives us the amount of control that we want, because they are not instruction granular. However, we can also use external interrupts to force an exit from our VM. And this is actually what will allow us to achieve this instruction granularity.
And for this, the attacker uses something that's called the APIC timer. It's a common timer on x86, used by the operating system. And by injecting this timer interrupt, we will force exits from the VM. So let's zoom in a little bit. This is a typical attack sequence here. In red, we have the code that runs in the hypervisor; it's controlled by the attacker. And on the right here, in blue, those are the three instructions from the VM that you just saw. So what do we need to do now to achieve single stepping? Well, intuitively, you would think that you would need to hit this tiny window between these two instructions here to single-step. However, luckily on x86, it's already sufficient if our interrupt hits somewhere during the execution of this instruction, because then it will be held pending and will be recognized at the instruction boundary. Okay, but if we just naively implement this and try to do this, then we are not quite there yet. And we will see that sometimes we will overshoot here, and then we will execute two or more instructions. And this, of course, decreases our resolution, because now we cannot guarantee that we do something after every instruction. Maybe we have bad luck and skip over very important memory access instructions and so on. So this multi-stepping is really bad. And on the other side, we might undershoot a little bit and zero-step. And this is not really dangerous, because then we simply repeat; we don't miss out on any instructions, we just try again, and it's a little bit less efficient. So why is this the case? There have been some really nice papers on SGX, and they show that this APIC timer has quite some jitter. So it's not cycle accurate, so it kind of makes sense that we see this behavior here. So what do we do about this? And the kind of obvious idea is: okay, we need to make this window larger, because our timer doesn't have a high enough resolution. So we need to enlarge the window in which our timer can hit.
And for this, we look at what's actually going on when we execute an instruction here. So first we have to fetch the instruction from memory, from the code page, and then the CPU can decode it, issue it to the pipeline, and eventually retire it. So for the attack, the idea is now that we make sure that this fetch takes a long time, and we achieve this by simply flushing the page from memory. So we flush the VM's TLB, and then when we enter it again, we need to do a page-table walk, which will take some time, and this effectively prolongs the window that is required to execute the first instruction. And now, although our timer still has this jitter, this window is large enough that we can actually rely on the single step. And SEV-Step, at the time of publishing, was the first framework that did this; shortly afterwards, there were also some papers that did something similar. And it's also open source, so we hope that other people will reuse it. Okay, so now let's take a little closer look at the SEV-Step framework itself. So besides reliably single stepping, we wanted to achieve two other goals, and these are reusability and interactivity for the attacks. And I will go over these two goals now in more detail. So for reusability, let's again look at our setup here. Since we want to program this APIC timer, we want to manipulate these page tables and maybe do some cache priming and probing. All of these things would benefit from being really close to entering and leaving the VM, because this is the point where we have the lowest noise. However, this also means that we need to manipulate or change the kernel code, and developing kernel code is quite hard. It's hard to debug, you're limited to C, you don't have any external libraries. So it's not the nicest programming environment. And it also makes reusing this for different attacks or for different papers quite hard, because this environment is not so nice.
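The interplay between timer jitter and the first-instruction window can be sketched as a toy model. This is purely illustrative (the cycle numbers are made up, not measured); the point is that once the TLB flush stretches the window beyond the worst-case jitter, every interrupt lands inside it and multi-stepping disappears:

```rust
// Toy model of APIC-timer jitter vs. the first-instruction window.
// Cycle counts are invented for illustration only.

/// Classify where an interrupt lands. The window of `window` cycles starts
/// `offset` cycles after the programmed firing point; `jitter` shifts the
/// actual arrival time.
fn classify_step(jitter: i64, offset: i64, window: i64) -> &'static str {
    let arrival = offset + jitter; // when the interrupt really arrives
    if arrival < 0 {
        "zero-step" // too early: VM exits before the first instruction starts
    } else if arrival < window {
        "single-step" // lands while the first instruction is still in flight
    } else {
        "multi-step" // too late: two or more instructions already retired
    }
}

fn main() {
    // Jitter of ~40 cycles against a 10-cycle window: unreliable, overshoots.
    assert_eq!(classify_step(40, 0, 10), "multi-step");
    // Flushing the TLB forces a page-table walk, stretching the window
    // (say, to 500 cycles): the same jitter now always single-steps.
    assert_eq!(classify_step(40, 0, 500), "single-step");
    // Undershooting merely zero-steps, which is safe: we just retry.
    assert_eq!(classify_step(-40, 30, 500), "zero-step");
    println!("ok");
}
```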
And your attack logic is basically mixed together with these attack primitives. So instead, what we want to do here is only implement these bare primitives inside the kernel, like programming the timer, manipulating these page tables, and cache priming and probing. And all of the other stuff is then moved out to user space, and we use an IOCTL API to trigger this behavior from user space. So then here we have this much nicer programming environment, and other people can simply link against this library and write their attack code with it. And one tiny note is that this execution loop of the VM is asynchronous from our IOCTL API, so changes only take effect the next time the VM exits. So we have some shared variables here for communication, but this is something you need to keep in mind when you program these attacks. Okay, so we achieved this goal of reusability. Let's move on to the second goal, interactivity. And to understand this a little better, I will go into more detail on how I envision this programming environment here in the user-space library. There we also basically want to have some kind of event loop. Initially, we set up some configuration, like: I want to get a page fault once this page is accessed. And then we basically want to wait until this event happens. And when this event happens, we want to react to it. In these attacks we usually have some kind of page-fault sequence that tells us when the VM is about to execute a certain function. And then maybe at this point we want to enable single stepping and do some steps, do a cache attack, this kind of stuff. So this is basically the process-event and update-config part here. And the really important thing is that once we get this event, we also want the VM here to basically wait for us to process this event, because if we would allow it to resume, we would again lose this precise control we wanted to have to manipulate the environment after every instruction.
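The event loop just described can be sketched as follows. All type and function names here are hypothetical, invented for illustration; the real SEV-Step library API differs:

```rust
// Sketch of the user-space attack event loop: react to page-fault events and
// switch the tracking configuration once the page of interest is hit.
// (Names are hypothetical; this is not the real sev-step library API.)

#[derive(Debug)]
enum Event {
    PageFault { gpa: u64 },     // VM touched a tracked guest-physical page
    SingleStep { retired: u64 }, // one instruction retired in single-step mode
}

#[derive(Default)]
struct AttackConfig {
    single_stepping: bool,
    tracked_page: Option<u64>,
}

/// React to one event: once the target page faults, stop page tracking and
/// enable single stepping (e.g. to then mount a cache attack per instruction).
fn process_event(cfg: &mut AttackConfig, ev: &Event) {
    match ev {
        Event::PageFault { gpa } if Some(*gpa) == cfg.tracked_page => {
            cfg.single_stepping = true; // target function is about to run
            cfg.tracked_page = None;
        }
        _ => {} // unrelated fault or a single-step event: keep current config
    }
}

fn main() {
    let mut cfg = AttackConfig { single_stepping: false, tracked_page: Some(0x8000) };
    process_event(&mut cfg, &Event::PageFault { gpa: 0x1000 }); // unrelated page
    assert!(!cfg.single_stepping);
    process_event(&mut cfg, &Event::PageFault { gpa: 0x8000 }); // target page
    assert!(cfg.single_stepping);
    println!("ok");
}
```

The crucial property from the talk — the VM stays paused until `process_event` returns and the event is acknowledged — is what the shared-memory protocol below the IOCTL layer provides.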
So we now also need a way to basically communicate from the kernel side to the user-space library, to be able to send these events and wait for these acknowledgments. And for this, we opted for a shared-memory protocol. So the library and the kernel code here simply agree on a shared memory page and then use a simple protocol with some spin locks to basically implement this. While this is maybe not the most elegant, it is very low latency, because it's just memory communication; you don't have any user-space/kernel-space context switch as with the IOCTL here, and it's also reasonably easy to implement. Okay, and this is how we achieve this interactivity goal. This is basically the current state of the framework. But to close up, I also want to give an overview of ongoing and future work. So one thing I've been working on a little bit already, and would really like to continue, is to improve this API, this programming environment, because right now you kind of basically have these start-track, stop-track commands. And when you start to write your attack code, as I've experienced myself, this can get quite messy and quite long really quickly. So it would be cool to have some higher-level abstractions for this. For example, a component that could track a certain page-fault sequence for you and restart the tracking if you get some unexpected access, and so on. And then some kind of mechanism or protocol to chain together these components so that you can structure your attack better, and also make it easier for people to get started by reusing these building blocks. And thinking about this even more, this is totally independent of the actual TEE underneath. So this is maybe something where the existing SGX-Step community could come together and build these libraries at a higher level, and then SGX-Step and SEV-Step, and I think the TrustZone one is called Load Step, could basically be integrated as drivers underneath, so that everyone could profit from this. Okay.
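The shared-memory event/acknowledge handshake can be sketched with an atomic flag standing in for the shared page. This is a minimal sketch, not SEV-Step's actual wire protocol; here two threads simulate the kernel side and the user-space library:

```rust
// Minimal event/ack handshake over "shared memory" (an atomic byte), with the
// kernel side spinning until user space acknowledges -- so the VM stays paused
// while the attack code processes the event. Illustrative only.

use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;
use std::thread;

const IDLE: u8 = 0;
const EVENT_PENDING: u8 = 1;
const ACKED: u8 = 2;

/// Run one full handshake and return the final state of the shared slot.
fn run_handshake() -> u8 {
    let slot = Arc::new(AtomicU8::new(IDLE));

    // "Kernel" side: publish an event, then spin until user space acks.
    let k = Arc::clone(&slot);
    let kernel = thread::spawn(move || {
        k.store(EVENT_PENDING, Ordering::Release);
        while k.load(Ordering::Acquire) != ACKED {
            std::hint::spin_loop(); // VM is not resumed while we wait here
        }
    });

    // "User space" side: spin until an event arrives, process it, then ack.
    let u = Arc::clone(&slot);
    let user = thread::spawn(move || {
        while u.load(Ordering::Acquire) != EVENT_PENDING {
            std::hint::spin_loop();
        }
        // ... process the event here (cache probing, config update, ...) ...
        u.store(ACKED, Ordering::Release);
    });

    kernel.join().unwrap();
    user.join().unwrap();
    slot.load(Ordering::SeqCst)
}

fn main() {
    assert_eq!(run_handshake(), ACKED);
    println!("handshake complete");
}
```

Because both sides only spin on a memory location, there is no syscall or context switch on the hot path, which is exactly the low-latency property the talk argues for.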
And this is more or less it. You can again find the links for SEV-Step and also for SGX-Step, which I mentioned here. They are both open source and on GitHub. Feel free to check them out. Send me a pull request if you want to change something, or create an issue if something is broken. And yeah, thank you so much, and I'm happy to answer questions now. Yeah. Yeah, thank you for the very interesting talk. A new side-channel attack for me. Now you've shown how to break things; do you have some ideas how this kind of attack could possibly be mitigated? Yeah, so that's a really good question. So for SGX, there recently has been a paper on what is called AEX-Notify. And the basic idea there is to make the SGX enclave interrupt-aware and then execute a special handler that will prefetch this first instruction that I showed, so that you can't do this flush-the-TLB-and-make-everything-really-slow approach, but instead ensure that the first instruction always executes really fast, and this then mitigates this attack. And for TDX, which we just talked about, there's also some mitigation built into the TDX module. And for SEV, we are currently looking into ideas how we could protect SEV VMs against this. Thank you. Thank you, Luca. Yes, we're back. So can you elaborate a bit on how much of this is SEV-specific and how much of it is actually, let's say, KVM-step? Let's say, if you don't have a mitigation in TDX, can you just launch this as-is on any kind of VM, or is this specific to SEV in any way? Thank you. So I don't think it's really specific to SEV, because this ability to flush the TLB should also be available with VMX, with the hardware acceleration from Intel. I think the basic primitive should apply. I also know that there has been an internal prototype called TDX-step, which is on one of the Intel pages, so they basically built something similar for this.
So I guess in principle, this should apply to all VM-based systems where the VM can be forced to exit by external interrupts. There's one more question. Can you repeat it? Do you also have plans for TDX? It's definitely really interesting. The question was if we also have plans for TDX, and as I've said, TDX has a built-in countermeasure, but I guess it would of course be interesting to try to figure out how this works exactly, and whether you can do something there.
The ups and downs of running enclaves in production
All right guys, so back to the matter of the day. The next speaker is Kian, who works at Evervault, and I think it's quite exciting to have a bit of a complementary perspective in, let's say, this exciting new field, where we talk a lot about new technologies, but you will actually talk about how to use them in production. So take it away. Thanks. So I work for Evervault, and I will talk about Evervault to begin with, just so you know why we use enclaves in production and not just traditional computing. So we offer... also, I don't know how loud that is; if I'm too quiet, tell me so I can speak louder. So we offer tooling to allow customers to guarantee data security in different forms, like encryption before it ever hits your system, or ways to process said encrypted data in secure environments, and so on and so forth. At the core of all of this is enclaves. We're running on AWS, so we're using the Nitro Enclaves, which as far as I can tell aren't as open source as the Intel SGX or any of that stuff. But we've been doing this for a couple of years now, and when we started, that was the best we could find for doing VMs that guaranteed the security model that we required. So, like I said, encryption. So yeah, we're running in fully isolated VMs, where we can basically see nothing that's happening inside the VM without a lot of effort on our part, which is mainly so we can protect our users' data. So just to give the context: Relay is our main product, is what I would say. It's an encryption proxy; you put it in front of your service and you define some rules, and before the requests ever hit your service, the rules are applied and all your data is encrypted. Sorry, I lost my mouse. So yeah, it's very much focused on web services, but it's mainly for people who want to de-scope their environment so they can be more easily PCI compliant, or protect HIPAA data, and stuff like that.
Relay doesn't run in an enclave, mainly due to performance reasons, because it's processing lots of network requests and we want it to be quick, because encryption is slow and we don't want to add overhead for our users. So we store all of our keys inside a KMS that is accessed from a secure enclave. That service... we have no access to the keys then. On startup it tests connections to the KMS, pulls down user keys, decrypts them, and then we are able to process the users' requests, and outside of that environment we can't decrypt anything. This started, though, when more users joined us and we started to scale. At first we just had a lot of automation. That was stuff like: how do you run Docker containers in enclaves, and how do you make sure that you can scale up or scale down? AWS Nitro Enclaves are guest VMs on top of EC2 nodes. There's not much automation about actually running what's in there, so we had to build it all ourselves and get all of that running and actually serving requests for our users. So after we got all that running, we had issues with just the libraries in general. So the parts of AWS Nitro that are open source are all the interface libraries for connecting to them, but we found that there are many, many edge cases where they were just very poorly documented, or not documented at all, about how you interact with it, how you work with the proxies. So for reference, for those that haven't used it: there is a vsock on your host for communicating with the secure enclave, and this is the only I/O you have in and out of the VM. You then need to manage all the connections yourself, and how you transfer data in and out and communicate. We ran into some really fun problems, though, trying to use this and talking to the AWS guys about using their library. The funnest one, I think, was that we had file descriptor leakage.
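Since the vsock only gives you a raw byte stream, one thing you end up reimplementing is message framing. A common minimal scheme, shown here as a sketch (this is not Evervault's actual wire format), is a 4-byte big-endian length prefix followed by the payload:

```rust
// Length-prefixed framing for a raw byte stream such as a vsock connection.
// Illustrative sketch; not the actual Evervault protocol.

/// Encode one message: 4-byte big-endian length, then the payload bytes.
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(4 + payload.len());
    buf.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    buf.extend_from_slice(payload);
    buf
}

/// Try to decode one frame from `buf`. Returns the payload and the number of
/// bytes consumed, or None if the frame is still incomplete (partial read).
fn decode_frame(buf: &[u8]) -> Option<(Vec<u8>, usize)> {
    if buf.len() < 4 {
        return None;
    }
    let len = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    if buf.len() < 4 + len {
        return None; // wait for more bytes from the socket
    }
    Some((buf[4..4 + len].to_vec(), 4 + len))
}

fn main() {
    let frame = encode_frame(b"encrypt-me");
    let (payload, used) = decode_frame(&frame).unwrap();
    assert_eq!(payload, b"encrypt-me");
    assert_eq!(used, frame.len());
    assert!(decode_frame(&frame[..5]).is_none()); // partial read: not ready yet
    println!("ok");
}
```

Handling the "not ready yet" case explicitly matters on a vsock, where a single read may return only part of a frame.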
Our guest VMs were dying because we just couldn't connect to them anymore, because we ran out of file descriptors on them, which I had not seen in a long time, aside from just breaking my own machine, which was fun. Turned out we just made some assumptions about how stuff worked, because we thought, oh, this is how it works in Rust, and no, that wasn't how it worked in the library. We were just not reading the code, but we needed to read the code, which was unfortunate, because I would have liked it to be in the docs. But yeah, it really showed that there were no metrics or observability for these enclaves. We weren't able to know what's happening inside them or how to interact with them. So we started trying to monitor them. This was interesting. Like I said: no metrics, no nothing. I realize you probably can't see a lot of those graphs, but these were our load tests. We started to try and get metrics out of them, because there's limited I/O. We didn't want to just put a metrics collector inside them and shoot all the metrics out to Datadog or AWS. So we started instrumenting the clients that we were talking to it with, and we started sending load data and trying out different workloads. So, a lot of black-box testing. This was several weeks of just staring at graphs, during which I may have gone a little insane, but we're here now and it worked. So once we got through it all, we were able to find different bottlenecks in the code, and based on guesses and automation changes, we were able to go from (I don't know if you can see that) about 1,500 encryptions per second inside the enclave to about 5,000 encryptions per second, just by switching our default curve, which we hadn't ever considered, because we encourage our users to set the curve. But it made massive improvements for us.
But we had no idea that the encryptions themselves were the bottleneck, because we couldn't see what was happening inside of our enclaves, or the VMs, and know where our workloads were slowing down. So once we started doing the observability, we really went in on it. So we did this black-box testing and we found the limit pretty quickly. We had to guess where the bottlenecks were, and there was a whiteboard in the office of, like, here are ideas we have to try in different configurations. We just worked our way through, taking each box off and turning things on and off, until we were able to actually get some improvements from it. We then started working on a level of... so, AWS does have a concept of debug logs, but the moment you turn it on, your enclave isn't actually attestable anymore. The attestation variables all just turn to zero, and you're not able to attest your connection. And like I mentioned before, we need to be able to attest the connection to the KMS to actually even load keys into it, so we couldn't run in debug mode at all. We had to figure it out. So we had to basically reimplement a level of tracing. If anyone is familiar with OpenTelemetry and such, we had to come up with a way of doing trace requests inside of it. We couldn't use OpenTelemetry because it had no understanding of how to communicate outside of the VMs. We had to take the concepts, reimplement them, and come up with a way of batching requests, sending them out, and limiting the amount of I/O overhead we were adding while doing that. We eventually got there and we were able to monitor our boxes. That's when we started to notice more problems. So we basically had these two processes in the enclave talking to each other, and we expected the green line there. The yellow line would be perfect; that was our local dev environment. But the green line is what we wanted to see in production. The blue line is what we were seeing in production. I've lost...
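The batching idea behind that home-grown tracing can be sketched as follows. This is an illustrative stand-in, not Evervault's implementation: spans accumulate in a buffer inside the enclave, and only full batches go out over the vsock, bounding the I/O overhead per span:

```rust
// Sketch of a batching span exporter: one "vsock write" per batch instead of
// one per span. Illustrative only; not Evervault's actual tracing code.

struct BatchingExporter {
    batch_size: usize,
    buffer: Vec<String>,      // spans waiting inside the enclave
    flushed: Vec<Vec<String>>, // stand-in for batches sent over the vsock
}

impl BatchingExporter {
    fn new(batch_size: usize) -> Self {
        Self { batch_size, buffer: Vec::new(), flushed: Vec::new() }
    }

    /// Record a span; flush the whole buffer once it reaches the batch size.
    fn record_span(&mut self, span: &str) {
        self.buffer.push(span.to_string());
        if self.buffer.len() >= self.batch_size {
            // In the real system this would be a single write out of the VM.
            self.flushed.push(std::mem::take(&mut self.buffer));
        }
    }
}

fn main() {
    let mut exp = BatchingExporter::new(3);
    for i in 0..7 {
        exp.record_span(&format!("span-{i}"));
    }
    assert_eq!(exp.flushed.len(), 2); // two full batches went out
    assert_eq!(exp.buffer.len(), 1);  // one span still buffered
    println!("ok");
}
```

A real exporter would also flush on a timer so a quiet enclave doesn't hold spans forever, but the batch-on-threshold core is the part that limits vsock traffic.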
I wasn't allowed to put the numbers on the lines, to be specific here, but that was about a 20x slowdown, I think, which was insane. We're still debugging this one. We're not 100% sure where the bottlenecks are. We're fairly certain the virtualization of the network layer inside the containers is just insanely slow. So what we're looking at is how we can short-circuit that. There are some things like sockmap; you can re-route sockets. But effectively, you can't just take a process, or take two processes, and throw them into the VM and think that it will work. It works on my machine; it does not just magically work. You need to really tune the system to actually be able to talk effectively. We're still tuning it. We're hoping to have some stuff to note soon about ways to speed it up with sockmap and different improvements. Like I said, it's seemingly either the VM or the user-space networking. The fun one, which I think a lot of people who have worked with enclaves will go "duh, of course" at: we had time slippage. There's no NTP in an enclave. You can mount the PTP device of the hypervisor, but again, that invalidates our security model for PCI. So we had to actually synchronize with NTP, which meant we needed to add another layer of periodic work that needs to be done by the guest box to ensure that the VM could actually know what the hell time it was. We noticed that we were losing a second a day, which is quite a lot, and that was based on traffic volume as well: more traffic, more time we lost. But if we did nothing, it was just one second a day. That really bit us when we had to do anything that was time-sensitive, such as token validation. So auth effectively broke if a VM was running for more than three days, which led us to a cron job that just cycled VMs every three days for a little while, until we reimplemented NTP through the vsock. Fun. These are a lot of, like, yeah.
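The drift arithmetic above is worth making explicit. Assuming (my assumption, not stated in the talk) that token validation tolerates roughly three seconds of clock skew, a drift of one second per day breaks auth after about three days, which matches the three-day VM recycling cron job:

```rust
// Back-of-the-envelope check of the clock-drift problem. The 3-second skew
// allowance is an illustrative assumption; the 1 s/day drift is from the talk.

/// Days until accumulated drift exceeds the allowed clock skew.
fn days_until_auth_breaks(drift_secs_per_day: f64, allowed_skew_secs: f64) -> f64 {
    allowed_skew_secs / drift_secs_per_day
}

fn main() {
    let days = days_until_auth_breaks(1.0, 3.0);
    // ~3 days of uptime before token validation starts failing,
    // hence the cron job cycling the VMs every three days.
    assert!((days - 3.0).abs() < 1e-9);
    println!("auth breaks after {days} days");
}
```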
So we kept running into issues and we kind of said: why is this so painful? It should be easy to just deploy a service into an enclave and give other people the ability to say, yeah, that person who hosts my cloud computer definitely can't see the data being processed, and I can guarantee it. Really useful for health data or financial data, which are our main customers. So we put it all together and have a product called Enclaves, if you want an easy way to do hosted enclaves. So you can effectively give... we'll give you a Docker container. No, we don't give you anything, actually. We give you a CLI, and you build the Docker container, with a Dockerfile, into a secure enclave. You are given PCRs, so it's fully attestable. You give us your secure enclave and we run it for you. We push our data plane and control plane into the enclave, and it talks to the control plane that we use so you can leverage it. All of that is open source, so you can reproduce the build yourself and validate that all the attestation is the same, and ensure that everything is communicating effectively and there's no... well, me or my team aren't, like, messing with your code or changing it or anything like that. So it's just regular Docker containers. The connection is fully attestable and you can connect to it. I see 10 minutes left, and I probably don't need that long. But yeah, we're working on this. We're taking everything we learned from building our own service and putting it into our Evervault Enclaves, and it's on our GitHub. If you want to have a look and go through it: we want people to be able to look at it, see that we're not doing anything wrong, and try it out, and hopefully have a better experience getting onboarded with confidential computing than we had, because it was a lot of throwing stuff at the wall, seeing what broke, where it broke, and trying to figure it out. I'm going to go for questions then. You said you had problems with curves. Presumably you'd be using ECC.
Do you have any idea why the curves might have been a problem? Are you hitting page boundaries, packet boundaries, or any ideas? Yeah, so what we were seeing was that it was in the CPU. There were optimizations that we hadn't accounted for. So by default, the boxes we were developing on, ARM Macs, were highly optimized for the curve we were using by default, which led us to say: great, look at the performance here on our local machines. Deployed to production, performance crashed. Turns out the AWS boxes we were running on were optimized for the k1 curve, or the r1 curve, I can't remember which one it is now, but basically the other curve. And when we were running in the enclave, those local optimizations no longer held true. So we were able to get the 20x performance gains from that, I think. Anyone else? Can you elaborate a bit on the nature of the payload, or whatever you're executing there? Because, I mean, we saw there pretty much encryption transactions, but what exactly was running there? So, what do we run in the enclave? So for the benchmark: the benchmark was basically fuzzing, is what we were doing. But we'd send... So, as I mentioned, in the enclaves we have all our customer keys cached. So we had one of our keys in there, and we would send 20,000 fields to encrypt. And we'd say: for each of these fields, we're going to iterate through this dictionary and encrypt it. So we'd send just a generic JSON blob. But for purposes of encryption, we could send just... it could just be a Boolean or a string or whatever, and just send it in. And we then would iterate through that JSON blob, and it would say: I am this user or application, which would then cause the service to choose the right key. And it would say: these are the fields inside the JSON blob to find and encrypt. So it was a JSON blob, an ID, and fields to encrypt. A very simple payload, but it was just iterative work.
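That payload shape (an app ID, a list of field names, and a blob) can be sketched in a few lines. To stay self-contained, the blob is a plain map rather than real JSON, and "encryption" is a placeholder transform; the real service would look up the application's key in the enclave and encrypt with it:

```rust
// Sketch of the benchmark payload: app ID + field list + blob, iterating the
// blob and "encrypting" the listed fields. Placeholder transform, not crypto.

use std::collections::HashMap;

/// Encrypt (here: tag) each listed field that exists in the blob, in place.
fn encrypt_fields(app_id: &str, fields: &[&str], blob: &mut HashMap<String, String>) {
    for field in fields {
        if let Some(value) = blob.get_mut(*field) {
            // Real service: select the key for `app_id` inside the enclave
            // and encrypt `value` with it. Here we just tag it.
            *value = format!("enc[{app_id}]:{value}");
        }
    }
}

fn main() {
    let mut blob = HashMap::from([
        ("card_number".to_string(), "4242".to_string()),
        ("name".to_string(), "alice".to_string()),
    ]);
    encrypt_fields("app_123", &["card_number"], &mut blob);
    assert_eq!(blob["card_number"], "enc[app_123]:4242");
    assert_eq!(blob["name"], "alice"); // fields not listed stay in plaintext
    println!("ok");
}
```

The point of the sketch is the iteration pattern: 20,000 such fields per request is purely serial work, which is why the curve's per-operation cost dominated the benchmark.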
And because of how the encryption is implemented, it's all blocking work. So we'd have to farm out the work differently. Not directly related to enclaves, but when we did the load testing, we determined that we were blocking and dropping connections in the service. So what was happening was: we'd schedule the work on the enclave, and then the connection from the upstream service would die. Then we wouldn't propagate that connection dying downstream. The enclave would do the work, try to send the encryption back, and then go: oh, no one wants this work, and stop. So we had to put some keep-alives on connections. But these are, again, the things we missed, because we were having to reimplement what would just be generic TCP or HTTP for talking over the vsock to the enclave. So, you mentioned that the architecture you are using made you adapt your cryptographic parameters. How would that scale into the future? I mean, crypto agility: any words on that? I don't know. I'm the SRE who's meant to make stuff scale, but that's actually outside of my domain. We have people in the company who understand cryptography a lot better than me, who would be able to answer that question. I can give you an email address if you want to talk about it, but I can't speak to that myself. Thanks a lot for the great talk. So I wanted to go a little bit back to the use case you presented in the beginning. And I might have missed something, but it sort of sounds to me like the use case here was not really a sort of protection at runtime, but kind of a long-term protection of the keys, not while they are used by the proxy, but while they are in storage. So did you consider other solutions for this, like HSMs, and do you have any insights there as to why you ended up choosing the Nitro Enclaves for this particular use case? So, I'll be honest, that predates me at the company. I'm not sure why it is.
I would say that we did a level of evaluation that was probably not too deep. We were a startup finding our feet at the time, and we had implemented a level of encryption just inside a process. And then, when we attempted to secure it and build it out, the enclaves seemed like an easy solution. I think we've proven they were not an easy solution. But yeah, what we validated were just ways to do encryption that would guarantee we didn't have access to users' keys and couldn't decrypt any of their data. And yeah, enclaves seemed easy; in reality, not so easy. So, there's one online question. Can you explain the attested-TLS protocol that you use? Is the protocol specified somewhere, and has it been formally verified? So, we actually had to reimplement it. I can't remember which one we did, but we looked at one that was done by the Confidential Computing Consortium, or rather the paper that was published on it, and attestation inserted into the TLS connection was our original implementation. I can't remember the specifics of it, so I will have to refer you to our git history on this. We deployed it, and we were able to run it in production, but people had to add our root CA to their root CA store, because you couldn't extend TLS in the way that was specified for customers. So we eventually had to switch to a new attestation scheme, which unfortunately I'm not the expert on. But it is available; it's written in Rust, and it's actually linked on the talk's page, in the files, under attestation-bindings. So anyone can look at the protocol we use for attesting it. Effectively, we leverage the PCRs that are provided by the underlying Nitro Enclave, and then we have an attestation protocol that we use to attest the TLS connection.
On connection, we do a TLS handshake that then performs the attestation, and the client must supply the attestation bindings. We have implementations on the client side in Go, Rust, Node, Ruby, Python... actually not Ruby, just Python and Node and Go. Oh, and Swift and Kotlin. I will ask it like this, because of the interference with the microphone, and then you can repeat it. Yeah, sure. So there was also a bit of discussion in the chat here about Nitro Enclaves and how far you can go in calling them enclaves, and I know this is an endless debate; we even had an extensive debate last year. Maybe can you briefly react to that, and maybe also say a bit about the infrastructure you built, how tied it is to Nitro, and whether the same problem can be solved elsewhere? Yeah, so... oh yeah, sorry, repeating the question: it's the debate about Nitro Enclaves versus other TEEs. They are not, as I said, as open source, because it's mainly the client side that's open source rather than the server side, and it's mainly just white papers, I believe, that specify how the Nitro Enclaves operate, or just documentation. And the other part of the question was? How specific is the tooling you built in the company? Yeah, so how specific is the tooling to Nitro? So, we did evaluate other cloud providers to see if we could move off to them. This was done a year and a half ago. We looked at Azure for doing it. Azure didn't have the new Intel SGX... or is it SGX? Sorry, TDX, sorry. They didn't have TDX at the time, so we validated that it couldn't fit our model of secure computing. We probably need to reevaluate now, but the tooling is very AWS-focused right now, and Nitro Enclave-focused, because it was about trying to make Nitro Enclaves easier for us to use. Conceptually, though, the control plane and data plane aren't specific to that.
So far, they could be reimplemented for anything that wants to do TCP over a network connection from inside the enclave to outside the enclave.
Integrity Protect Workloads with Mushroom
All right. So the next speaker is Tom Dormann, and Tom is a real hacker. So I first met Tom last year at the CCC event, where he talked about an attack he did on NX, which I did a bit about as well. And I think it's very inspiring that in the Dev Room we have these great company talks, but it's also really nice, in the real, let's say, free software ethos, that Tom does some of this work in his free time, as pure hobby projects. And he'll talk a bit about the work he's been doing on AMD SEV. Sure. So thank you for the introduction. Today my talk will be on integrity-protecting Linux workloads with Mushroom. Okay. So what am I going to talk about? Well, first up, we'll talk about some of the goals of Mushroom. Then I'll give a short demo to show you how it actually works. And then we'll talk about the higher-level architecture and some of the parts in particular: the supervisor, the kernel and the VMM host. Then we'll also talk about some of the things we don't want to do, some non-goals. And finally, we'll briefly touch on how attestation works with Mushroom. Okay. But before that, a brief thing about me. So my name is Tom Dormann. I mostly do development and security research, and my day job is also reverse engineering. Here are some of my links. And one thing about me is that I also really love Rust, so all of the code that you may see here today is also written in Rust. Okay. So what do we want to do? The main goal is to run Linux programs securely. In particular, we want to run programs that just transform an input file, or maybe multiple input files, into an output file, or potentially multiple output files. And while doing that, we want to prevent any tampering during the process, so that we can make sure that the output files are authentic. So for example, one use case would be that you have some untrusted server that you want to compile code on.
And ideally, you don't want to trust that server, but still be assured somehow that there hasn't been a backdoor injected somewhere in your code, and just have assurance that the code has been compiled without any tampering. So yeah, I'll give a brief demo of that. Okay. So, I already talked about workloads. Mushroom is completely generic in what kind of workload you want to run. It has to be a Linux binary, but that's basically it. So for this example, I chose a simple Docker image, just because it's easy to set up. In this case, it's an Alpine image which has GCC and musl installed. And it will run this init script, which just copies the input file that we want to transform to another file on the file system, and then runs GCC on that. And in the end, it takes that output and copies it to a special output device. And so the file that we want to compile is this one right here. Yeah, so it's just a simple hello world, just a proof of concept. Okay. So, yeah, I should clear that. Okay. So beforehand, I already set up some environment variables, just for some of the components. But the important thing is this one right here. Okay. So what we'll do is we'll run this command, which, as you might already notice, contains some information like the input file that I just showed to you. It also specifies the output, and it also specifies where to put the attestation report, because that's, in the end, how we really know that the process hasn't been tampered with: that attestation report. And so we'll run that. In this case, it will actually take a bit longer than usual, because the Docker image is fairly large. It's like under six megabytes or something. And just loading that is a fairly slow process. But any second now, the workload will start running. Okay, now it's started running. And now it's finished. Okay, so let's take a look at the output file. So, just file test.
And we can already see that it's a 64-bit ELF binary, which is of course expected, because we compiled a C program. But before we actually run the executable, let's actually verify that it hasn't been tampered with. And we can do that by just using the same command that we used previously, but instead of saying run, we use verify. So we use the exact same configuration parameters, and this takes very little time. And it says okay, so we know that the process hasn't been tampered with. And so, as the last step, let's actually make it executable and run it. Yeah, so you can see that also works. Okay, so now that we saw what it's supposed to be doing, let's talk about some of the details of how it's implemented. And the first thing to note here is that it's implemented using SEV-SNP. So in this case, we have full virtualization. The workload is of course supplied by the user, which in this case was GCC. And then around that, we have a completely custom kernel, which we'll also talk about later. And around that, we then have the so-called supervisor, which is a concept I came up with, which is basically just responsible for communicating between the kernel and the host. And the important thing to note here is that most of the logic is actually in the kernel, and this will probably grow a lot in the future as well. The supervisor is fairly small and will probably not grow a lot in the future; it might even shrink. And even in this configuration, there's some code that's disabled at compile time, because it's only there for debug features. Okay. So, about the kernel: it's completely written in Rust. It just implements the Linux syscall interface, so that we can run unmodified Linux programs. It currently implements 83 syscalls, more or less, because some syscalls have a lot of flags and we don't implement all of those.
But still, it's enough for some applications at least. Apart from that, we also support 32-bit and 64-bit binaries. The reason we have this kernel is that usually you have a lot of bloat, a lot of stuff that you just don't need. So the reason we have our own custom kernel is that we can just throw things away and only implement the things that we need. We'll also need that for some things we'll talk about shortly. Okay. So the really interesting thing about mushroom, I think, is the supervisor. I already said that it handles communication between the host and the kernel — what does that mean? Well, the first thing the supervisor does is actually load the input. The input is not part of the initial measurement; the reason for that is that we don't want the measurement to change every time the input changes, because then we can't sign it — or at least not in a way that really makes sense. The other thing is memory hotplug. Initially, mushroom starts out with a very small amount of static memory, and after that we use memory hotplug to load in more dynamic memory once it's needed. And lastly, the thing that we do during runtime is scheduling: if the kernel wants to run another vCPU, it somehow has to tell the host about that, and that's also a responsibility of the supervisor. The interesting thing here is that this communication is not just a convention — it's not just that the kernel chooses to talk to the host through the supervisor. It's actually impossible for the host to talk to the kernel directly. The reason for that is that we want isolation there: we don't want the host sending potentially malicious input to the kernel, and we want to prevent vulnerabilities by just not having an interface there. This is implemented using a couple of hardware features.
For example, one of them is Virtual Top of Memory (vTOM), which basically makes it so that the kernel can't access shared memory — which would of course be needed to have shared access with the host. Another feature is #VC reflection: in some cases you would need the hypervisor, and instead of going to the hypervisor, we can offload that responsibility to the supervisor. That way, the kernel doesn't even really have to be aware that it's running in an SNP VM. Lastly, the separation between the kernel and the supervisor, which is of course also important, is done using VMPLs — virtual machine privilege levels — which basically make it so that the supervisor is allowed to access all memory, but the kernel is not. For example, the supervisor has some secret keys that it uses for attestation, and the kernel is of course not allowed to access those secret keys. The important thing here, though, is that the supervisor is the only security-critical component. The kernel can have as many bugs as it wants — the host will never be able to talk to the kernel directly, so it doesn't really matter if there are security bugs in there. And this is of course really nice for auditing, because the only thing we have to audit, and make sure actually works, is the supervisor, which is, once again, a fairly small component. For the VMM, we don't use QEMU or anything. The reason is that we have this fairly custom memory hotplug and so on, and all those interfaces for getting data in and out. So instead of using something that already exists — which might have abstractions that are not ideal for us — we just implemented it ourselves. It's not actually that complicated, because, once again, we don't have that much host-guest communication, so this VMM doesn't really have to implement a lot.
And as of a couple of weeks ago, it also supports running the kernel outside of an SEV-SNP VM, which is really useful for debugging and profiling — and maybe not everyone has an EPYC CPU that can actually run those VMs. Okay, so we've already talked a lot about things we want to do, but there are also things that we don't want to do. One of the important ones is that we don't want to do I/O at runtime. If I want to run GCC, I don't need network — I will never need that; it's just not a thing that we need. And the nice thing is that by not having network, we can reduce the attack surface drastically, and once again reduce complexity in the supervisor and in the kernel, and mitigate vulnerabilities by just not implementing interfaces. Of course, there are a lot of use cases where you do need network, but in those cases you can just use standard Linux, you can just use other projects. The point is that for a lot of projects and workloads you don't need the extra complexity, and by just not implementing it you can lower the potential for vulnerabilities. The same logic goes for persistent storage. Every time mushroom boots up, you boot into a tmpfs with all the files supplied during initialization, and once the VM exits, all that memory is destroyed — because for a lot of use cases, you don't need a persistent disk, and by not having one, you can, once again, lower complexity. Similarly, we also have fairly low complexity in the supervisor, which, once again, is the one part that's actually security-critical. One thing you might have noticed is that none of the things the supervisor is doing are really CPU-bound or performance-critical in any way.
So, for example, we can get away with just not implementing multithreading, because in reality there's nothing that requires that amount of performance — nothing that could get a real performance boost from multithreading. And by not implementing multithreading, we can once again eliminate a whole class of concurrency bugs, because those just can't happen if you don't have multithreading. Similarly, the supervisor is fairly simple and doesn't actually need a heap — and once again, you can't have heap bugs if you don't have a heap. So I think those non-goals are also really important, because they constrain the things that we want to do, and that way increase security by setting up clear goals. Okay. Lastly, let's talk about attestation. I already talked about the measurement. In this case, it covers all of the binaries that get loaded: the supervisor, the kernel, and the init binary. Those could be signed in the future; currently we just compare the raw hash. So when you load in the image, the SEV firmware hashes all the memory, chains it together, and produces a hash that someone could sign — but we currently don't. The host data field is also a field supplied when the VM is booted up, and this field just contains a hash of the input file. The first thing the supervisor does when it boots up is load the input file and verify that that hash is correct. It doesn't even really look at the data — it just hashes it. That way, there's hopefully no way for the input file to be malicious and influence the supervisor before it's actually been verified to be the one we want to see. And lastly, of course, we also want to attest the output, and that is put in the report data field.
And this is also interesting, because it's actually the only field the guest can influence at runtime. Both the measurement and the host data field are set by the SEV firmware, and even if you have a malicious input file or a malicious input binary, you can only modify the report data field. This is really important, because if you assume some untrusted input, it will never be able to forge an attestation report in such a way that it pretends to come from another host data value, from another input file. So by making this simple design choice, we can hopefully reduce the potential for any vulnerabilities there. And this is another place where it's comparatively simple compared to other projects, because we only do attestation at the end of the process. We don't have any certificates during runtime — because we don't have any I/O at runtime, we just don't need the certificates that would usually be required to interface with other services. I can see why there are a lot of problems in that area, but that's just one of the things this model doesn't really need. Similarly for the encryption case. So the attestation model for mushroom is just really, really simple, and hopefully built in such a way that it's actually easy for external people to audit, if they want to do that. Okay, so do we have any questions about that? Thanks a lot for a very interesting talk. I particularly liked the demo that you showed, because this use case of actually running a compiler inside a CVM is a very, very desirable property in build environments, where you want to have this notion of hermeticity — where you actually record the entire toolchain that you use to produce software.
So, related to this, I had a question about the trust assumptions here. You talked about the supervisor being the only security-critical component, but that basically only applies to the communication between the outside world and the kernel. You later mentioned that you can still have attacks via the input itself — so for instance, if I have malicious code that targets some vulnerability in GCC, let's say, that's still possible, right? But on the other hand, that somehow gets recorded as part of the attestation. Can you elaborate a bit on these aspects? Yeah, great question. So, yes: if you have a malicious input, that would show up in the attestation report. And ideally, if you have a scenario where you want something like a code cache, where you compile code once, you will only supply inputs that are not malicious — so as long as you don't request malicious inputs, you will not get malicious outputs out. In theory, there could be attacks from the inside, but that's not really a problem, because that always shows up in the attestation report, and a normal user will not request that. Yes — the question was whether or not this is auditable. And the answer to that is yes, everything shows up in the attestation report, so hopefully that's not a threat. Any other questions? Thank you, this was awesome. It's not a question, it's a feature request: if you could spit out an SBOM from the compilation, that would be fantastic. Well, the thing about that is that mushroom is not necessarily only meant for compilation processes — but if you want to do that, that's great.
One of the things I've been toying around with was running Nix builds in it. And of course, the way Nix works, all the inputs are already specified in the build hashes. So in that scenario, you would more or less have an SBOM, at least traceable to some input — but that's independent of mushroom. Of course, that's also a use case I intended. Okay. First of all, very awesome work — I really like that you show that these confidential-VM-based solutions can also be used with very tiny trusted computing bases. That's nice. And I mostly agree with your design choice of the non-goals, but you said that you don't support multithreading. Wouldn't that be somewhat important for compilation, to be able to run on multiple cores? It's kind of CPU-consuming. Yeah, sure — the thing about multithreading only applies to the supervisor. The actual kernel can run on as many cores as it wants. Technically, there's currently a limit of 128, but that could be changed, and it's probably enough. Maybe a question also moving forward: you mentioned SEV-SNP support — is that baked into your design? I'm thinking about the VMPL support. Okay, so the question was whether or not my design is tied to SEV-SNP, or whether it could also apply to TDX. Currently, the supervisor is highly specific to SNP, but I don't see a reason right now why it couldn't be implemented for something like Intel TDX — that should probably be possible. I mean, the VMPLs are SNP-specific, but I think TDX has something like partitions; maybe that could be used. I'm not sure — I haven't looked into that. Yeah.
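To summarize the attestation model from this talk in code form: the launch measurement covers the loaded binaries, host_data binds the input file, and report_data binds the output. The sketch below is an illustrative Python mock — the real SEV-SNP report is a signed binary structure, and the field names, the use of SHA-256, and the hash chaining here are simplifying assumptions, not mushroom's actual format.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def mock_report(binaries: list[bytes], input_file: bytes, output: bytes) -> dict:
    # Mock of what firmware + supervisor produce in this model:
    # - measurement: chained hash over supervisor/kernel/init (fixed at launch)
    # - host_data: hash of the input file (fixed at launch; the supervisor
    #   checks it before ever interpreting the input)
    # - report_data: hash of the output (the only guest-influenced field)
    chain = hashlib.sha256()
    for binary in binaries:
        chain.update(sha256_hex(binary).encode())
    return {
        "measurement": chain.hexdigest(),
        "host_data": sha256_hex(input_file),
        "report_data": sha256_hex(output),
    }


def verify(report: dict, known_binaries: list[bytes],
           input_file: bytes, output: bytes) -> bool:
    # A verifier recomputes all three bindings from known-good values;
    # tampering with the code, the input, or the output changes some field.
    return report == mock_report(known_binaries, input_file, output)
```

In the real system the report is additionally signed by the SEV firmware, so the host cannot fabricate it; the mock only shows which fields bind what.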
Reproducible builds for confidential computing: Why remote attestation is worthless without it
All right, let's get going. Our next speakers are Paul and Malte from Edgeless Systems, and they're here to talk about remote attestation and reproducible builds. Yeah, thanks. I will start with some motivation. The topic of the talk is reproducible builds for confidential computing, and why we need them. So first, the motivation: what is the situation with confidential computing? We have trust issues, especially when we're running in the public cloud. First of all, we trust no one. Well, that's not entirely true — we need some hardware we trust, so we have to trust the hardware manufacturer. And for all the other components that we are using, we have to establish trust before we can rely on them, and we're doing this using remote attestation. So, a quick overview of remote attestation, based on the RATS RFC. Here we have our three entities: the attester, the verifier, and the relying party. The goal of the remote attestation procedure is that the relying party can place trust in the attester system. How are we doing this? Inside the attester, there's an attesting environment and a target environment. The attesting environment takes measurements of the target environment, and then hands out evidence that is verified by the verifier. The verifier uses two kinds of resources to verify the evidence: first, endorsements, which usually provide guarantees about the authenticity and integrity of the evidence, and then reference values, which are compared to the claims inside the evidence. The verifier does the verification and produces an attestation result, and that attestation result is consumed by the relying party. Using this attestation result, the relying party can place trust in the attester system. The aspect of this remote attestation procedure we want to talk about here is the reference values.
As I already said, we use the reference values to check the claims inside the evidence. Some of these reference values represent the code identity of what we are actually running inside our TEE, and often these values are hashes over what we are executing. As we all know, hashes are one-way functions, so it is really difficult to go back from a hash to what was actually hashed. Many questions arise from this: where do these hash values come from? Who produces them — who is our reference value provider? What do these hashes stand for, and how can we establish trust in them? And often the answer is: we just can't. In this talk, we want to present a way to establish trust in, and give meaning to, those reference values. So why might this be a difficult task? The main scenario we are talking about here is CVMs, and these CVMs have quite large TCBs. We need to cover all of our software components with these reference values — and there are quite a lot of components: firmware, bootloader, kernel, user space. We need all that stuff. That can be quite a lot of lines of code — not always just a few lines of code like in mushroom. But the more interesting question is: who is part of our trusted computing base? Software vendors, usually — and usually a lot of them. There are different ways we include people in our trust domain. Maybe the simplest one is that we consume code from other people. That's quite usual, and it's also okay: we can audit the code before we include it, and ideally our language ecosystems provide us with some mechanism to pin the dependencies that we use by a hash or so. So that's okay. The second mechanism is more problematic: we can consume binary artifacts, and going back from a binary to source is expensive. Typically, this is when we install packages using a package manager, or when we use prebuilt VM images.
And even if those binaries are signed, if we rely on the signature, we include the signer in our trust domain. Then there's the third case, which is even worse: the situations where we cannot choose what is actually running inside our TCB. This is, for example, the case when we have some hardware compatibility layer running below our guest OS in the CVM, or when we are not able to run customer-provided firmware in the public cloud. Okay, so let's talk a bit about the consequences. Every software vendor we include in our trust boundary could potentially run an attack on us — for example, by delivering malicious reference values, that is, reference values for a malicious binary. It's just really difficult for us to check what these values stand for, and in the end we have no insight into what is actually running on our system. So a simple solution could be: we build everything from source, right? Source is good — we can audit the source. But usually we are not the consumers of the things we build; we're not the end users. And as a consequence, there's one remaining software vendor in the trust boundary, and that's us. So that's not good either. The actual goal here is to provide attestable systems for the end user, and reproducible builds can help us do this. Malte will now continue and tell you about reproducible builds. Thank you. So, let's quickly talk about what reproducible builds actually are. The basic idea is a software development practice where third parties — anyone — can take the same inputs and produce the same binary output. And this part about being independently verifiable is really important to us. Let's take a small step back and look at our perspective: we are building a lot of software that is supposed to run inside enclaves. For example, we're building a full Kubernetes distribution with OS images and containers.
And we really don't want people to have to trust us just because we're reputable. We want people to take the stuff we build, look at the source code, verify it, and rebuild the binaries. Only then — only if they can rebuild the same binaries — do they arrive at the same measurements, and then they know they can trust us. So in a perfect world, this is what we would like to have: we take the source code, put it into a function, and get out the reference values. But as you will see, this is sadly not the reality today. Looking at this more closely: you have the source code, then you have some kind of build process, and what you get out is binary artifacts — the firmware, the kernel, anything that goes into the user-space applications. And from these, you derive the hashes or other reference values used for remote attestation. In reality, this is already where you start running into problems, because sometimes the software itself is not open: you cannot rebuild it if the source code is not public. Sometimes this is where you just have to stop. But if you're lucky, the source code is actually available — and that's when you run into a whole different set of problems. If you want to build the same firmware and the same kernel and the same user space and everything else, you notice that your build doesn't actually depend on just the source code. It also depends on timestamps and randomness and inputs that you didn't know you had, and it depends on tools and specific versions of them. So let's say you actually manage to get all of this under control. Then you can still run into the situation where you get the same firmware and everything else — the whole stack, the whole TCB is the same — and you boot it in a trusted execution environment, and still the evidence that you extract is different.
And this is often the case if you include anything in the measurement that is not part of the code but is actually dynamic, like a timestamp at boot or the instance ID of your virtual machine. In that case you basically have to run a policy engine on the verifier side. So this can be solved, but it's also really annoying. Next, let's quickly look at who's already doing good work in this field. First is the AWS UEFI firmware, which is used today to run AMD SEV-SNP virtual machines. It's really nice: it's just EDK II OVMF firmware with some patches, but they also provide the full build system, so you can just download it, rebuild it from source, and actually get to the same measurements. Another example is Constellation — that's the stuff we build. There, we actually provide every container image, every tool, the whole operating system; everything can be rebuilt from source, and it's all reproducible. Then there's also the Confidential Containers cloud-api-adaptor with its peer pods images: they now have an option to build images with mkosi that are now also mostly reproducible. And we also have a GitHub repository where we basically wrote down all of the steps needed to take a general-purpose Linux distro and get reproducible builds for it as well. It's documented, and we show all of the steps we took, so you can play around with that — I think it's a good starting point. That's the repository, if you want to have a look. So now, some concrete help if you actually want to do this — this is for building OS images in particular. First of all, you need to pin your build tools. If you don't do that, tomorrow you will have a newer version of a tool, and you will get a different result. And what we noticed is that with something like Nix, we can pin all of the build dependencies in a very nice way.
We were also able to patch a lot of the tools in Nix so they actually become reproducible. For example, we had a tool — mkfs for FAT partitions — that was not reproducible, and we could make sure that the version in Nix actually creates reproducible output. The second thing is about everything you depend on — the libraries of the software you're building, or binary packages if you have to include them in your image. First of all, you want to pin them: you want to know in advance the hashes of everything that will be a dependency. And it's not done with just that — you also have to make sure they are available in the future, so you have to archive them and keep them available. And you also need a mechanism to actually update your lock files, because if you just pin them, you will accumulate a lot of security vulnerabilities in the future. And then it goes on: you really want to build every piece of software in a sandbox, because otherwise you don't actually know whether your build is reproducible — it could depend on something that won't be there in the future. So use a build system that does this: there's mkosi for building OS images, there's Nix and NixOS, which are really great, and there's Bazel, which also uses sandboxes. That will eliminate a whole class of issues. And then you also really want to restrict build actions, install actions, or any other kind of logic to only perform deterministic steps. For example, I think the CoCo project was using HashiCorp Packer, and that has the issue that it can run arbitrary steps, which means it could, for example, run apt-get install — and then you basically have no idea what version of something will be installed. The same applies to Dockerfiles. So just use something that only does what you want. So, this was our talk. There are some important things we want you to take away and think about: learn about reproducible builds.
We want to provide an open software stack for CC, and we want to enable the community to reproduce the reference values that we put out into the world, so we can remove ourselves from the trust boundary. So, thank you. Thanks a lot. I have a bit of a philosophical question, related to the relationship between reproducible builds and build provenance. In the last talk there was a question about SBOMs, and this is of course something of increasing importance because of the focus on supply chain security in general. There are also people working on build provenance, where you have build hermeticity and a record of how the software was built. That also gives you some guarantees about how you end up with a certain set of reference values, even if the build is not fully reproducible — because from the provenance, you know what goes into the recipe. Do you have any thoughts on the pros and cons of, or reflections on, these two related topics? Yes, definitely. First of all, if you're able to have reproducible builds, you basically already have an SBOM, because it must be the source code plus anything that's locked in there — the SBOM is already there. And then, if you have an SBOM, how do you trust the SBOM? If someone just gives you an SBOM and it's not signed, it could be fake. And if someone does create a trustworthy SBOM, it probably needs to be created in a confidential VM or something like that — so then you've made the whole problem a lot more complicated. Whereas if you can just use reproducible builds, the problem is simply fixed. The question was whether this also solves the problem of pinning the toolchain — that is, whether you can trust the toolchain that bootstraps the whole system.
Yeah, I think you can bootstrap yourself from nothing, and I believe the Nix project has some kind of bootstrapping where they do exactly that. So that's it.
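The pinning-and-archiving discipline described in this talk can be sketched as a tiny gatekeeper: every dependency's hash is known in advance, and anything that doesn't match is rejected. This is a toy Python illustration; real lockfile formats (Nix, Bazel, language package managers) differ.

```python
import hashlib


def pin(artifacts: dict[str, bytes]) -> dict[str, str]:
    """Build a lockfile: artifact name -> SHA-256 of its content,
    recorded at the time the build is defined."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in artifacts.items()}


def accept(name: str, fetched: bytes, lockfile: dict[str, str]) -> bytes:
    """Accept a (re-)downloaded artifact only if it matches its pin.

    If the upstream archive disappears or is silently replaced, the
    build fails loudly instead of becoming unreproducible.
    """
    digest = hashlib.sha256(fetched).hexdigest()
    if lockfile.get(name) != digest:
        raise ValueError(f"{name}: hash mismatch, refusing to build")
    return fetched
```

This only covers the pinning step; as the talk notes, you still need sandboxed builds and deterministic build actions on top of it.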
Increasing Trust and Preserving Privacy: Advancing Remote Attestation
Next we have Thomas from Linaro and Ionuț from Arm, and I think it's going to be a great end of the day, so looking forward to it. Well, hi everyone. So, this is a talk about remote attestation, because we think remote attestation is at an inflection point: it's becoming increasingly available and used, and with any new technology, when it comes to the fore, you have to consider different aspects, societal as well as technical. So we're here to talk about this — possibly interesting things. My name is Thomas, as Fritz said, and this is Ionuț. The ghost of Hannes is here with us — he couldn't come to Brussels, but he's here in spirit. Yep, that's us. Okay, so I wanted to start with this timeline that tries to capture some of the more relevant events in the history of remote attestation, starting from the theoretical underpinnings with the DDSA paper from the fine people at PARC in 1983. Then you have to wait some 15 years before the research trickles down into industry. At the end of the century, the first industrial consortium is formed to actually define what a trusted computing architecture is, in terms of behavior and in terms of the interfaces it needs to expose. So we have TCPA formed, which then morphs into TCG, the Trusted Computing Group — these are the guys responsible for producing the TPM 1.2 and TPM 2.0 specs, among other things. The first decade of the 2000s is driven by trusted computing use cases, because TPM has a strong attestation story, bound to the idea of using the TPM as a root of trust for reporting. Then enter the second decade, and you have AMD SEV and Intel SGX cropping up. This starts the confidential-computing-driven decade, where you have the first, second, and finally the third iterations of the architectures, which culminate in SEV-SNP, Intel TDX and Arm CCA.
And you have a few other interesting events in that period. You have the RIoT paper from the Microsoft guys, which fully articulates the ideas that were in the DDSA paper — so thirty-odd years later, you finally have the DICE ideas, not just on paper but in code. You also have PSA attestation from Arm, which is an attestation scheme targeting IoT platforms, like RIoT as well. So attestation primitives start to enter that area too. And then you get into the 2020s, and here is where we see some kind of maturity in terms of the standards that are actually coming to the fore — not just standards for data formats, like RATS, which was mentioned before and is coming out this year, but also software standards. The ConfigFS-TSM ABI that the Linux kernel has just upstreamed is one very, very concrete example of standardization in that space, in the software space. So we are here — as I said, we're probably at an inflection point. The primitive is increasingly available, not just in the confidential computing space, although CC is a very prominent area that drives this. You have use cases in IoT, you have use cases in TCB remediation, and it's also cropping up in user devices, with interesting societal fallout. Basically the idea is that, as Dave Thaler said, every authentication use case is also an attestation use case: wherever you have the need for authentication, attestation — which is effectively a stronger authentication primitive, a stronger identification primitive — is something that could be used to either reinforce or supplant your previous mechanism. So that's where we are.
And yeah, as I said, when you have these new technologies, you need to look at the bigger picture and try to understand the implications that their use has on the wider ecosystem. One of the interesting things here is the centralization risks involved with attestation; another is privacy. Let's start with centralization. If you looked at the RATS architecture picture in the previous talk, you saw that the verifier is at the very center of the image. And that's not just visual bias — it really is a choke point of the architecture, where all the message flows are intercepted. It's also where the decisions are made, because the verifier box has a verifier owner attached to it, and the verifier owner is the one with the power to decide who talks to whom — which attester has the right to talk to which relying party. So it's gating the information flow, and therefore it's a very powerful entity. The risk here is associated with monopoly, right? There are situations where, if you don't look carefully at your design and your architecture, you slip into these potential centralization risks — which we have seen, in a way. I don't know whether you followed it, but Web Environment Integrity is something that exploded last summer. It's the cautionary tale — the perfect story of vertical integration, where you have a monopolist actor that takes care of the whole thing and, well, it creates problems. The point here is that centralization can be tackled, we think. The RATS architecture carves out the roles in a way that cuts along tussle boundaries.
So you can actually remodel the roles in a way that, for example, moves the verifier function towards the user, in a user-centric way. But not for all use cases is this rearrangement of roles possible, because sometimes you would end up in a conflict-of-interest situation or something like that. So maybe one idea is to run the verifier as a neutral entity, a multi-stakeholder entity. Like Let's Encrypt: that's what they did when they democratized the X.509 world by creating a multi-stakeholder consortium that runs the Let's Encrypt function, which is one example of countering this kind of centralization. Yeah. Privacy is another aspect. All the message flows go through the verifier, and the verifier has to see the claims to do the reference value matching. Therefore, you know, it sees everything. So the potential for abusing this position is great, because PII is maybe not in the evidence directly, but can be indirectly obtained from it. So this is a risk. There are things in the toolbox; there are basically two ways to deal with this. One is to inflate your anonymity set, either with cryptographic primitives — group signatures and stuff like that — or using methods like anonymization in the hardware, for example creating batches of identical devices, like FIDO does, like Arm CCA does in certain configurations. The other thing is to reduce the claim set — what you need to expose to the outside world — through claim reduction and other patterns like selective disclosure, etc. So things are there. So these were the societal aspects. This one is instead the technical aspects. We're transitioning from a situation where the solutions were experimental — we were mostly in research mode.
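As a toy illustration of the claim-reduction idea just mentioned — the claim names below are invented for the sketch, not taken from any real attestation scheme — filtering the evidence down to what a relying party actually needs might look like:

```python
# Hypothetical full claim set an attester might produce. The serial number
# is identifying information that should not leak to every relying party.
evidence = {
    "hw_model": "board-x1",
    "firmware_version": "1.2.3",
    "boot_state": "secure",
    "device_serial": "SN-000123",  # PII-adjacent claim
}

def reduce_claims(claims: dict, allowed: set) -> dict:
    """Claim reduction: expose only the claims a relying party needs."""
    return {k: v for k, v in claims.items() if k in allowed}

# The policy only needs boot state and firmware version, so the
# identifying serial number never leaves the device.
disclosed = reduce_claims(evidence, {"boot_state", "firmware_version"})
```

Selective disclosure schemes generalize this by letting the attester cryptographically prove the omitted claims existed without revealing them; this sketch only shows the plain filtering step.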
Now we need to move to a different approach, a more engineering-oriented, more structured approach. We think we have some suggestions to make, and I'll let Yonis take over — sorry for taking so long. Hey. Okay. So I want to talk to you a bit about the IETF and why we think it's a good venue to try to standardize things relating to remote attestation. So first off, let's look at some of the IETF principles that form the core of its mission, and why we think these are relevant to the hacker crowd here at FOSDEM. The first is openness: an open process, so everyone can get involved and can read the standards that are being worked on. And this includes not just technical folks, but also members of, let's say, civil society who have things to say about what is being standardized or drafted. The second is technical expertise, or competence, meaning that the IETF only works on things that it has the competence to speak to, and it will listen to technically competent input from whatever source there is. The third principle is that of a practical ethos — rough consensus and running code — so trying to base all our standards on our engineering judgment and our real-world experience. More pragmatically, it means that all the standards need to come accompanied by some code for verification, and hopefully multiple implementations that are interoperable. So let's look at attestation in the IETF. I think the RATS working group has already been mentioned, and the major milestone that was achieved about a year ago: the Remote Attestation Procedures architecture document, from which this diagram is taken, shows the roles involved in making remote attestation usable. And the RATS working group is there to essentially standardize around this diagram — around the roles, mechanisms and data formats inherent in it.
But if you want to look at remote attestation as an authentication mechanism, then we need to go beyond RATS and this diagram. We need to look at cases where the attester and the relying party are trying to interact over different protocols like OAuth, TLS, EST, stuff like that. So let's start by looking at credential issuance — in this case, I mean, for example, X.509 certificates. The Enrollment over Secure Transport and certificate management protocols are central to public key infrastructure, and they allow an entity to ask a registration or certification authority to generate a certificate. And a recent requirement from the CA/Browser Forum has put in place a need for the RA or CA to have the entity prove the security state of the key that's being certified. So that's why we're trying to integrate remote attestation to make this happen. The way remote attestation works here is: the verifier sends a nonce to the entity, the entity uses that to generate evidence and packages it up in the CSR, and then the RA/CA can get an attestation result back and decide whether it wants to trust the entity and issue the certificate. The identifiers there point to the places where you can find more information about how this all works. If we look at ACME, it's again certificate issuance, and as you can see, the diagram looks pretty much the same. The only difference is that the evidence is carried in a different format, one defined by the W3C — the WebAuthn format. Just to highlight the fact that we're pretty open and pragmatic about what we use: if there's something ready, then we can just use that. The second type of credential that we care about, in this case, is OAuth, where a client might want to get an identifier and perhaps some credentials from an authorization server. Again, pretty much the same diagram.
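The nonce-evidence-verdict round trip described above can be sketched as a toy simulation. This is not the real CSR encoding or a real evidence format (a real scheme would sign structured claims, e.g. an EAT token, with a hardware-held key); the MAC construction and all names here are invented for illustration:

```python
import hashlib
import hmac
import os

DEVICE_KEY = b"device-attestation-key"   # stands in for a hardware-held key
TCB_STATE = b"fw=1.2.3;boot=secure"      # stands in for measured boot state

def attester_make_evidence(nonce: bytes) -> bytes:
    # Evidence binds the verifier's nonce to the device state; in the real
    # flow this blob would travel inside the CSR to the RA/CA.
    return hmac.new(DEVICE_KEY, nonce + TCB_STATE, hashlib.sha256).digest()

def verifier_appraise(nonce: bytes, evidence: bytes) -> bool:
    # The verifier knows the reference values, so it can recompute what
    # evidence from a healthy device should look like for this nonce.
    expected = hmac.new(DEVICE_KEY, nonce + TCB_STATE, hashlib.sha256).digest()
    return hmac.compare_digest(evidence, expected)

nonce = os.urandom(16)                    # verifier-chosen freshness
evidence = attester_make_evidence(nonce)  # embedded in the CSR in the real flow
ok = verifier_appraise(nonce, evidence)   # positive result -> RA/CA may issue
```

The nonce is what prevents replay: evidence produced for an old nonce fails appraisal, so the RA/CA knows the security state is current.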
And then if we move on to secure channel establishment protocols like TLS, these are quite different because of their symmetry compared to credential issuance, and we've tried to preserve that. In the diagram here, you can see one type of flow where the server is the one attesting itself, but you can have the same on both sides: both the client and the server can attest themselves. They can use either attestation results or evidence as credentials, and they can use these credentials instead of PKI or alongside PKI. So obviously we're dealing with some sensitive stuff here, and we want to make sure that our specifications are as secure as possible. The way we do this is, obviously, we use our experience with these protocols, making sure that they're secure, and we use implementations to drive testing and make sure that we catch any bugs. But we can't just rely on that, because we can't do properly thorough testing that way. So recently we've been integrating formal verification into our work, trying to prove that the security properties we care about are upheld by our designs. And actually, in the IETF we have a new Usable Formal Methods proposed research group to take care of this more broadly. So I want to leave you with one message, which is: please join us. Please join us in drafting these standards, implementing them, and making sure that they work properly in the real world. Yeah, and we tend to lurk around in the RATS working group and the CCC Attestation SIG. Thank you. Okay, I'll repeat the question. Is there a reference implementation for this? I think there probably is. Yes, so the question was, for the ACME integration of remote attestation, whether there is example code or a reference implementation. I think there probably is — I think I've seen a demo from the person who was drafting it. But yeah, we can get in touch.
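The symmetry point above — either peer may present an attestation result as its credential, instead of or alongside a PKI certificate — can be sketched as a relying party's acceptance check. The credential dictionaries, the `"affirming"` status string and the trusted-verifier list are all invented placeholders, not a real wire format:

```python
# Verifiers whose appraisals this relying party is willing to trust
# (hypothetical identifier).
TRUSTED_VERIFIERS = {"verifier-1"}

def accept_peer(credential: dict) -> bool:
    """Accept a peer that presents either a classic certificate chain
    or an attestation result appraised by a trusted verifier."""
    if credential.get("type") == "x509":
        # Normal PKI path: the chain validated against our trust anchors.
        return credential.get("chain_valid", False)
    if credential.get("type") == "attestation_result":
        # Attested path: trust the verdict only from a verifier we trust.
        return (credential.get("verifier") in TRUSTED_VERIFIERS
                and credential.get("status") == "affirming")
    return False

# The diagram's flow: a server attesting itself to the client...
server_ok = accept_peer({"type": "attestation_result",
                         "verifier": "verifier-1", "status": "affirming"})
# ...while a classic certificate still works alongside it.
legacy_ok = accept_peer({"type": "x509", "chain_valid": True})
```

Because the check is just a property of the presented credential, the same function serves both directions, which is the symmetry the TLS integration tries to preserve.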
Introducing Incus
Hello. So, yeah, I'm Stéphane Graber. I'm the project leader for Linux Containers — I'm just switching to the right screen here, there we go — and I'm one of the Incus maintainers. I was also the former project leader for LXD when I was working at Canonical. So, I'm going to go through a tiny bit of history first and then get into what Incus is and what you can do with it. The LXC project itself was created way back in August 2008 by IBM. That's the original Linux containers runtime, and it has been used kind of everywhere, including the original version of Docker and some other places at that point. Linux Containers itself — the organization — was created back in September 2014, and the LXD project was announced by Canonical in November 2014. LXD then went on for a while, until a lot of things happened in 2023. On July 4th, Canonical announced that LXD was going to be moved out of the Linux Containers community project and into the Canonical organization itself. The next day we noticed that all non-Canonical maintainers had lost all privileges on the repository, so only Canonical employees were left maintaining it at that point. Then a few days later I left Canonical, so that happened. Then on August 1st, Aleksa Sarai, who was the openSUSE packager for LXD, decided to go ahead and fork LXD as a new community project called Incus. A few days after that we made the decision to include Incus as part of the Linux Containers project, effectively giving it the spot that LXD once had. Incus 0.1 was released on October 7th, and we've had another four releases since then. Lastly, just as a bit of an early Christmas present, Canonical decided to go ahead and re-license LXD to AGPL, as well as require everyone to sign a CLA to contribute to LXD. The consequence of that for us, as an Apache-2.0 project, is that we cannot look at anything happening in LXD anymore.
We can't take any changes from LXD anymore, so Incus is effectively a hard fork at this point. So, that's the history. Now, back to what this thing is actually all about. Incus is a system container and virtual machine manager. It's image-based, so you've got a pretty large selection of distros — there's going to be a whole slide about that a bit later. But yeah, it lets you, cloud-like, immediately create instances from any of those images. The system container part means that we run full Linux distributions. We don't run application containers, we don't run OCI right now, we don't do any other kind of stuff. The containers are really a full Linux system that you then install packages into the normal way. Everything is built around a REST API with a pretty decent CLI tool. That REST API also has other clients; we'll go through those in a tiny bit. Incus has great support for resource limits, so you can pretty easily limit CPU, memory, disk, network, I/O, whatever you want. It's also got extremely good device passthrough to both containers and virtual machines, so you can do things like passing GPUs, attaching virtual TPMs, sharing your home directory, or doing a whole bunch of other kinds of sharing and passing devices through into containers and virtual machines. It also supports all of the expected stuff: it does snapshots, it does backups, it's got a variety of networking options, a bunch of storage options, all of that. It can also create projects as a way to group a bunch of instances together. For authentication it even supports OpenID Connect, which is kind of the go-to standard these days. And for authorization, we support OpenFGA, the open fine-grained access control project. That gets you, as the name implies, pretty fine-grained access control. There are also a number of web interfaces you can use on top of that.
So here you've got one of those, which is actually the LXD web interface, and it runs perfectly fine on top of Incus. And yeah, that's one of the options there. As far as what you can run, well, there are a few options you can see up there. Incus is indeed all based around images. We build images for pretty much all of the major Linux distros and even some of the not-so-major, and we build everything on both x86 and Arm. The vast majority of them are available for both containers and VMs; we've got a number of them that are just for containers. And then, because we do normal VMs, you can also run Windows, FreeBSD, whatever else you want inside of a virtual machine. All right. So let's do a first quick demo of the standalone Incus experience. If I switch over there, the first thing we'll do is just launch an Arch Linux container. There we go. So we've got that. Then let's do another one — let's do Alpine, the Edge release. So just do that. And this is obviously at risk of blowing up at any point, because I'm on the FOSDEM Wi-Fi. For Ubuntu, I was planning on doing a VM. So let's do a VM instead of a container: you just tell it you want a VM instead. That's pretty much all there is to it. And with that running, we can see that the two containers already started and got their IPs and everything. The VM is still booting up, so it hasn't got its IP yet — it does now. If you want to get into any of them, you can just exec any command. You can get a shell into Alpine. You can get a full bash inside of Arch. And you can do the exact same thing with the virtual machine — you don't need to get a console and log in and everything; there's an agent automatically in our virtual machines, so you get to just immediately access them as if they were containers. That works really well. You can create snapshots. So if you want a snapshot, you do snapshot create on the Arch one. If you don't give it a name, it just picks one for you.
So we can see there's now a snapshot that we can restore or just keep around. There's also the ability to do automatic snapshots with a cron-type pattern, with automatic snapshot expiry — you can do all that kind of stuff. Now let's create a custom storage volume. So we'll just do storage volume create default, and let's call it demo. Then we're going to add that as a device to, let's say, Arch. So just call it demo: it's a disk, it comes from the default storage pool, and the volume is called demo. Configure this. There. And I forgot to say add. There. Now if we go inside of that VM, again, we see there's a new entry there. And then, passing through home — hey, that's my home directory. So that's very nice and easy. It's automatically doing virtiofs, 9p, all that kind of stuff. It talks to the agent to trigger the mounts. Our goal is for virtual machines to feel like containers, as much as we can, and having that agent in there really makes that super easy. And for the last party trick of this demo, let's launch images: openSUSE Tumbleweed desktop KDE, as a desktop image, and also tell it that I want to see the VGA console as soon as it starts. So when I do that, it actually gets me a second window, which I need to drag over here. And let's try to full-screen that thing. Maybe. Yeah, full screen doesn't work. Okay. But we can see it boot, and it's going to get us eventually into a KDE session. Not sure why the resize didn't work. Oh, okay. Maybe the desktop there? I saw a mouse pointer that was about the right size. Nope. Okay. So it is starting KDE there. So we even have some desktop images: we've got an Arch desktop image with GNOME, we've got Ubuntu with GNOME, and we've got openSUSE with KDE. We're not building too many more of them, mostly because they're actually very expensive to build as far as resources go — the build time, and distributing pretty large images. But it's there to show that this works.
And if you want to run your own, you can totally do that. All right. Let's just go back to the slides. Come on. There we go. So, other things you can do: this thing is effectively your own local tiny cloud, and it's all built on a REST API, which also makes it very easy to integrate with other things. And "other things" here means some of the pretty usual tools you might be dealing with. Terraform and OpenTofu you can integrate with very easily; we've got a provider we maintain ourselves that you get to use. Ansible has got a connection plugin that you can use to deploy any of your playbooks directly against virtual machines or containers. And if you want to build your own images as derivatives of ours, you can use Packer as a very easy way to take our images and inject whatever you want in there. There are a bunch of other tools — LXD especially had a lot of third-party tools that could integrate with it, and a bunch of those are now migrating over to Incus or supporting both. So that's a list that's growing very rapidly. Other things you can do: Incus exposes an OpenMetrics endpoint with details like the resource consumption and usage of all of the instances running on it. So you can integrate that with Prometheus to scrape that data and keep it on the side. It also supports streaming logging and audit events to Grafana Loki, so you effectively have your events and your metrics in the same spot, at which point you can use the dashboard that we've got in the Grafana store to get something like that up and running. So that's pretty convenient as well. If you don't like typing the name of your remote every time, you can switch to a remote: you just do a remote switch, at which point, if I do a list, it goes straight to that remote and you don't need to type it every single time. That cluster is actually using a mix of local storage and remote storage.
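The metrics endpoint mentioned above serves the plain-text Prometheus exposition format, one sample per line, which is why Prometheus can scrape it directly. A minimal sketch of what such output looks like and how it parses — the metric names below are illustrative, not a guaranteed Incus metric set:

```python
# Illustrative OpenMetrics/Prometheus exposition output (hypothetical names).
SAMPLE = """\
# HELP incus_cpu_seconds_total CPU time used per instance.
incus_cpu_seconds_total{name="arch",project="default"} 42.5
incus_memory_usage_bytes{name="arch",project="default"} 104857600
"""

def parse_samples(text: str) -> dict:
    """Parse exposition-format lines into {metric-with-labels: value}."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comment lines
        metric, value = line.rsplit(" ", 1)  # value is the last field
        samples[metric] = float(value)
    return samples

metrics = parse_samples(SAMPLE)
```

Prometheus scrapes this on a schedule and stores each labelled metric as a time series, which is what the Grafana dashboard then graphs.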
So it's got Ceph for HDDs and SSDs, and it's got a local ZFS storage pool as well. And on the network side, it uses OVN. So it actually has all of that stuff in place. And if we look at the remote list from earlier, we can see that it uses OIDC for login, so it's also using the authentication bits I mentioned. Now, if you wanted to launch, say, a Debian 12 instance on that thing, you can do it the perfectly normal way, and that's just going to instruct the cluster to go and do it. In this case, thankfully, it's running back home with very fast internet, so I don't need to wait for the FOSDEM Wi-Fi to download stuff for me. It's actually downloading the image and unpacking it, creating the volume — on Ceph in this case — and then starting the instance. I didn't even tell it where I wanted it, so it just picked wherever made sense. Which is actually funny, because if you use an image and you don't specify what architecture you want, you're going to get one of the architectures. In this case I didn't tell it I wanted Arm or Intel; there was more capacity on Arm, so I got an Arm instance. We can go and check that easily. I know that the server it picked in that list is an Arm server, so if I go in here and look at the architecture, it's aarch64. All right. Let's just look at things here. And I wanted to show the dashboard as well. I'm just going to drag that particular window over. Where is it? It is here. I had it open — I've got way too many windows open on my laptop. Okay. So it's Grafana. It's loading. It's loading. And this dashboard — okay, I'm just making sure it looks at the right cluster before I show it to you. So there we go. Yeah. So this is actually the dashboard for the cluster I was just talking to, the one I was showing. It's looking at the demo project. So we can see the top offenders as far as resource usage and that kind of stuff, and we can look at graphs for network, for storage.
And we can even drill down on specific instances and see what they've been doing. So you could expand an instance and go look at its usage. It also gets all of the events from Loki, so we can see the instance creation and any commands like that. That shell I got is actually right here, and any errors and stuff are also all captured right there. So that's the metrics side of things. All right. So where do you get to run this thing? Well, quite a few distros have packages now for Incus, and as I've mentioned, Debian and Ubuntu will have packages in their next stable release. We're also looking at doing a long-term support release of Incus itself. Right now you might see version numbers like 0.4, 0.5 and be a bit scared by that. You need to remember that this is a derivative of LXD, so one of our zero-point releases is just as stable, if not more stable, than a five-point-something on the LXD side. We've just not gone past zero because we're waiting for the LTS of our other projects within Linux Containers, which we will do in March. That's going to be the LTS of LXC, LXCFS and Incus all at the same time, and we usually try to line up versions. So Incus is going to jump from 0.6 probably straight to 6.0 — that's what's going to happen with the LTS. As far as other features we're looking at adding: with the release of Linux 6.7, we now have bcachefs in the Linux kernel. And it's pretty interesting for us on the Incus side, because it's very close to what ZFS or btrfs do, which we already support. So we're looking at adding a bcachefs storage driver for people who want to start using that. On the cluster side, I mentioned that we support Ceph right now, which is a pretty good option, but also a bit heavyweight. A bunch of people could instead do something different, whether that's using a shared NVMe-over-fabrics drive or some old Fibre Channel SAN they might have gotten on eBay or something like that.
So we're looking at adding distributed LVM as a storage driver, which effectively means that if you have multiple systems that can all see the exact same block device somehow, then you can use LVM on top of that, with a distributed locking manager, so that all of the different machines in the cluster get to use it. That kind of solves the issue of "how do I use my old SAN at work", but it can also work in some other cases — I think someone is looking at using it with DRBD, for example. We are also looking at adding OCI application container support. That's potentially a bit of a surprise for some folks, but we feel that these days the application container space has stabilized enough, and we've got enough users who, for some reason, are literally running Docker inside of Incus to run a few specific applications, that this particular use case we could support natively. We're not looking at competing with Kubernetes, with all of the service mesh and auto-distribution stuff — that's crazy stuff, they get to do that. But we would like it to be possible for you to run two or three small containers for your IoT software or whatever. That's what we're looking at doing there. And on the networking side, we're using OVN for distributed networking, which works pretty well, but we're now also working on another feature of OVN called interconnect, which allows for having multiple clusters and interconnecting their networks. So you can have instances on multiple networks, on multiple clusters, and then connect those together. And there's an online demo service: you've got 30 minutes with Incus pre-installed in there to just take it for a ride, play with it for a bit, see if it's something that's interesting to you — and if it is, then you can go and install it for yourself. And that's it. We can try some questions. We've seen it's a bit difficult.
So please, everyone, remain quiet if there are any questions, so we can try and hear them. Is there anything? Oh, you have it there. Okay. So I'm quite sure some people are interested in the differences between this and the alternatives. Compared to what? Sorry, I didn't catch that part. Oh, VMware. Okay. Well, it's a lot cheaper. Yeah, for anyone who's using VMware professionally and has followed the news recently, let's say your VMware bill is not great right now. So this is a viable alternative in many cases. It doesn't have all 50,000 components around it and all that kind of stuff. But if you are primarily using it as a way to get a cluster, create a bunch of VMs, maybe create some containers, run whatever OS you want on there, this will do it just fine. So it's definitely an option there. It's kind of in the same vein, at that point, as a Proxmox or some of those other options; it will work just fine. One difference is that it's not a distribution — you can install it on any system you want. It's obviously all open source, and yeah, it's a pretty viable alternative, and we do have a lot of people who are using VMware that are looking very closely at this as a potential way out of VMware right now. So the question here, to better understand the terminology, is what the difference is between a system container and an application container. Yeah, so the difference is that a system container will run a full Linux distro. It will run systemd, it's going to have udev running, you'll be able to log into it, install packages, reboot it. It's really designed to be a stateful, long-running type of thing.
Whereas your application container is usually — ideally — a single process, or a process and some of its children. It's really more designed around delivering a specific application, and most often it's going to be quite stateless, with the idea that you can just nuke the thing and replace it at any point. So they're two different concepts. Some people like the idea of having a system where they actually get to select what packages are installed and the exact config and stuff, and some people prefer not to care about any of that and just have something pre-installed — and that's what an application container gets you. That's why having the ability to run some application containers directly on Incus alongside the system containers will, I think, be quite interesting: if for a specific application it's easier to just get their pre-made thing, then you'll be able to do that, while still being able to run everything else. Yep, so we do have a bash completion profile. I absolutely hate shell completion for some reason, so I don't have it on my machine and can't show you. Can application container runtimes provide system containers, for those who are interested in that? Yeah, I mean, it is possible to get application container runtimes to give you a full system container — nothing prevents you from deciding that the application you run in the container is an init. That's definitely possible. It's just not what they were really meant for, so it feels kind of less polished, because that wasn't their goal. Things like being able to dynamically pass new files in, dynamically attach devices, get whatever number of shells you want, or interact with the outside world through a Unix socket inside of there — those kinds of things didn't make much sense for application containers at the beginning, and so some of those features will probably be lacking on that side.
I tend to — I mean, I usually like having one tool for the job, picking the right tool for the job. Effectively, if you really care about running a bunch of application containers, use one of the application container runtimes, whether that's Podman, Docker or one of the others. One thing that's actually interesting is that you can totally run Docker or Podman inside of an Incus container. So that works: you can run your normal Ubuntu, Debian or whatever inside of an Incus container, then install Docker or Podman in there and run some containers alongside whatever else you might be doing in that container. So that's something that works fine. I think we're probably out of time at this point. So thanks a lot, everyone. I'm probably going to be outside for a tiny bit if anyone has more questions and things. But yeah, thanks a lot.
Using chroots in a single Linux Container as an alternative to docker-compose
All right. So next up we're going to have Aiden, who is going to be talking to us about running multiple images in a single container. All right. Ready? Okay. All right. Hi, everyone. I'm Aiden McClelland. I work for a company called Start9. This project here is a little bit of a work in progress, but it is something we are trying out, because we have a somewhat less common use case for our containers and we decided to try something a little different. So first, some background. We develop an operating system called StartOS. The purpose of this operating system is to allow end users without technical expertise to run their own home servers. The idea is to bring the desktop experience to home server administration, so that we can bring a lot of these self-hosted applications to a wider variety of people, on their own hardware, without them having to learn everything you need to learn about Docker and the hosting tools that we're all familiar with. So as part of this, we have a bit of a different use case than is generally intended for things like Kubernetes or Ansible or a lot of these tools that are designed for deploying corporate infrastructure at scale. We're really looking at a single host machine that the user wants to be very low-touch: they don't want to spend a lot of time configuring their applications at a granular level. So, a lot of these applications come with these docker-compose setups, right? You have a main image that has your application code, and then you have things like databases and reverse proxies, etc. Commonly this is deployed as a docker-compose file, and what that does is create a bunch of containers that now have to be managed by the OS, and by proxy by the user. So what we've always tried to do with StartOS is maintain this idea of one container, one service.
And what this allows us to do is reduce a lot of the complexity of managing a bunch of different containers, and it also provides a single IP address and virtual interface on which the application is running. So when you're doing all of your network mapping, all of it can be mapped to a single virtual IP address that can then be reached either from within the subnet on the device or exported through the host. This also means that you can define resource limits on a single-container basis, as opposed to having to deal with a group of containers and manage that as a group — a cgroup with subgroups, right? A final reason we did this is our package maintainer scripts, which we prefer to run inside the contained environment, and these package maintainer scripts are written in JavaScript. So we run a service manager in the container that reads the package maintainer scripts and is then able to set up all of our subcontainers — our sub-filesystems — from there, and execute our actual binaries. Okay, so the question is: why do people want multiple containers at all? Oftentimes you can take a single application image and install all of the software you might need, but in practice this is not as easy for the service developer. A lot of times we have people coming to us asking: hey, I want to be able to use an off-the-shelf Postgres image, I want to use an off-the-shelf Nginx image — I don't want to have to use the package manager of my container's distribution to install and manage that. So that's the number one use case we have. It also allows you to run applications with different bases — say you have one on Debian, one on Alpine — all together. And then the other reason you might want multiple containers is that you can isolate the subcomponents of an application from each other and also set resource limits on individual application subcomponents.
If anybody has additional reasons why you might want separate containers as opposed to a single container for an application, I would love to hear them, but these are the reasons we came up with. So, our solution. We cover the first use case using chroots. Number two, as far as we can tell, works for the most part, but that remains to be teased out. This does not allow us to isolate the subcomponents of our application from each other, or create resource limits on individual application subcomponents as easily; those have to be managed by manually tuning resource limits on the processes inside the container. So, yeah, we've ultimately decided that those last two items aren't really necessary for our use case. Ultimately, a single application is where we define our sandbox. So while sandboxing separate parts of an application from each other has some security benefit, we've decided it isn't worth the complexity. So we decided to do this with LXC. Why LXC as opposed to something like Docker or Podman? LXC is a lot more composable. It allows us to pop the hood on a lot of the subcomponents of container technology and manage them more manually. So we can, for example, easily manipulate the container rootfs at runtime: even with an unprivileged container, that container can communicate with the host and have its root filesystem modified very easily. We use shared mount propagation for the rootfs, which allows the host operating system to easily manipulate that filesystem. And unlike some other container tools, you can perform commands like chroot and mount from inside an unprivileged container, which a lot of other technologies don't allow. So, to put together a service, an application, we have effectively a single rootfs image that all of our applications share.
This rootfs image is just a base image that we use for all of our containers; we use Alpine right now. It loads a Node.js application that runs the package maintainer scripts and then launches the actual daemons inside their chroots. It communicates with the host using a JSON-RPC API over a Unix domain socket, so there's bidirectional communication between the host and the service manager in the container, and then it can run the actual application code inside the chroots. So, the host API: what it does for the container is perform some manipulation of the container's root filesystem, and this allows creating overlaid images in the same way you might when creating a container. All we do is create a rootfs image with an overlay filesystem and attach it to the container in a way that it can chroot into. And then we have a bunch of other APIs that these packages can interact with, mostly for integration with the end-user experience, and for integration with other services and applications on the host in a way that the user might have to intermediate. And then we also have a set of APIs designed for hassle-free networking. If you have some application bound to a port, you can attach that port to a Tor address, to a clearnet address, or just to a LAN address so that it can be accessed from your local area network. The host OS manages all of the certificate management, either through Let's Encrypt or through a host root CA for the LAN communication, because obviously you can't get a Let's Encrypt certificate for a .local address. Okay, so the service itself runs a very basic API that receives commands from the host. When the application is running, it can receive an initialization command, it can start or stop the service, and it can shut down the service entirely in order to kill the container.
And then it also invokes all of the various package maintainer scripts, such as editing user configuration, installing the service, or updating the service. All of those are package maintainer scripts that get called from the host. Okay, so when we actually launch a binary, the package developer defines in some JavaScript (we have well-typed TypeScript APIs to describe this structure) what binaries to launch, which image to launch each binary in, and where to mount its persistence volumes. We have a series of persistence volumes that are mounted to the container and can be attached to any path within these sub-filesystems. Then it defines any environment variables or arguments, all the standard things you would specify to launch a program. For each command, similar to how you would define a systemd service file, you can define all of these arguments and then any dependencies or health checks associated with your service. And then for each of these commands, the in-container service manager will mount an overlaid image for the requested image ID into the container. It will then take our special directories, proc, sys, dev, and run, and bind-mount them inside, so all of the chroots share the same proc, sys, dev, and run. And then it will run the command in the chroot. Okay, so here is an example of a package maintainer script. I don't know if that's actually visible to everyone. Are you able to see that? Okay. Well, I suppose I can just talk about it. Effectively, you have a fairly simple JSON configuration where you define your image ID, your command, your arguments, and then some health checks defining when this thing is ready, as well as some dependencies. So if you don't want to launch a given daemon until another service is ready, you can just specify that, and it won't launch until its health check passes.
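Since the slide is hard to see, here is a hedged sketch of what such a definition might look like. The field names and values are illustrative only, not the actual StartOS/s9pk schema (the real definitions are written in TypeScript against the project's typed APIs):

```json
{
  "image": "postgres",
  "command": ["postgres", "-D", "/var/lib/postgresql/data"],
  "mounts": { "data": "/var/lib/postgresql/data" },
  "env": { "POSTGRES_PASSWORD": "example-only" },
  "health-checks": {
    "ready": { "command": ["pg_isready", "-U", "postgres"] }
  },
  "depends-on": {}
}
```

The idea is exactly what's described above: image ID, command, arguments, a health check that gates readiness, and dependencies that delay launch until another service's health check passes.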
So all of this is available on GitHub if you want to check it out. This particular example is at start9labs/hello-world-startos on GitHub. There should be a link on the talk page. So, time to do a little demo of what I have working so far. Let's see if I can get my shells over here. All right. So here I have an instance running, hold on. There we go. Here I have an instance running StartOS, and I've already installed a package. The package in this case is Nextcloud. This Nextcloud package contains two images. It's got the Nextcloud base image, which also contains the Nginx server that fronts the PHP for Nextcloud, and then we have Postgres, which is our database persistence layer for Nextcloud. So, we've attached into this container, and I'm going to go ahead and basically run a REPL inside the JavaScript engine here, and do my imports as well. What this has done is connect us to our JSON-RPC APIs, both from the host into the container and from the container into the host. Then we're going to create a couple of overlay images. First we'll do our Postgres image. What this does is tell the host, hey, I want to mount this Postgres image into the container, and the host says, okay, here you go, here's the path at which I have attached it. I'll do the same thing for the main image. And there we are. I'll go ahead and define a couple of environment variables. Okay. So I have a set of temporary hacks here that will later be managed by the actual container service manager, mainly around permissions of the container. I still need to get shiftfs working properly, because what LXC does is map the UIDs within the unprivileged container to UIDs on the host, and so when we mount stuff into the container, we also need to perform that same mapping.
So we're not doing that yet, but I have a set of ownership changes that will manage it. And then all we have to do is launch our application. So I'll launch Postgres first. And here we go: we have Postgres running inside a chroot, inside the container, and it looks like it's ready. And then now I can also launch Nextcloud. So here we have both of these applications running within the same process namespace, the same cgroup, the same container, but they're running from completely separate images. And that's all I have to show you. I think we can open up for Q&A. Thank you. So, we have considered the idea. Right now we actually haven't found it necessary: the chroot seems to be sufficient for the sandboxing we need to do. As far as we can tell, the technology is at a point where it wouldn't be too difficult to do containers in containers, but realistically we haven't found it necessary. That's all. So, I think you're asking, as a package developer, how we distribute your application. If you have a service that you want to distribute to our users, to people running StartOS: the company Start9 runs a marketplace, but we just have a very standardized package format, and packages in this format you can host on any website. If you want to charge for it, you can charge for it. Ultimately the APIs are generic enough that you can run your own marketplace to offer whatever services you want, using whatever protocols you'd like to gate access to those s9pks. As a service developer, in general, if you're publishing to our official registry, that means you have a free and open source project that you're looking to distribute for free, but that does not stop you from running your own paid marketplace. One more question. I'm sorry, I couldn't hear that. Ah, the resources for our application?
Yeah, so the resources are managed at the scale of the entire application, using the configuration of the outer LXC container that everything runs inside of. So you can just modify that LXC config. Well, we modify that LXC config automatically based on the host APIs. Thank you.
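For reference, per-container limits of the kind just described are plain LXC configuration keys; a minimal sketch (the values are arbitrary examples) on a cgroup-v2 host might be:

```
# Fragment of an LXC container config: limits apply to the whole
# container, and therefore to every chrooted service inside it.
lxc.cgroup2.memory.max = 2G          # cap the container at 2 GiB of RAM
lxc.cgroup2.cpu.max = 200000 100000  # quota/period: about 2 CPUs' worth
```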
Soft Reboot: keep your containers running while your image-based Linux host gets updated
Welcome everyone to our next session. Thank you very much. Hello. Good afternoon. My name is Luca. By day I work as a software engineer in the Linux systems group at Microsoft, where I am responsible for the operating system that runs on the Azure infrastructure. By night I am involved in various open source projects: I'm a systemd maintainer, a Debian developer, a DPDK maintainer, and a bunch of other stuff that I consistently forget about. So I'm going to talk to you about this new feature we added in systemd in the middle of last year called soft reboot. And yes, it's a new type of reboot. We're going to look at how it's implemented first, and in the second part of the talk we're going to look at two demos showing it running and how it can work with containers. If you were at All Systems Go, you probably saw the first half of the talk, while the second half is new. So first of all: why? Why do we want a new type of reboot? Don't we have enough already? The answer, of course, is performance. Rebooting means that if you have some services running on your system providing some functionality, during that window of time they are interrupted, and people don't like interruptions. That is the main motivation for this. I also know that there are some update systems that require double reboots; I've been told, for example, that DNF offline upgrades require double reboots. So by shortening the time it takes to do this, we can save something there as well. But the main use case is the first one, avoiding interruptions. When you go from a reboot to a kexec, you save time because you cut away the time it takes to reset the firmware and the hardware. So the next obvious step was to cut away the kernel's time too: if the kernel is not being updated, you don't need to reboot it and do all the device initialization and everything else. So we came up with the idea of soft reboot, and this is what it does.
It just reboots the user-space portion of your Linux system. Again, the goal is to minimize disruption as much as possible. So this pairs very well with image-based Linux. We've been talking about image-based Linux systems for a couple of years now, and this works very well with them, because in such a system you have a single root filesystem, which is usually read-only, and then you have a UKI with your kernel and initrd, and these are distinct components. They are usually updated independently. So with a soft reboot, when you don't update your kernel, you can update just your rootfs. Now, this also pairs very nicely with kernel live patching. On a production system you can fix bugs in your kernel without rebooting by using kernel live patching, and this pairs nicely with that because you can use soft reboot to update the user-space portion of your image when you have bugs or security problems or whatever. Again, we are replacing the entire user space atomically and moving into a new root filesystem. It's not only for image-based systems, though. This can be used for package-based OSs too, because, for example, you cannot restart the D-Bus daemon or broker on a Linux system: your system will explode if you do that. So by doing a soft reboot you can save some time when your D-Bus has some security problem that needs to be fixed or whatnot. So let's look at how it is implemented. As far as the kernel is concerned, nothing is happening. Everything is business as usual. It doesn't see anything; it's all the same session, the same boot. So, for example, we still have some problems to solve, some papercuts. If you do journalctl --boot -1, you will not see the previous soft reboot; you see the previous full reboot. We have ideas to fix this on the to-do list, but it's one of the few papercuts left to solve. Now, as far as user space is concerned, everything goes away. It's a normal shutdown, so systemd goes through the usual phases.
It starts a shutdown target, a soft-reboot target that conflicts with everything else, so all the services get stopped. And then, instead of giving control back to the kernel with a syscall to reboot, it just re-executes itself into the new root filesystem, bypassing the full reboot. You can do this in place, so your soft reboot happens into the same root filesystem, or you prepare the new filesystem ahead of time in /run/nextroot. We allow this because preparing the new root filesystem, positioning all the mounts across, and whatnot takes some time, so you can do it ahead of time without having to interrupt all the services by doing it inline. You can prepare your next rootfs in /run/nextroot and then call the soft reboot, so that you transition very quickly to the next rootfs. And you can also prepare any additional storage you have: if you have an encrypted partition for /var, for example, you can prepare it ahead of time so you don't need to redo the decryption steps, which again take some time and may require user interaction, maybe access to a TPM, or whatnot. And again, the kernel stays the same, so no configuration changes there. So in systemd 254 we added a new verb, systemctl soft-reboot, to do this, with an equivalent D-Bus API, and in the next version we also added some new signals that tell you, yes, a shutdown is happening, and it's of type soft-reboot. So we are cutting time away from the reboot. Is that all we can do with this? Not quite; we can go further. Given that systemd doesn't exit but re-executes itself, you can carry over any state you want across the soft reboot. For example, the file descriptor store, if you're not aware of what it is: it's a way to store file descriptors inside PID 1, which then gives them back to your service when it starts. And by the way, all the links on the slides point to documentation; I will put the slides online.
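A minimal sketch of the file descriptor store being described (the unit and FD names here are illustrative): the unit opts in with FileDescriptorStoreMax=, the service pushes an FD into PID 1, and after the soft reboot it gets the FD back through the usual $LISTEN_FDS protocol.

```ini
# myservice.service (illustrative fragment)
[Service]
ExecStart=/usr/bin/myservice
# Allow this service to park up to 16 FDs inside PID 1
# across restarts and soft reboots.
FileDescriptorStoreMax=16
```

At runtime the service would send FDSTORE=1 (plus an FDNAME=) together with the socket FD, via sd_pid_notify_with_fds() in C, and recover it at startup with sd_listen_fds_with_names().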
But basically your service can say, hey, I have an active TCP connection, take the FD for me and keep it there. Then your service goes down, the soft reboot happens, you come back, and you get the TCP connection back and can pick up where you left off. Because the kernel just stays running, the connection is not interrupted; it's just buffered, and there's some delay of course, but it doesn't have to be re-established, for example. It's not just sockets: you can use this with a memfd, for example, for any buffer, any state that is expensive to recalculate. You can store it in a memfd and get it back immediately. And you can do this for the network stack too: in networkd we have options so that when it goes down, it leaves the interfaces configured, and when you come back in the soft reboot, in the new filesystem, you don't have to reconfigure your network interfaces, which again can be a bit slow. And then finally, we transition /run across, a state pseudo-filesystem, a tmpfs, so that if services have state in /run they find it again when they come back. This is not recursive, though, and /tmp is reset completely, because that's a scratch area. So by doing this we can accelerate the time the services need to get back to fully functional after a soft reboot. But is that all we can do? And what does any of this have to do with containers? This is the container devroom, after all. So here's an idea: some payloads are completely independent of your rootfs, for example containers, but also portable services. If you don't know what a portable service is, I suggest you check them out; they're awesome. They're a way to attach a system service to your OS that runs from a different root filesystem: it comes with its own image, but it's fully integrated with your system services. It's quite cool. And this applies not only to those: it covers any of these services, these containers, these payloads that are independent of the root filesystem.
So, can we let them run during this soft-reboot process? The answer is, well, yes, why not. The configuration for that is a bit complex; it's linked there, and I won't show it here, we'll see it in a demo later. But basically you can configure a system service so that systemd will not kill it or stop it when the soft reboot happens. So the service keeps running while the rootfs is updated under it. The network stays accessible, we keep it up, the kernel doesn't go away, so there are no conflicts on devices, and the same goes for the disks. So for this kind of payload we go from some interruption to zero interruption, which is quite nice. Of course there's a catch; there's always a catch. These payloads really need to have nothing to do with the root filesystem, because, for example, if you keep any file descriptor open to the old root filesystem, you will keep those resources pinned and they won't be freed, so you use more memory or whatever else. So you need to make sure they are disconnected, and also remember that other parts of the OS are going away, for example the bus. The documentation covers this, but you need to change the way you use the bus, via the sd-bus library for example, to automatically reconnect when it comes back up. It's usually not done, because the bus never goes away normally, but if you have one of these payloads that survives the soft reboot, you need to change how you use the bus. It's very simple, and it's described in the documentation there. Now, one thing I will look at in the near future is whether, when we have actual bind mounts from the host rootfs into the services, we can automatically refresh them after the soft reboot. I'm halfway through that; it's not all done yet. So let's see this happening with Podman. Because I am a coward, I'm not doing a live demo; I'm showing a recording. This is a Debian image, Debian testing, and it's running Podman, some version. Podman has this thing called Quadlet, which generates
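The configuration being alluded to, roughly, and hedged since the exact knobs depend on your systemd version (255 or newer), looks like this in the unit of the payload that should keep running:

```ini
# Illustrative fragment: keep this unit running across a soft reboot.
[Unit]
# Don't pull in the usual shutdown ordering dependencies.
DefaultDependencies=no
# Don't stop this unit when isolating the soft-reboot target.
IgnoreOnIsolate=yes
# Exempt the unit's processes from the final SIGTERM/SIGKILL phase.
SurviveFinalKillSignal=yes

[Service]
ExecStart=/usr/bin/my-independent-payload
```

As noted in the talk, the payload must also hold no open file descriptors into the old root filesystem, or it will keep those resources pinned.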
some systemd services for your container. Now, this is not exactly what Podman generates; it's a bit different, like most stuff here, and we'll see why in a moment. You can see down here it runs a very important production use case, sleep infinity; a typical production use case, everybody uses it. To show what the actual difference is: this is a demo, and to put it together (I am not a Podman developer or user, I just thought it was cool to make it work) I hacked it a bit. So Podman gives you a systemd service; I changed it, and I'll show you the diff here. These settings up here are necessary to make the container service survive the soft reboot. This is a bit of a hack, and if this were supported by Podman natively it would have to be solved in a better way, but basically Podman ties the container to the root filesystem, to the /var directory, so I had to comment that out so that they are not tied together and it doesn't get shut down. And then there are four more things down here that look suspicious, and we'll see what they are in a moment. Now, this is simple to explain: I start this container, this sleep service, and it takes a second because it downloads the image in the background, and there are some complaints that we don't care about. The way Podman works, when you run it as part of a systemd service, is that it correctly creates some sub-cgroups: there is the payload cgroup node, and then there is an additional sidecar control service that runs as part of the same cgroup, in a sub-cgroup dedicated to Podman. Now, the reason for those four settings is that this conmon binary comes from the root filesystem, and we need to make sure, because if we just leave it like this, it would keep the old root filesystem pinned, and we don't want that. So my hack to make the demo work is that the service actually runs on a different root image: it's another image with Podman inside.
So this binary, the Podman binary that runs, comes from this image, not from the system; that way they are independent and not tied together. And then we disconnect a couple of things. So now we have that prepared, and there's another thing: you saw the two cgroups there. The way systemd marks a cgroup for survival of a soft reboot is by setting this extended attribute here. Now, because Podman gets a delegation for this cgroup, which is the right thing to do, we do not touch the children: we do not set this extended attribute automatically for those two payloads, and if Podman wanted to support this natively, it would have to do that when it sets up the cgroups. Of course, again, this is hacked together, so I'm doing that by hand, setting the extended attribute on that cgroup there so that systemd won't kill these processes while they are running. And now we can finally type soft-reboot, and we see all of user space going away. Shortly thereafter we come back and we get a shell, and then we check; there are some errors in the SSH session that we don't care about, so just ignore them, I was too lazy to hide them. And then we can see that the sleep is still running, and the conmon monitor as well, and it's the same PIDs, the same processes. The container kept running while we shut down all this stuff. All the system services have been shut down and restarted, but the container just keeps going without interruption. So yeah, again, this was very quickly put together. I am not a Podman developer; if the Podman developers are interested in supporting this, or maybe the LXD developers, I'm happy to help them, but this is a hacked-together demo. I have another one which I think is a bit more interesting. Azure Boost, if you're not familiar, is an offload card that is installed in every Azure node: the Azure nodes that run your virtual machines have this ARM64 offloading card that runs the operating system that I work on.
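What the demo does by hand, setting the survival marker on Podman's delegated sub-cgroups, would look roughly like this. Treat the cgroup paths as illustrative, and check the attribute name against your systemd version's documentation:

```
# Mark the payload and sidecar sub-cgroups so that systemd's final
# kill phase skips them during the soft reboot.
setfattr -n user.survive_final_kill_signal -v 1 \
    /sys/fs/cgroup/system.slice/my-container.service/payload
setfattr -n user.survive_final_kill_signal -v 1 \
    /sys/fs/cgroup/system.slice/my-container.service/supervisor
```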
It's called Azure Boost, and I'm showing here a demo of this, recorded in production on an Azure Boost node; it pauses for a second now. A colleague of mine recorded this a month or so ago, my thanks for recording it, and then I asked, hey, can I show this in public at a conference? This had never been shown before, only internally at Microsoft, super secret stuff, and surprisingly they said yes. And I went: what? Now I actually have to do it. Unfortunately I had to blank out the host names, because this is a real node somewhere in the fleet, in a datacenter in the US, and I couldn't show the host name, which identifies the node. So you will see these blanked-out bits; I apologize for that, but I had to hide them. So, what we are showing here, let's start it going again: in Azure we are running this Microsoft operating system. It is just a machine, it's ARM, some version of kernel 5.10, and we have what we call agents; these are containers running as portable services. Some of these are critical for the actual customer VMs: if they go away, the network is interrupted and you cannot make new connections. The net agent is the critical one: if it goes away, the network goes away. The other one is a local agent that provides some local service, so it doesn't matter. So we configure the first portable service to survive the soft reboot, and the second one we just let go away and disappear. Now we attach to the update agent that does the soft reboot. You can see the portable service is just attached as a new image, so we are moving to a new image here; in the background there you can see the serial console going away. Now we switch to a new SSH session, because of course SSH is not a critical service here, so it went away; it should come up in a second.
And we reconnect, and we check and compare the OS versions before and after, and the kernel version before and after, and check on the status of these containers to see that they are actually running again. So yes: the version ends in 03, and it was 01 before, so we did update the rootfs. It's a read-only, verity-protected rootfs, so we updated it as one block. The kernel is the same; I didn't cheat and hide a reboot there. The kernel is exactly the same, same build and everything. So let's check on how these containers are doing. We can see this is the critical one, the net agent, and we compare the PIDs before and after: they are the same, the same processes, 197 and 2099. It's the same process, the same PID; it kept running through the soft reboot while we changed the verity rootfs image underneath it. The other one was restarted, because it's just a non-critical service, so we let it be restarted. So yes, this is it for the demo. I hope this sneak peek at Azure production machines running down in the fleet was interesting, and we have five minutes for questions. Any questions? I cannot... So, checkpoint/restore: we don't, and that's a very different thing, right? With checkpoint/restore, you checkpoint and then you come back to the same state of the process, but you still have an interruption of service while you do your update. This is different: this aims to let us update the root filesystem with zero interruption for these payloads. So it's a bit different, and we don't have plans for that at the moment. These are fairly complex payloads, so we haven't looked into CRIU at all. Any other questions? No questions, everything clear? I don't believe that. There you go, there we go. I know that guy; one second. So: excellent question. The demo was recorded in production with a custom image loaded.
Thank you. The demo we showed was on a production node with a custom image containing this new feature. We are deploying it sometime this year, so it's not yet deployed at scale; we will see, I'm sure it will explode in horrible ways. But for now, the main thing we found was D-Bus: reconnecting to D-Bus was the main thing that broke in the services, but it's easy to fix. That was the main thing so far. Other questions? Going once... I can't hear; shout. Shout, or use the microphone. Yes, so they are on the local system, as I showed before; you need to prepare them ahead of time. From here? It can work, it can work. Thank you.
What's new in Containerd 2.0!
Alright, let's get started. I am unmuted, yes. So yeah, this will be fairly quick, just an update on containerd. You're either here because you're interested in containerd, or because it's too hard to change devrooms and so you're just going to sit here and hear about containerd. Hopefully you're somewhat interested. I was having a bit of FOSDEM nostalgia, like 2018, talking about the first year and a half of containerd getting to 1.0. Now we're on the cusp of our 2.0 release, our first major version bump since we started the project. First, a few stats in case you're unaware. Containerd adoption has been growing a lot; some of that's probably due to the Dockershim deprecation in Kubernetes. This is from Datadog's annual report; the CNCF and Sysdig also put out reports. They all come out with different numbers, so believe whichever one you like. This one was positive for containerd, so I used it; you can probably find another one. Maybe more important to the project is actual community growth: people actually contributing, getting involved in the project, becoming maintainers. This is a crazy eye chart from the CNCF; you can see Kubernetes way up there at the top. Again, there's some magic math being done here about how many PRs and issues are flowing through your project and how many people are contributing, and it comes out to containerd being in the top 15 or so projects. One of the cool things, and I think this captures the last nine months, is that we've had a lot of new maintainers, reviewers, and committers from many different companies, and some independents, so that's awesome to see as well. The cloud providers you might be using use containerd underneath their Kubernetes services, and some other projects do as well. The thing I wanted to focus on is that one of the reasons I think containerd continues to grow as a project is that we've built in extensibility in different directions.
I'll talk about three main directions in which containerd is extensible, or how you can build around it. One is on the client end, and one of the newest representatives of that is nerdctl, written by one of our maintainers, Akihiro Suda, who you've probably heard of because he's written a hundred different projects in the container space, and any time you use rootless containers, it's probably because Akihiro started that work many years ago. Akihiro wrote nerdctl, which now gives you a Docker-style command line for containerd. The other way we're extensible is in snapshotters. If you remember Docker's graph drivers, these are the way your container's filesystem is actually stored, and overlay is obviously a very common one that many of the container runtimes use. We have built-in ones, which I'll talk about, but you're also able to extend that with a remote snapshotter, and that's an area where we see a lot of growth, with people writing their own snapshotters for their own unique use cases. Then, directly below containerd, is the layer we call the shim layer, which drives an actual OS-level runtime. Obviously many of you have heard of runc or crun; that's the common Linux adapter, if you will, that drives the set of syscalls you need to namespace your process. The containerd shim API, again, is extensible, and there are many different shims available, which we'll talk through. So those are the three directions. There are also some other pluggable interfaces that I don't have time to get into today, but these are all ways that, as we go into 2.0, we continue to see people extending containerd. I'll spend the least amount of time on clients. We've had this simple tool in the project since the beginning called ctr.
It was never really meant to be a production client for containerd, just an easy way to poke at the API, get a list of images, a list of processes. nerdctl is much more recent and has its own set of maintainers who are marching along with new releases, either bringing better alignment with the Docker command set, all the flags, all the features, or adding features they can reach because they're built directly on containerd, like some of the lazy-loading snapshotters, image encryption, and container image signing; all of those are built in to nerdctl. crictl is from the Kubernetes community; it drives the CRI API, of which containerd has an implementation, and obviously CRI-O and others have implementations of that API as well. And then of course the Docker project is also built on containerd. There are some interesting developer platforms built around these clients. Rancher Desktop and Colima allow you to drive the Docker Engine or containerd, and we have a team at Amazon who built Finch, which is just built on nerdctl, BuildKit, and containerd; again, that allows you to do macOS, and I forgot to add Windows here, because we just launched Windows support this past week. Again, these are ways that people are extending the capability by building new clients around containerd. The other area I mentioned was snapshotters. There's a bunch of built-in ones; many of you will recognize things like overlay, device mapper, and btrfs. But this pluggability of having proxy plugins to a remote snapshotter means two things: you're not tied to containerd's release life cycle, and you don't have to get your snapshotter merged into the containerd code base. You can write your own, run it as a separate process with a GRPC listener, and containerd will call you for the snapshotter API: prepare, diff, unpack, the operations that are required of a snapshotter. So there are three main ones, and all three of these have now been donated into the containerd GitHub organization.
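Registering such a remote snapshotter is a small addition to containerd's configuration; a sketch (the plugin name and socket path here are made up) looks like:

```toml
# /etc/containerd/config.toml: register an out-of-process snapshotter
# that containerd calls over GRPC via the snapshotter API.
[proxy_plugins]
  [proxy_plugins.mysnapshotter]
    type = "snapshot"
    address = "/run/mysnapshotter/snapshotter.sock"
```

Clients then select it by name, for example `ctr run --snapshotter mysnapshotter ...`.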
They started as external projects and have now been donated. They're all related to lazy-loading file systems: if you've played around with being able to run a container without having to pull the entire image, say a 10-gigabyte image with scientific data sets or some complicated ML model, these lazy-loading snapshotters will only pull the files that are needed to start the container. stargz, OverlayBD, and Nydus are all in that family. Then there are two more: SOCI, built by one of our teams at Amazon, is Seekable OCI, so again a lazy-loading snapshotter, and that's open source; GKE also has a feature called image streaming built around the same ideas of lazy loading, but, at least to my understanding, that's not an open-source project today. So again, these are ways that people are extending containerd by having their own snapshot technology and plugging it into containerd. Allison mentioned shims, so OCI runtimes; there are several options there. We have runc built in, you can also use crun, and we test that in our test suite for containerd, and there are also some experimental Rust and FreeBSD runtimes. But you can also have your own shim outside of the containerd core project, such as the one for Windows maintained by Microsoft, hcsshim. runwasi is one of the more active projects in the containerd namespace; it's a shim where you drive containerd through the same API and clients but actually run Wasm workloads instead of a traditional Linux container. And there are micro-VM based shims, trusted execution environment shims, and Kuasar, I think that's how you pronounce it, a shim that deals with a new feature of containerd 2.0 called sandboxing, which we'll talk about in a minute.
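Pointing containerd's CRI plugin at an alternative shim is also just configuration. A sketch, assuming the runwasi wasmtime shim binary (conventionally `containerd-shim-wasmtime-v1`) is installed on the host's PATH; the runtime name `wasmtime` here is arbitrary:

```toml
# /etc/containerd/config.toml -- CRI runtime entry for a non-runc shim;
# containerd resolves the runtime_type to a shim binary on the PATH
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.wasmtime]
  runtime_type = "io.containerd.wasmtime.v1"
```

A Kubernetes RuntimeClass can then reference this handler name so that selected pods run through the alternate shim while everything else stays on runc.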
So again, those are three ways that I think have benefited containerd's growth: being able to plug in and enable features that don't have to be part of the main containerd code base, which lets people expand for use cases that maybe we don't even know about. This is the picture of where we currently are in the containerd life cycle: 1.5 is now end of life; we created 1.6 as a long-term support release, which, until 2.0 is released, doesn't have an official end date, but it will run at least another few years; 1.7 is in an active release cycle right now; and 2.0 should release in a month or two, based on our current set of betas and release candidates. So that's where we are as far as releases. This isn't new news, but 1.6 is our first LTS release, supported, as it says here, at least until February 2025. Of course, it's always a trick to maintain some integrity about how things get into the LTS, and one of the reasons that's tricky is that Kubernetes may add features to the CRI and we need to implement that CRI endpoint, so it sort of looks like a new feature. We're doing our best to maintain compatibility with Kubernetes without opening up 1.6 to a lot of new features, so that it stays stable and takes mostly just backports of fixes and, obviously, anything security related. We have this idea that late this year we'll make that backport criteria a little bit stricter, so that people can rely on a long stable release without a lot of changes to its feature set.
1.7, therefore, is the end of our 1.x release cycle. What you'll see here is that we merged a lot of new features into 1.7 before we released it, but marked them all experimental so that people could start to try them, and in 2.0 all of those become supported features. I already mentioned the sandbox service and the API around it: we had this extensibility at the shim layer, but with micro-VMs and other ideas about how you treat the sandbox and how you configure it, several of our contributors came up with the sandbox service, and there's a whole API around it; you can read a lot more about it via the PRs or the documentation that's been merged. It was a preview in 1.7, but it will be turned on by default in 2.0. In 1.7 there was a split, in that we actually had two implementations of the CRI, one based on the sandbox service and one on our legacy code; that goes away in 2.0, which will just have the default sandbox implementation. NRI is very interesting if you've ever played around with OCI hooks and the ability to modify the spec, say to insert a device before the container starts. The Node Resource Interface is our decided implementation for doing that safely, with NRI plug-ins that the administrator of your cluster can enable and give the proper permissions to. NRI was experimental in 1.7 and again will be fully supported in 2.0. Then there's the transfer service: if you think about commands like save or export an image, pull an image, push an image, in all our previous releases of containerd that was a client-side API, so your containerd client was actually doing those registry interactions. In 1.7, and of course in 2.0, this is now a service within the daemon, and for some use cases that was very important: the daemon handles credentials, the daemon handles the network connectivity to registries, and it also gives us a lot more tools for pluggability of
source and sink: say I'm trying to copy an image from one place to another, the transfer service gives you all of that in a configurable way. We also added user namespace support, which was a new feature coming down the line: containerd core had user namespace support, but it was gated on the CRI side; Kubernetes added a new API to the CRI, and those are now plumbed through, implemented, and supported in containerd. And we had a lightweight RPC mechanism for shims, and we've now added full gRPC support, which again was important for certain use cases people wanted. As I said, we're in the midst of our 2.0 release plan right now; I guess I didn't move that line over far enough, because it's February now, and we're just about to put out our first release candidate, so we're possibly a little bit delayed from our original thinking, but 2.0 will be final sometime this spring, and like I said, all those new capabilities that were in 1.7 will be final and supported in containerd 2.0. It was also our first chance to finally deprecate. We've been insistent on keeping a very stable API, so that people aren't surprised that the latest containerd release removed something, and you can see that over the years we've deprecated a lot of features, or at least marked them deprecated; 2.0 will be the chance for us to finally remove them and provide recommendations. One of our contributors added a nice feature where you can turn on deprecation warnings: you can run containerd 1.7, or even 1.6 LTS, and get notified of all the deprecated features you're using, to help you prepare for 2.0.
One of the things we were going to remove was support for our oldest configuration version, but then someone wrote a converter that automatically converts your configuration, so we won't actually have to break that, in the sense that you're not going to have to rewrite your config unless you'd like to; it's done automatically for you. There's still a lot we'd like to do that we're working on. I mentioned the new transfer service: the CRI is a plug-in implementation within containerd that uses containerd's APIs to do the work, so when the CRI says pull an image, the CRI implementation calls into containerd to do that, and one of the things we're trying to do is migrate that to use the new transfer service. That's in development, along with pluggability for shims themselves. Then there are two API-layer enhancements we're thinking about. If you think about Docker, Docker gives you a higher-level, HTTP-based API; if you've ever built a tool that uses the Docker API, it's at least nice in that you can say "run container", give it all the configuration information, and it just does it. When people come to containerd, they say: hey, you don't have the Docker API, what can I use that's similar? And we really don't: I have to create a container resource, I have to create a task, I have to start the task. So we're thinking about creating some of these abstractions, so that when people move to containerd they have a higher-level image service and container service. Those are things that, if you have ideas or concepts, we're open to them; these aren't things we've built yet, but we're planning to as we go into the containerd 2.0 time frame. If you're interested in contributing or getting involved, there are a couple of channels in the CNCF Slack that we hang out in, where we talk about new features or people ask us questions. We do have a live community meeting on
Zoom twice a month, on the second and fourth Thursdays; if that's bad for your time zone, let us know, though obviously time zones are always tricky to handle. And again, go to the repo, open issues, give us your ideas and pull requests. That's all I have, thank you.
Lift and shift: Modernising a legacy LAMP application with systemd-nspawn
So, next up is Martin, who is going to talk to us about lift and shift: modernizing a legacy LAMP application with systemd-nspawn. Hi, everybody. Welcome. The last time I spoke at this conference, a few years ago, it was in the microkernel dev room. It was a very small room. So the bigger the kernel, the bigger the room, I guess. I'm going to start with a little bit of backstory. One evening about a year ago, I got a phone call from a friend, a principal at a school, saying: Martin, I need help with something. Our sole IT person, who's worked here for 20 years, has decided that they're just going to go off to the mountains and leave, and they're off in about a month. I have no idea what state our systems are in; I know nothing about that. I need someone I can trust who can step in and help. So I originally came in as a consultant to look at what systems they had and figure out the next steps. I'm still there. It's still temporary. And I'm going to tell you a little bit about what I did there over the last year, concentrating on the containers. They weren't kidding when they said it was in a bad state. The critical application that the school ran on was running on one single server, along with a whole bunch of other stuff, pretty much everything else. And you can see here that that server basically dates back to 2009. Someone at some point tried to upgrade it from Debian Etch to Debian Lenny. They failed, or gave up, partly because from Etch to Lenny you had the transition from PHP 4 to PHP 5. I did a quick naive SLOC count of what's in /var/www/html: there are two hundred something thousand lines of PHP. It turns out that this person did not use source control, so there's a hell of a lot of duplication in there. And it's also very much a typical CRUD app, as you would design it 20 years ago: all very basic PHP with HTML mixed in, the worst possible thing you could have.
But at the same time, it's very simple as an application, which turned out to help us later. So my naive plan for salvaging this: try to extract as much business and technical knowledge from the author before they leave and never come back; then virtualize all the things and secure all the obvious attack surfaces. I mean, this was still running TLS version 1.0; it had Apache 1.3 exposed to the internet, the worst possible case. Then split off the business-critical system from all the other things running on that server, do that in a way that's as future-proof and maintainable as I can, all while keeping it running and not getting killed by 550 students and 100-odd employees during the school year. The first two steps were pretty obvious. They had some new hardware lying around, so I spun up a hypervisor and a bunch of VMs, put the physical server into VMs, and started splitting chunks off it. That turned out to be hard. So I eventually decided that I needed a way of reproducing this 15-year-old environment, in a way that I could then develop and maintain with modern tools, source control and so on. The nice thing here is that I found the Debian community have developed something called Debian EOL, which is basically Docker images of end-of-life Debian releases, all of them, going way, way, way back. You can use these images to run Docker containers, or to do whatever else you want with them. The nice thing about them is also that they're actually integrated with the modern infrastructure, pointing at archive.debian.org, so you can, as you'll see, install additional software and so on. I could probably have done this with Docker, but it doesn't really fit the bill, because this application, I mean, it's never going to be a 12-factor app with a bunch of microservices. I needed something more like FreeBSD jails or Solaris Zones. And I've previously used systemd-nspawn.
In fact, I use it today to run a bunch of my own infrastructure, which was originally a bunch of Xen PV VMs and has now been happily running for many years as systemd-nspawn containers. So you want something that can do full system containers and that's available, lightweight, and flexible. How do we get Debian Lenny from 2009 running, using these Debian EOL images, with systemd-nspawn? We need a couple of tools, skopeo and oci-image-tool, to get the images off the Docker registry and flatten the OCI image; you basically end up with a root file system. The reason I'm emphasizing reflinks here, which I didn't know about before, is that they're basically copy-on-write: you can use them to create a lightweight copy of an entire directory tree, which only takes up more space if you actually change things in it. So you try to run this, naively, with systemd-nspawn, and you find: bam, it segfaults. Thankfully, we actually get a helpful message from the kernel saying, ooh, you tried to do vsyscalls, but no, we don't do that anymore. We can fix that, that's fairly easy, and then we can see that, oh look, we have Debian Lenny running in a systemd-nspawn container. Okay, that's great, and if that was all I was going to tell you today, it probably wouldn't be very interesting. If all we want is /bin/sh, that's fine, but I want a full system, where I basically run the full SysV init inside the container to manage all the original LAMP-stack services and run the application. I want to integrate the container's networking with the host system's systemd-networkd, get a /dev/log in it, use user namespacing, and start and stop the container as part of the normal host system boot process. So I made a script for this; I extracted it out of my build scripts so that you don't have to. There's a link to it in the resources for this talk. Please take a look.
So this script basically gives you a Debian Lenny root file system with all the tweaks applied to let you do the steps described here. I spent quite a bit of time working that out, so I hope people will find it useful. With the root file system you get out of that script, you can boot the result like this. The important parts there are --private-users=pick, which turns on user namespacing, so your container root automatically gets a special user ID range mapped to it, which systemd-nspawn picks when that particular root file system is started; and you get a veth network talking to the host. --kill-signal=SIGINT we want so that when the host system, if you run this container as a unit, tries to stop it, SIGINT gets sent to the SysV init inside the container, which will actually interpret that as a system shutdown and shut down cleanly. So if you run that, you can log in on the console, and you'll see that yes, we can shut down the container with Ctrl-C. Then there are a bunch of gotchas. Networking: you want systemd-networkd, since it integrates very well, bar some problems. Obviously your host needs IP forwarding enabled. As I found out, or remembered, today while making these slides at the hotel, if you're doing anything at all in your forward chain, you need to make sure that forwarding is actually being accepted from and to the container interfaces. Another really interesting one: I'm still running a DHCP client inside the container, so that the container integrates with systemd-networkd and gets a network address assigned to it when it spins up. It turns out that old DHCP clients are actually picky about getting proper checksums back in their responses.
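The DHCP checksum problem just described is commonly worked around with the CHECKSUM target in the mangle table. A sketch in iptables-restore format; the `ve-+` interface glob matches systemd-nspawn's default veth naming, so adjust it for your setup:

```
*mangle
# Fill in UDP checksums on DHCP server replies (sport 67 -> dport 68)
# leaving the host towards the container's veth, so that old DHCP
# clients accept the lease renewals instead of silently retrying
-A POSTROUTING -o ve-+ -p udp -m udp --sport 67 --dport 68 -j CHECKSUM --checksum-fill
COMMIT
```

Without a rule like this, networking appears fine until the first lease renewal, which is exactly the mysterious failure mode the talk describes.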
So if you don't add that particular mangle rule, your networking will appear to work and then mysteriously stop when the DHCP lease expires and the client tries to renew it, gets upset, and you just see it renewing and renewing and nothing happens. Then, systemd-journald has a nice namespacing mechanism: it basically lets you spin up separate instances of systemd-journald, each with its own namespace, so the container logs, or the logs of different instances, don't mix with the host logs. It works, but I had to actually read the source code of the systemd main loop to figure out why, after you start it, it would just mysteriously say: oh, no clients, I'm going away now. The way to fix that, not described anywhere, is to add a drop-in and set your retention time to something high, and then it will just wait around until something connects to /dev/log. /dev/log you can then bind-mount into the container; that's fairly straightforward. Start-up and shut-down integration: systemd-nspawn comes with a default unit file, and you can customize it. There are some useful things you can do there, like adding a dependency on your journald namespace service so that everything nicely starts up and shuts down, and there's an example of what you can put in ExecStart if you want to use this particular arrangement. I actually did this, or the bulk of it, during the school holidays last summer, and the application has been running fine since then. I was quite surprised. I could talk a lot more about PHP and MySQL 5, but that would mostly just be ranting. One thing I didn't mention is that the application is actually running all in CP1250, and not only that, but originally the databases were all still running with MyISAM. So I ended up basically exporting the lot into SQL text files.
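The journald-namespace fix the speaker describes, keeping the namespaced instance alive until something connects, can be sketched as a per-namespace configuration file. The namespace name `lenny` and the retention value here are assumptions; check the journald.conf and systemd-journald@.service documentation for your systemd version:

```ini
# /etc/systemd/journald@lenny.conf -- settings for the journal
# namespace instance systemd-journald@lenny.service
[Journal]
# A high retention time keeps the instance waiting around instead of
# exiting with "no clients" before anything connects to /dev/log
MaxRetentionSec=1month
```

The container's /dev/log socket is then bind-mounted in, and the nspawn unit can declare a dependency on systemd-journald@lenny.service so the two start and stop together.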
Then I discovered that MySQL and PHP at this time didn't really understand character sets, so the database thought that everything was Latin-1 when it in fact wasn't. The way to fix that is, again, you export it to a text file, making sure that the database, or anything else, doesn't try to convert any of the data. Then you do a sed on the text file and just replace MyISAM everywhere with InnoDB and replace Latin-1 with CP1250, and it actually worked. It's still there; no data got corrupted. And it's 64-bit now, so it won't fall over in 2038. So yeah, I'll end this with a quote from a conversation I had in the autumn with my long-time friend Martin Sústrik, who asked: so, you spent the years before this working on OS research, with unikernels and Docker and the University of Cambridge and so on. What was more complicated: all that OS research, or the work you've been doing at the school over the last six months? And I said, well, definitely the work at the school over the last six months. And I still have 10 minutes. So, in fact, I guess, questions. It was quicker than I thought. Yes sir. This man here? Sorry? The -n option? Oh, ah yes. Okay, so the reason you can't do that, and in fact this is important and I sort of glossed over it here: that will only work, the journald integration will only work, if the distribution running inside the container is new enough. Debian Lenny from 2009 does not have journald, does not have systemd; it predates them. So this is all running good old SysV /sbin/init. None of the integration you'd expect, the fancy stuff you get today with systemd-nspawn and machinectl if you use the full interface. If you run a systemd distribution inside the container, your logging will just transparently get integrated with the host journal; likewise you'll get things like machinectl login, which gets you a TTY, a console you can use to log into the container.
We don't have that here because there is no systemd; all of this relies on there being systemd inside the container as well as on the host. It is exposed to the internet, but not directly. That was the first thing I did, way back before I started on all of this. Right, number two here, secure the most obvious attack surfaces: I stuck a modern reverse proxy in front of it.
vscode-container-wasm: An Extension of VSCode on Browser for Running Containers Within Your Browser
So, our next talk is going to be about... Hello, I'm Kohei Tokunaga from NTT Corporation. I'm a reviewer of containerd and a maintainer of BuildKit, and today I'm going to talk about an extension of VS Code on browser for running containers within the browser. This is the summary of this talk: on-browser VS Code lacks a Linux terminal running completely inside the browser, and the vscode-container-wasm extension enables running Linux-based containers and their terminal inside the browser. There are two options available for distributing containers to browsers: the first is pre-converting containers to Wasm images and distributing those, and the second is distributing OCI container images to browsers. So, there are several on-browser VS Code implementations in the community, but there is a limitation to their functionality: the lack of a Linux terminal running completely inside the browser. Users can edit code inside the browser but cannot run it inside the browser, and Linux-based development tools like compilers are also unavailable in the browser. One of the root causes of this issue is that browsers don't provide a Linux-compatible system, so Linux-based applications need to be ported to the browser. If the application is written in a language other than JavaScript, WebAssembly, or Wasm, will also be used for running it in the browser. But actually, porting apps to WebAssembly is not easy: Wasm lacks compatibility with the Linux system. For example, the binary format is completely different from existing common binary formats like x86 ELF, and the app might need to be redesigned for the Harvard architecture of Wasm; this might include eliminating fork- and exec-related calls from the application. Some of these issues can be mitigated by compilers' Wasm target support, but they still don't provide full compatibility with Linux. So, can we run an unmodified Linux terminal and dev environment inside the browser? This is where the vscode-container-wasm
extension can be used. This is an experimental VS Code extension for running containers inside the browser: the container and its terminal are available in VS Code on browser without preparing remote SSH servers or anything like that. It's implemented by leveraging CPU emulators compiled to Wasm; we will discuss that later. The workspace of the editor is also mounted at the /workspace path, so the container can refer to the contents of the workspace; for example, it can compile code stored in the workspace. HTTP and HTTPS networking is also available. The container runs inside the browser, so the networking functionality is restricted by the browser; for example, the set of sites accessible from the container is limited by CORS. So, how can container images be distributed to browsers? There are two options. Option A is pre-converting containers to Wasm images, and option B is distributing OCI container images to browsers. The first option, pre-converting containers to Wasm images, is provided by the container2wasm converter. container2wasm is an experimental converter of container images to Wasm images: it receives an arbitrary Linux-based container as input and outputs a Wasm image that runs the container on Wasm, so we can run containers in Wasm-enabled environments like browsers. As shown in the figure on the right, the converted Wasm image can be uploaded to any HTTP server accessible from the browser. To use it in VS Code on browser, you configure the workspace using the .vscode/settings.json file: you add the image location URL to that configuration file so that the extension can launch the specified container in the browser. The pros of this approach are that once the container image is converted to Wasm, it can run in any Wasm-enabled environment, not limited to browsers.
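A workspace configuration along these lines tells the extension where to fetch the converted image from. The setting key below is my recollection of the extension's documented option and the URL is a placeholder, so verify both against the vscode-container-wasm README:

```jsonc
// .vscode/settings.json in the workspace (VS Code settings allow
// comments); key name and URL here are illustrative assumptions
{
  "container.imageLocation": "https://example.org/debian-x86_64.wasm"
}
```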
For example, the container can also run on WASI runtimes like wasmtime. The cons of this approach are that pre-conversion is needed for each container: if you want to run many kinds of containers in the browser, all of them need to be pre-converted to Wasm, so it may add extra cost at development time. The second option for distributing containers to browsers is directly distributing OCI-compatible container images. If you use a container registry, that registry needs to allow CORS access, because it's accessed from the browser. But unfortunately, as of now, well-known public registries don't allow CORS, so you need to try it with a localhost registry with the CORS headers configured. Alternatively, you can also use a CORS-enabled HTTP or HTTPS server. In that case, the container image needs to be formatted as an OCI image layout; this is the specification for laying out image content on the file system. For example, you can get a tar archive in this format using the docker save command newer than v25, and vscode-container-wasm supports fetching an image formatted with this spec over HTTP. In either case, the image location needs to be written to the workspace's .vscode/settings.json file so that the extension can launch the specified container in the browser. The pros of this approach are that it doesn't require pre-conversion of the image, and an unmodified container image can be distributed to browsers. The cons are that, as mentioned, existing public container registries don't allow CORS as of now, so if you don't use the OCI layout approach, you need to prepare a CORS-enabled container registry, or users need to use something like a proxy to access the registries. And this is an example of running a container on github.dev; github.dev is an on-browser VS Code that lets us edit the code of GitHub repos in the browser.
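The OCI image layout mentioned here is a fixed on-disk structure defined by the OCI image specification; unpacked, the tar archive from docker save (v25 and newer) looks roughly like this:

```
oci-layout          # JSON file declaring the layout spec version
index.json          # top-level image index referencing the manifests
blobs/
  sha256/
    <digest>        # manifests, configs, and layer tarballs, content-
    <digest>        # addressed by their sha256 digest
```

Because every piece of content is addressed by digest under blobs/, a plain static HTTP server with CORS headers is enough to serve such an image to the browser.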
This slide shows an example of running a Debian container with GCC installed inside the browser; the workspace is mounted at /workspace, and HTTP and HTTPS networking is also available. And so this is a demo of this extension, and we use github.dev here. Okay, so here, this is the vscode-container-wasm extension, which is available on the Marketplace. And this is the settings.json file in this repo; this config file points to the URL of the Debian container converted to Wasm using the container2wasm converter, which is served on GitHub Pages, and we use that image in this workspace. And this is the terminal of the Debian container running inside the browser. And this is the code we are going to use in this demo. Currently, by executing a command of this extension, the extension loads the container image stored on GitHub Pages into the browser, and it just booted the Linux kernel and the container inside the browser with CPU emulation. We can now see the Debian shell in the browser. By executing the uname command, you can see this is an x86_64 Linux environment inside the browser. And the workspace of this repo is mounted at /workspace/, so you can see the files of this repo inside the browser, mounted on the workspace directory. In this container we have the GCC compiler, and we have a pretty simple hello-world C-language source file, so we can compile that C code inside the browser using GCC, and then run the compiled binary in the browser. So the entire compile-and-run cycle is done inside the browser in this demo. So, how does this extension work? The container depends on Linux to run, so this project runs both the container and Linux inside a Wasm VM in the browser. To enable running binaries of existing architectures inside the Wasm VM, we use CPU emulators compiled to Wasm: the Bochs emulator for x86_64 containers and TinyEMU for RISC-V containers.
So this extension launches all of that, the emulator, the Linux kernel, and the container, inside a Wasm VM in the browser. We also use microsoft/vscode-wasm as the Wasm host in the browser. This is a Wasm host integrated with VS Code, which allows the Wasm VM to access the terminal in VS Code and the workspace directory over Wasm-compatible APIs like the FD APIs. And how does mounting workspaces into containers work? As mentioned on the previous slide, we use vscode-wasm as the Wasm host, and it provides access to the workspace directory to the Wasm VM over Wasm-compatible APIs. The emulator running inside the Wasm VM recognizes the workspace directory via the Wasm APIs, then shares that directory into the guest Linux via virtio-9p, and that workspace is mounted at the container's /workspace/ directory, so the container can access the workspace at that file system path. The container can perform HTTP or HTTPS networking, with restrictions imposed by the browser. This is implemented by running the entire networking stack inside the browser, so an additional proxy outside the browser is not needed. This networking stack supports forwarding HTTP and HTTPS connections to the outside of the browser using the browser's fetch API. An HTTPS connection is terminated at the networking stack in the browser with its own certificate, and the connection is re-encrypted by the fetch API; so the container can access the outside of the browser via an HTTP/HTTPS proxy running inside the browser. There are some important restrictions from the fetch API, though: accessible sites are limited by the browser, so CORS restrictions apply, and some headers are uncontrollable from the container because they are entirely controlled by the browser. And vscode-container-wasm allows fetching a container image directly from a remote location without pre-conversion to Wasm; this is implemented by fetching and unpacking the container image in the browser.
The unpacked root file system of the container is mounted into the guest Linux via virtio-9p. And not limited to on-browser IDEs, we believe there are some expected or possible use cases for running containers on Wasm or in the browser. The first is interactive on-browser Linux-based demos; the second is on-browser development and testing, like this extension; and also sandboxed execution environments for containers, and application debuggers runnable in the browser, with record-and-replay debugging. There are some existing approaches for running unmodified applications on Wasm, and I've listed some of them here. The first is v86; this is an x86-compatible on-browser CPU emulator by Fabian Hemmer, and it supports a wide variety of guest OSes, including Windows, but it doesn't support x86_64 for now. And TinyEMU is a RISC-V and x86 emulator by Fabrice Bellard. It can run in the browser, and the container2wasm converter actually uses it for RISC-V emulation, but it doesn't support x86_64. And this project is still at a very early stage, so we expect further improvements. The first is performance analysis and improvement: we rely heavily on CPU emulation, so I think we need to analyze the overhead and improve it. A possible integration will be with elfconv; this is an AOT compiler of Linux AArch64 ELF to Wasm by Masashi Yoshimura, my colleague from NTT Corporation. In the LLVM track tomorrow my colleague Masashi also presents this AOT compiler, so please check it out. And integration of the container ecosystem with browsers is also needed: as I mentioned, containers run into the CORS restriction, so currently accessing OS package repos from the browser is not possible, and also, in terms of container registries, as far as I know public container registries don't allow CORS access.
So in this area your help is really needed: if you know technologies or repos or registries that allow CORS access, please let us know. Graphics support is also on our milestone list. So this is the summary of this talk. On-browser VS Code lacked a Linux terminal running completely inside the browser, and vscode-container-wasm, an experimental extension, enables running Linux-based containers and their terminal inside the browser. There are two options for distributing containers to browsers: the first is pre-converting containers to wasm images, and the second is distributing OCI container images to browsers directly. And that's all of my talk. Thank you very much. Do you have any questions? Yes, please. Can you run Firefox inside the container? Okay, so the question was Firefox inside the container, so Firefox inside Firefox. I haven't tested it yet, but I believe it's possible. I don't see a practical use case for it, but I think it's possible. Yes, of course. Thank you for the question. The question was about using QEMU as an alternative to v86 and TinyEMU. I think this is a very good question. Actually, in the container2wasm repo we have an experimental branch that integrates QEMU TCI into this extension. In terms of TCG, we haven't integrated that yet: TCI is an interpreter, while with TCG we need a way to run the generated code, and that is not obvious in a wasm environment. But we are looking for a way to integrate QEMU into container2wasm, so this is definitely on our milestone list. Thank you very much.
Debug your stage-1 systemd with GDB and the NixOS test framework
So, my name is Julien, and this is Ryan and Linus, and we are three NixOS developers. Today we are going to talk to you about a situation we had during a sprint, where we found ourselves in need of debugging our systemd in the initrd. I'm going to talk about why we ended up in this situation. Then Ryan is going to talk about what the NixOS test framework is, and test frameworks in general. And then we are going to showcase how we did this specific fun piece of debugging. So first, let me motivate the situation a little: we wanted to work with encrypted secrets in the initrd. As you may or may not know, the initrd, or initramfs, is the initial file system loaded into RAM as part of the boot process. It is supposed to contain everything necessary, in terms of drivers and executables, to mount your root partition; that is its main goal, to mount your root partition and continue the boot process. But in some cases, especially when your root partition is encrypted, it also needs to acquire the key to decrypt and mount it. This can be done by displaying a user prompt where you input your password, but it can also be done, if necessary, by starting an SSH server that you connect to, type your password into, and which then mounts your root partition. And for that purpose, you sometimes need secrets stored in this initrd, for example an SSH key. The problem is that if you have an encrypted system, you have to start from something unencrypted, and this initrd image is not encrypted. So if you just put secrets in this image, then anybody reading your boot partition has access to the secrets. As NixOS developers, we wanted to have an option where those secrets could actually be encrypted.
Currently in NixOS, the secrets are just put plainly in the boot partition, with the drawback I was just describing. So we wanted to find a solution, and the solution is that we have an option to use systemd as the init in stage 1. So we use systemd in stage 1 instead of a scripted init script. And with systemd we can use something called systemd credentials, which is basically a systemd facility whose sole role is encrypting and decrypting secrets, and it can do this using your TPM. So what you can do is use the same TPM in your initrd, and this way you have secrets that were encrypted while your system was booted, and that systemd in stage 1 is then able to decrypt during the boot process. So why all this? Where was I going with it? I tried to implement this in NixOS, and what we found out (I don't know if you can read this particularly well, but this is the log of the boot process) is that systemd is running in the initrd (it says here "running in initrd"), then it says it loaded the credentials that I tried to pass to it, and then it hits an assertion in some function and says: okay, I'm bailing out early, goodbye. It's crashing. So the question is: how can we debug this kind of thing? One of the things we considered at the beginning was to use the NixOS test framework, because it lets us get into a very constrained situation where the bug may be easier to find. And now Ryan is going to talk to you about the NixOS test framework, which was the main enabler for us. So the screenshot you just saw earlier was a screenshot of the NixOS test framework. You can see that it's a VM test, and we can repeat that VM test very easily. What I'm getting at is that in NixOS, as NixOS developers, we have this test framework that we use a lot, and here is a screenshot of another test framework, OpenQA, which is used by other distributions.
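As a rough illustration of the systemd credentials mechanism described above (this is a minimal sketch using systemd's own CLI, not the NixOS integration from the talk; it assumes systemd 250 or newer with a TPM2 device, and the names and paths are invented examples):

```shell
# Seal a secret against this machine's TPM2. Only systemd on this
# machine, with this TPM, can later decrypt the resulting blob, so
# the blob itself may sit on an unencrypted boot partition.
echo -n "my-luks-passphrase" | \
  systemd-creds encrypt --with-key=tpm2 --name=luks-key - /boot/luks-key.cred

# Later, e.g. from systemd running in stage 1, the same TPM unseals it:
systemd-creds decrypt --name=luks-key /boot/luks-key.cred -
```

Services can consume such blobs via `LoadCredentialEncrypted=`, which is the shape of what the speakers wanted available from the initrd.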
But basically, what is interesting with debugging is that when you debug, you want to debug a particular situation: the one where you are hitting the bug. In our context, using the NixOS test framework, writing the test first, is a way for us to automate getting into certain particular situations, including the ones we are interested in debugging. So for us the NixOS test framework is a way to facilitate debugging sessions, a way to write code that lets us explore various scenarios and to triage and bisect very easily any sort of dependency. In the distribution context, we really care about system-wide testing. So I will just do a very quick intro on that. There are two components I will define. There is the driver: the code you write to assert the invariants that you care about. Taking the example of the systemd credentials, you want to assert that the credential you decrypt contains the contents you are expecting; that's an invariant. You also have the setup: how do you bring the system to the state that you care about? We need to prepare an image that contains a systemd credential with the contents we will be expecting, and that's the setup code. Both of them are usually written in some sort of domain-specific language: that could be a bash script, that could be C, that could be Python. And I made a very simple state-of-the-art table, which is not exhaustive, but which I find very interesting to compare. For example, another project that needs a complicated integration testing framework is the kernel, and they do have solutions to test file systems and various things.
And you can see they all have their own DSL, whether it's bash or any ELF program or executable that you can run on the system, and they use some sort of emulator to give you environments: full system emulation, networking, VLANs, so that you can reproduce any sort of environment. I find it interesting that I'm not aware of any other operating-system-wide integration testing framework apart from OpenQA and the NixOS test framework, which is just a bunch of bash and Python scripts cobbled together using the Nix domain-specific language and the Nix machinery. The biggest difference I find between the NixOS test framework and the others, which enables us to do some interesting stuff, is that usually you have one language as the domain-specific language, Python or shell or something, but in the case of the NixOS test framework you can use both. You can use Python and Nix together, so you can interpolate Nix code inside of Python code, and you have two levels of DSL that let you reason at build time but also at run time. That's why I do the funny thing of saying Python-in-Nix for the driver and Nix-in-Python for the setup, because you think about run time and build time differently at that moment. And to give you an overview, the NixOS test framework can offer you, like OpenQA, OCR machinery: you can run a VM, spawn a Chromium instance, and use the OCR to read the window title, for example in a GNOME desktop environment, and verify that it is indeed the window title you were expecting. All of those tests run in our CI automatically for every so-called channel bump, which is a roll-up of a lot of commits in the nixpkgs repository, basically. What I think is very interesting in our case, and what enabled us to debug this problem very quickly, is that there is a secret sauce to our test framework, which comes from the fact that we use the Nix DSL here.
So the Nix DSL gives us a way to describe packages, systemd units, and various things, and it's a functional programming language. That means you can write functions that abstract a certain test scenario, and then write more code to make more advanced assertions on that environment. For example, I picked a very bad screenshot, sorry, so I will describe it. We have ZFS in NixOS, and ZFS is very complicated to maintain. I am a maintainer of ZFS, unfortunately. It is complicated because it's an out-of-tree kernel package that often has ABI breakages with the kernel, for many complicated technical and legal reasons. And to make the burden realistic for maintainers, you need strong testing. So we are able to do matrix testing over multiple versions of ZFS, multiple versions of the kernel itself, multiple versions of stable versus unstable, and we even have a variant for the systemd stage 1, because NixOS has both stage-1 implementations: the scripted stage 1 that Julien described, and, experimentally, the systemd-as-PID-1 stage 1. So we are able to test all those scenarios and understand what is going on in not a lot of lines of code. And here I will pass it on. We tried a lot of things. We tried to isolate the problem with the NixOS test framework; we were able to patch things easily. But even so, we were not able to find the root cause, so we moved on to more powerful tools. Thank you. Yeah. So there we were, trying to work out how exactly systemd was crashing. It was dumping its core to a file in the temporary file system and promptly exiting, causing the kernel to panic, and that is not a persistent file system, so we had no way of recovering that core file. So we decided to try and run GDB in the initramfs, but we quickly abandoned that idea, because GDB is big and doesn't fit into an initrd that well.
Thankfully we have gdbserver, which anyone familiar with GDB might already know about. With gdbserver we can either launch a process as a child of the gdbserver, have it listen on a TCP port, and then attach to it with a separate GDB client process. That doesn't quite work if you want to debug your PID 1, because PID 1 can't be the child of another process. Thankfully it also has a mode where you can attach to a running process. So in this example we're launching sleep infinity in the background, then running gdbserver to attach to that, and likewise attaching to that gdbserver using a GDB client. Now, how do we do that for PID 1? We have to put gdbserver in our initramfs and then have it target the PID 1 inside the initramfs. The tricky part is that we want to debug systemd, but because systemd is crashing, we can't use systemd to launch gdbserver. So we go back to having a shell script as our init, and that shell script launches the gdbserver, has that gdbserver attach to the script itself, and then execs systemd. The first thing we do is launch gdbserver and have it attach to $$, which in this case is going to be 1, the PID of the shell script, and background it, because otherwise bash is going to wait for gdbserver to exit, and gdbserver isn't going to exit. Then we sleep 1, because the gdbserver needs a moment to start up and actually attach, and then we exec systemd to actually do our debugging. That ended up actually letting us debug it, and Julien has a recording of what that looked like. Thank you. So let me try to put this demo on. Basically, what we did; I'll try to comment on it as it goes. Oh, this is not right; it's not doing what I want. You can exit the full-screen mode and then full-screen it again. Take your time. Yeah, okay.
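The stage-1 init script described above can be sketched roughly like this (a minimal sketch; the paths, the port number, and the exact gdbserver arguments are assumptions, not the script from the talk):

```shell
#!/bin/sh
# Hypothetical /init for the initramfs: let gdbserver attach to this
# script, then replace the script with systemd via exec, so the
# debugger ends up attached to systemd as PID 1.

gdbserver --attach :1234 $$ &   # $$ expands to 1 here; backgrounded, since
                                # gdbserver never exits on its own
sleep 1                         # give gdbserver a moment to actually attach
exec /lib/systemd/systemd       # replace the shell with systemd, keeping PID 1
```

From the outside one would then connect with something like `gdb /path/to/systemd -ex "target remote :1234"` and `continue` until the assertion fires.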
So on the left side we are running our test framework virtual machine, and you see the virtual machine is not starting yet, because it's waiting for us to attach from GDB, which we do on the right side. As soon as we attach through this socket, which is called hello, the virtual machine starts and GDB loads the symbols, and when we do continue, the virtual machine boots. This first virtual machine, as you see on the left, is the installer virtual machine. It's going to install NixOS on a disk, populate the boot partition and everything, and put the credential in it; then we restart it and we will hit the bug with systemd. So what you see here is just the log of NixOS installing itself, and this first GDB instance won't do anything useful: because we changed the init script, we had to change it in both the installer VM and the installed VM, so here we are only going through the first part, which is not really the part we are interested in. It should not take too much time; I can fill in a bit meanwhile. What is interesting here is that you can see we have a fairly complicated setup to initialize systemd, initialize the installation, and all that. And this is the second VM booting now; all of this is automated. So we are re-attaching with GDB: the VM is now booting, and it's stuck waiting for GDB to attach. When I do this it doesn't work, but when I attach properly, it reads the symbols, and now when I do continue, I will hit the bug we were trying to debug. We are hitting it now, and we can see a backtrace. So yeah, that's it. By reading this backtrace we found the bug we were looking for and were able to open a PR to systemd and fix it. And that's it. Do you have any questions? Do we have time for questions, actually? Yes. Oh, that's good.
You said that you couldn't have systemd be the child of another process, so you couldn't have GDB start and run it. Why not? Yes. Do you want to answer this question? So the question was why systemd has to be PID 1. It's because our bash script won't reap zombie processes, which only PID 1 can do, and because there are various bits in systemd which require it to be PID 1, especially if you are running it in the initramfs, because it needs to actually switch into the final root file system, which you can't do as just any process. I don't understand how and when the ownership moves from gdbserver to systemd, because you attach gdbserver to itself, then you hit continue. The question was: you don't understand when control goes from gdbserver to systemd. The init in this case was a shell script, which launched gdbserver in the background, and then the shell script replaced itself with systemd; the gdbserver was attached to the shell script. Any other questions? Just a matter of curiosity: why do you say it's a problem to put the whole GDB binary into the initramfs? So the question was why it's a problem to put all of GDB in the initramfs. It's fairly big, and a big initramfs can be a problem, especially with boot partitions of limited size. Also, we might not have the terminal-control bits and pieces necessary to make actually using GDB enjoyable, whereas with gdbserver we can even attach a graphical front end to GDB, or something similar, to the target. And the debug symbols and the sources? Yes, exactly: GDB needs to access the debug symbols and the sources, a good point. The question was: if we are using a TPM anyway to store the disk encryption keys, why would we need to store more secrets in the boot partition to do anything else? I think there are many use cases here. For example, imagine you run an SSH server in early boot to obtain another part of the key.
So you store one part of the key in the TPM2 and another part on a server, and the server asks you to prove your identity, or something like that; then you need to have your own identity somewhere, because otherwise the other side doesn't know if it's the true server asking for the other part of the key, and that means you need private SSH host keys to be stored somewhere. So to confirm: in general, if you haven't configured something like an SSH server and explicitly put a secret in your initrd, you're not going to get one. If that's part of your setup, or you want to split the key up and get it from different places, for example, this can help you do that. So again, to repeat what you just said, and I agree with it: this sort of approach is useful when you have more secrets than just the disk encryption secret in the TPM2, when you have identity attestation, or more parts of the secret somewhere else, doing Shamir's secret sharing schemes, to be more precise, and it makes sense in those use cases. We still have three minutes. Is this already upstream, with the TPM used in the initrd? Can you repeat, sorry? Is this already in upstream Nix, in nixpkgs with the TPM2? Yeah, so, do you want to answer? Yeah, okay. The question is: is this way of storing secrets in the initrd already upstream? The answer is no. We have a few dependencies that are necessary. One of them is booting from systemd-stub, because systemd-stub can measure the credentials you're passing. There are PRs open; if you are a NixOS developer, do review them, please. But it will come soon, I think, with systemd-boot, and there is also work being done in Lanzaboote for the same features. So both are going to be available soon, I guess. Related: is this one of the things that's kind of on the road for Lanzaboote? I'm the maintainer of Lanzaboote.
So the question was whether this is part of the work to upstream Lanzaboote, which is a secure boot component for NixOS. It's a bit special to NixOS because we have so many generations. The answer is: this is in the ecosystem of those things, so yes, basically. Thank you.
GDB on Windows: status & plans
Should I start over? No. Hello, checking, sound check. Sorry, everyone at home. All right. So, not asynchronous. So you move this to a separate thread, and then there's a way for one thread to communicate that an event happened. We made this change in GDB fairly recently, in GDB 13; before that, the debugger really blocked: you continued the execution and couldn't do anything else until the inferior stopped. So that was something that was improved in GDB 13. Let me skip ahead. So now, in GDB 13, and this matters mostly to IDEs, the IDE can press the continue button, the inferior is now executing, but the IDE at the same time can execute GDB commands: disassemble, insert new breakpoints, search symbols, things like that, while the inferior is running. Before, it couldn't; the IDE would have to stop the whole program first. Going back: this other function, the counterpart of waiting for an event, is the one you call when you continue after an event. It has this parameter where you can pass either of two macros (DBG_CONTINUE or DBG_EXCEPTION_NOT_HANDLED). This is basically like in GDB when you get a signal: you can decide to pass the signal to the inferior or not, suppress it or pass it, and when you pass it, the signal handler runs in the inferior. There's something like that on Windows; similar enough, but they call them exceptions, not signals. With this function you decide whether to suppress the exception or not: will the inferior continue processing the exception, or will it be suppressed? And it's important to note that you make this decision when you call this function. I mentioned this already. All right, keep that in mind. So this is, very basically, how the debugger works internally. All-stop mode is the default mode in GDB; everyone knows how it works. Here we have five threads, and this is time, time period one.
Everything is running, runnable. Then T3 is about to hit an exception. It hits the exception, you're calling WaitForDebugEvent, it returns saying an event happened, and the kernel freezes everything in the process: all threads are frozen, and this one got an exception. At this point the user is inspecting the program, debugging the actual bug, reading memory, backtracing, and so on. Finally they decide to resume execution, and that's when GDB calls ContinueDebugEvent and passes the decision of whether to suppress the exception or not. So that decision comes late, here. And then all threads go back to being runnable again. That is, if you want everything running, then a stop, then everything running again. There are times when you'll want to resume only one thread and leave everything else suspended, frozen. Internally, GDB needs to do this to step over breakpoints, but the user may also want to focus on a particular thread, leaving every other one frozen. The user interface for that is to enable the scheduler-locking setting. This doesn't currently work upstream on Windows, even though internally everything works, because GDB needs to know how to step over a breakpoint; it's just never been exposed to the user, nobody wired it up to the backend. So I did a little change in my work, and it actually works. It's the same as before: the exception triggers, the user inspects the program, and then decides to resume T1 instead of T3. GDB keeps everything else suspended and calls ContinueDebugEvent for T3, because that's where the event came from, and now T1 is runnable. But what if you want the converse: instead of running one thread and stopping everything else, you want to stop one and leave everything else running? That's what's called non-stop mode, and this is what I wanted to make possible on Windows, because it has been supported on Linux since 2008. I know because I worked on that.
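For reference, the two thread-resumption behaviours discussed above correspond to standard, documented GDB settings (whether they work on native Windows is exactly what this talk is about):

```shell
# .gdbinit-style fragment
set scheduler-locking on   # "continue"/"step" resume only the current thread
set non-stop on            # a stopped thread stays stopped; the rest keep running
set pagination off         # commonly set alongside non-stop mode
```

Note that non-stop mode has to be enabled before the inferior starts running.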
So, a long time by now. It's also supported on remote targets, meaning gdbserver for Linux, but also some other embedded systems out there support this mode. But native Windows debugging does not. Non-stop mode means only the thread that hit the event, the breakpoint, reports its stop to the user, and everything else continues running. This is interesting, again, mostly for IDEs: you can imagine a big list of threads and only one of them reports an event. But it's also interesting because maybe one of the threads is important to keep running: maybe it's a watchdog, or something that needs to ping a server, and if it stops pinging, the program doesn't work; there's something on the other end that needs to see this while you inspect whatever the debugging was triggered by. And the reason I thought, all these years, that this wouldn't work on Windows is this problem: WaitForDebugEvent, that magic function that reports the event, suspends everything already. The kernel does this, and there's no way to tell the kernel not to suspend every thread except the one that got the event; we want to leave them running. So, naively, I thought: maybe just immediately suspend, block, freeze the thread that you care about, and call ContinueDebugEvent, right? But you can't, because that's too early: we just got the event, and the user hasn't yet decided whether to pass the exception or not. That only happens afterwards. And then, looking things up this past year, I noticed on the Microsoft website describing these APIs that they introduced a new flag for ContinueDebugEvent (DBG_REPLY_LATER). I read it and was like: really? It's like they wrote this just for me. They're awesome. Well, it's not the ideal thing I would like; I would like a way for the kernel not to freeze everything, and it still freezes everything. But if you pass this flag, what you're saying is: I got the event. Okay, cool.
But I don't want to handle it right now. So I call continue anyway, and I'm asking the kernel to report the event again as soon as the thread becomes runnable. So what I do is: I call SuspendThread on the thread that got the event, so it's no longer runnable, and I call ContinueDebugEvent with that flag, saying: give me back the same event again once I make the thread runnable. That's what it's saying, in other words. How does this actually work in practice, using the same diagram as before? I prototyped this quickly with a hack, and it worked. Amazing. Now I just need to make it clean. And of course that's... oh, sorry. So, same as before: everything is runnable, and T3 is about to raise an exception. It raises the exception, and the kernel freezes everything; there's nothing I can do to control that. Then I freeze the thread that got the event, and I call the function with this new magic macro, and GDB remembers that T3 will get a repeated event later. Now the user is inspecting the thread, but everything else is running: the kernel paused all the threads, but I immediately told it to resume everything else. So there will be a small freeze, there will be increased jitter caused by the debugger, but most of the time all the threads will be running. Later the user decides to re-resume T3, and the debugger just calls ResumeThread, unfreezing the thread. And remember, because the thread is now runnable, the kernel is going to re-report the event; and because we recorded earlier that we would get a repeated event, the debugger knows: okay, it's a repeated event, now I need to call ContinueDebugEvent with the proper flag saying whether to suppress the exception or not. Yeah. And a colleague of mine wondered: does this work when multiple threads hit the breakpoint before you decide to resume? Yes, it does. Same thing as before.
And here you are looking at this thread, and this one raises an exception; everything works. You can look at this offline if you want. There's a lot more to this. Once the hacky version worked, I needed to make it clean, and there I stumbled on a lot of things that I don't have time to go over right now. I'm going to touch a little bit on the test suite. How much time do I have? Three minutes. Plus five. Yeah, okay. All right, so I put this in the abstract. The reason is that when I talk about the test suite, I need to make this distinction: when I say GDB on Windows, there are actually two ports for Windows. There's GDB compiled as a Cygwin program, and there's GDB compiled with the MinGW toolchain, which means it's a native Windows program. Cygwin, for those who don't know, gives you a POSIX environment. It's a collection of tools, but also a runtime, a DLL that every tool is linked with, and this runtime provides POSIX things like signals, PTYs, and a bunch of other stuff. The C runtime used is not the one that normally comes with Windows; it's based on newlib. It tries to be as close to a Linux environment as possible, so that you can recompile a Linux application with minimal changes, quote-unquote. It works. So it's not an emulator; you have to recompile your program. Right. So the core of GDB has two ports: the event loop, for example, is based on select/poll for most Unix ports, and Cygwin is one of those, but the native version of GDB for Windows, based on MinGW, has a separate event loop based on the WaitForMultipleObjects function, which is the Microsoft version of select. But the backend, the code that talks to the debug API, those functions I mentioned, is shared between both ports. It's the same code, except that for Cygwin there's extra magic to make some Cygwin-specific things work.
And this is where I get to the test suite, because part of making this work and upstreamable was getting to a point where I was sure I wasn't breaking things, since making this work involved revamping the backend very substantially. So: run the test suite, right? Except that running the test suite on Windows is a major pain in the... The GDB test suite is built on DejaGnu. DejaGnu is an infrastructure built on Expect, and Expect itself is built on Tcl, which is a programming language. And DejaGnu assumes a Unix-like environment, which you don't normally have on Windows: it assumes POSIX shells and utilities (kill, cp, and so on), and there is no native Expect port. There was a company, ActiveState, that had something like that, but they killed that project some years ago. So you have to use something Unix-like to run DejaGnu. If you test GDB in a Cygwin environment, you just run make check, and it does work; it's super slow, and not stable, but it does work. But if you want to test the native Windows GDB, that's not the same thing; it's a proxy, but not the same thing. Remember, I said the core of GDB has different code paths, so I want to be able to test the MinGW GDB as well. So how about we run the test suite, DejaGnu, under Cygwin, but make it spawn the Windows GDB? Yeah, that's a potential idea. But the problem is: it's a Cygwin Expect spawning a Windows process, and the input and output are going to be connected to a PTY on the Cygwin side, but what the Windows GDB sees is just a pipe, because that's how Cygwin PTYs work under the hood. And when GDB is connected to a pipe, stdin is not a tty (isatty is false), so GDB disables everything that's interactive, and the test suite completely falls down.
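That isatty distinction is easy to demonstrate from a shell: the `[ -t 0 ]` test is true when stdin is a terminal and false when it's a pipe, which is exactly the check that makes GDB drop into non-interactive mode here.

```shell
# Run the same check twice: once with stdin explicitly piped, which is
# what the Windows GDB sees behind a Cygwin PTY.
echo | { [ -t 0 ] && echo "stdin is a tty" || echo "stdin is a pipe"; }
# prints "stdin is a pipe"
```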
And something else: DejaGnu expects the inferior to be run under a PTY, so there will be terminal mode controls. Time's up? But I have the five more? I'll give you one minute. I'm almost done, just 30 more slides... no, just one more. Right, so there are some ideas to get this working. There are also path-mapping issues, because what Expect sees path-wise, a /x/something-style path, is not what GDB sees: GDB is a native program, so it sees X:-style paths. Another problem is that when the GDB test suite wants to test multi-threaded things, the tests are all written with pthreads, which is not native to Windows, though MinGW-w64 does have the winpthreads library, so maybe we could use that. I have some ideas to try to make this work, but I haven't had time to actually experiment much with it. I tried other things that I thought would be interesting, but they didn't work. Right, so about compilation, just in case anyone here is motivated by this talk and wants to help: compiling GDB on Cygwin is super slow, so the way I got around it is to cross-compile, and there are some things here you can do. I can cross-compile to Cygwin, but to run the test suite I need to run it inside Windows; I can't avoid that. But I can point the test suite inside Windows at the GDB that I've built on Linux. Whew! All right, so maybe I should skip ahead. So: the test suite is in bad shape, and we need to fix a lot of things, that's the takeaway. And this is the thing for the future: make it possible for GDB to debug programs compiled with Visual Studio. That is something that is missing, and it's making people not use GDB on Windows; I would prefer people not to have to think about using other tools, you know, to stay in the GDB lane.
So at some point I would like to work on this, but, you know, no time for that. I'll just leave it on the screen in case people have questions. Maybe one question? Nothing. All right. Thank you. So. Okay, actually there is one minute left. Is there one quick question? Yeah. Okay, so here's my question. Oh no. Have you tried using Python to run the test suite, having GDB execute scripts and such? I have. That would be writing a new test suite. Yeah, that's right. I know there are actually some people, some companies, that do that, but I wanted to find a way to run the existing tests before giving up completely. Okay.
Yet another event sourcing library
Yeah, this is better. So yeah, I'll talk about the history, how we made some of the decisions we made, some things regarding Lambda and the project, and basically the point where we started to do most of the stuff on our own. Then I will go over the patterns that influenced the library, CQRS and event sourcing, I'll briefly show how the whole thing works with architecture diagrams, and then I will say why we actually decided to open source it. So the project started in 2019. Everyone wanted to do serverless, it was kind of a fancy thing to do at the time, and we also wanted everything to be managed by Amazon: we didn't want to monitor containers or run stuff around, we just wanted to give our code to Amazon and have it run, and Lambda was kind of perfect for this. We also had to keep the business logic vendor-independent; this is kind of a regulatory requirement. We consider our business logic the most valuable thing, so we isolated it from the infrastructure: the infrastructure part we can always rewrite, but the business logic we want to reuse. We wanted a simple API. We had had all these discussions about query parameters, headers, and so on, so I wanted to drop all of that. And we wanted to keep the data portable, so we could rewrite the library, move it to another language, and use the same data; binary messages stored in Kafka queues were not an option for us. With Lambda, basically the big problem is startup time. We wanted to use Clojure because we had lots of data stuff to take care of, so the biggest problem was, of course, the startup. GraalVM at that time was pretty new, and basically most of the stuff didn't compile. We tried the AWS SDK; it was a mess inside, it pulls in half of the Maven repository when you use it.
We also had an HTTP client library that we had to fork, because there was some stuff in it that didn't compile either; even Logback didn't compile until about a year ago. So then we started to build something on our own to make it simple. We created our own AWS SDK, because everything they do in all this magical SDK is basically a POST request to AWS, so in the end it was super easy to do. The first pattern we chose was CQRS, the command and query responsibility segregation pattern. The idea is that you have one place where you send commands, where you mutate data, and another place where you query stuff, and this influenced our implementation: on the HTTP side we just have two endpoints, commands and queries. You send in the body everything you want to do in the system, which means you can take the same body and send it to a queue, or send a batch of commands to an S3 bucket. This was great, because we could take the commands from a POST request, put them on a queue, or store them in an S3 bucket as a list of commands; it was super practical. The query side is also very simple, just a query endpoint. We implemented our own front-end client for this; it was 300 lines of code including mocking, retries, deduplication, everything. Basically, having this simplicity on the HTTP side made that possible.
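A minimal sketch of this "the body is the command" idea, in Python rather than the Clojure the library is written in (the handler names are invented for illustration):

```python
import json

# Hypothetical command handlers, keyed by command type.
HANDLERS = {
    "add-item":    lambda state, cmd: state + [cmd["item"]],
    "remove-item": lambda state, cmd: [i for i in state if i != cmd["item"]],
}

def handle_command(state, body):
    """A single 'commands' endpoint: the whole request body *is* the
    command, so the same bytes can arrive via an HTTP POST, a queue
    message, or a batch file stored in an S3 bucket."""
    cmd = json.loads(body)
    return HANDLERS[cmd["type"]](state, cmd)
```

Because the command is a self-contained document, replaying, batching, and storing commands all come for free.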
Together with CQRS comes event sourcing. The idea of event sourcing is that we will not store the current state of the system; we will store the events that happened. It is an old pattern, from the 1970s basically, but back then they didn't have enough resources to do it, so instead they invented the relational database model, where you just store the current state. Take a shopping cart as an example: instead of storing the current shopping cart, you would store item added, item removed, item added, and when the client asks "what is my shopping cart?" you go over the events and figure out the current state of the cart. The nice advantage of this is that everything is stored. For us this is very important: with event sourcing the audit logs are naturally there, everything is stored, and the database itself is immutable; we are just appending forward, so it's quite easy to handle from a security and information perspective. For our implementation we chose Postgres. We just store our events as a JSONB field with some metadata around it, so it was super simple. We have transactions, and because it's append-only, it scales very well; we have around one terabyte of data and we don't even think about adding new stuff there. We use optimistic locking: on the client side we just add a sequence number to every event, and a unique constraint in Postgres gives us optimistic locking, so it was super easy to do. So here is a simple diagram of how things look from the client's perspective. A command comes into the system and reaches our service, which embeds the core implementation. The core does four things: it takes a snapshot from the view store, then does the processing, whatever needs to be done, stores the response in the event store, and basically sends to the router
all the events and effects that were created. Events are, as I said, what stores the changes, and effects are the things that need to be distributed to other services. If I want to call service B, I never call it directly: I store in the database the things I want to send to the other service, and then they are distributed by the router. The router also sends events back to the service that needs to update its aggregate, the aggregate update goes to the view store, and then we go to the next cycle. A query is just a simple query: it goes to the view store and returns data to the client. One more diagram which is also important is how the core works internally. It does a couple of things. In the beginning we validate the request, and the important thing is that we check whether this request was already processed: we have a command response log where we look the request up. If not, we log the request in the request log, so all the commands that enter the system are stored there; if we need to debug something later, everything is collected there. And since everything is a body, it's super easy to store, whether it comes from a queue, a POST request, whatever. Then comes the processing of the request, which is the business logic part, and then we start the transaction. We can start the transaction at the very end of the request, which is quite nice from a performance perspective: we store the events, store the effects (the commands to the other services), and then we just mark the request as completed, so that we have deduplication afterwards.
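The shopping-cart example can be sketched like this (Python for illustration; the actual library is Clojure, and the event names are assumptions):

```python
def apply_event(state: list, event: dict) -> list:
    """Fold one event into the current shopping-cart state."""
    if event["type"] == "item-added":
        return state + [event["item"]]
    if event["type"] == "item-removed":
        return [i for i in state if i != event["item"]]
    return state

def current_cart(events: list) -> list:
    """The cart is never stored directly: it is recomputed by
    replaying the append-only event log from the beginning."""
    state: list = []
    for e in events:
        state = apply_event(state, e)
    return state
```

In practice a snapshot (the "view store" above) caches the folded state so that not every query replays the full log.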
Well, basically that's it. We started developing this internally; it was only meant as an internal library, and there was no open-sourcing process in the company, so this was also kind of an idea to start that process. The internal version was tied to one fixed infrastructure, so we used open-sourcing as an opportunity to expand the library as well. We mostly started using it in hobby projects, side projects, and added DynamoDB support for the event store, for example. This helped clean up the project: we did a big round of cleanup with proper abstractions, then we started adding different implementations, and we contributed the changes back to the internal project. We fixed a huge amount of bugs outside, which also helped the internal implementation. And we set up the open-sourcing process, so basically any team in the whole company can now open source what they want if they just follow these steps.
Yeah, so we have had a very positive experience with this library. We are now almost one year in production, we store everything, and this pays off on a daily basis. We even had the business side messing up hundreds to thousands of records, and we could recover them quite easily just by recreating the data from the database; everything is stored there. The auditors were super happy because we store everything; I even ticked off a lot of the audit items just by saying we store everything. And most of the time, if we had a production bug that clogged up the queues, we could clean up the queues and five minutes later just select what happened and put it back on the queue, so we didn't have to worry about figuring out what was in the dead-letter queue, what is useful and what is not. And because of the deduplication we didn't have to worry about sending some messages again. We have roughly one disaster to recover from every week, and it's super easy for us to do that. Yeah, that's it from my side. Questions? Excellent, next we can set up. Question: tell us a bit more about establishing open source in your company. So, yes, the question was about the experience of setting up the open-sourcing process in the company. This was actually a very painful experience: it took six months of negotiation with security, first to make them understand what we wanted to do, then why we wanted to do it, then talking to management and telling them why this is beneficial. But once we figured out all the rules we need to follow, it was quite straightforward, so we documented everything. It was a six-month process to get there. Next question: why did the architecture use Lambdas in the first place?
So, one side was because we had bursts. Sorry, the question is why we decided to use Lambda functions. Basically, in the beginning we had bursts of data: for example, in the morning we would get a bunch of data to process, and the rest of the day the system would process maybe three requests per hour. Lambda was nice because it scales quite fast. The other motivation was that it forces you to keep stuff clean: there's no caching, you have to really think about what you're doing, so it pushes developers in the direction of actually making things clean and not depending on something being stored somewhere in memory. And yeah, the third thing was that it was a cool thing to do, so it was nice presentation and marketing material for the project as well. Next question: you mentioned you use optimistic locking; why did you decide to use it? Was it because of Lambda? So, the question is why we do optimistic locking. We used Postgres from the beginning, but we used optimistic locking because we didn't want to even start the transaction until we were done. We declare all the dependencies we have, we fetch them, we process the data, and at the end we have everything that needs to be stored; at that point we open the transaction and do the writes. That means we fetch the aggregate, for example aggregate version 72, we process everything, we say okay, now it will be version 73, and if some other version 73 has appeared in between, Postgres nicely tells us there's a concurrency problem. We didn't want to lock anything in the database, we just wanted to keep it simple, and this was super easy to implement.
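Here is a runnable sketch of the optimistic-locking scheme just described, using in-memory SQLite in place of Postgres (the table layout and function name are my own; only the UNIQUE-constraint trick is from the talk):

```python
import sqlite3

# A UNIQUE constraint on (aggregate_id, seq) rejects any concurrent
# writer that tries to store the same version of an aggregate.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE events (
    aggregate_id TEXT    NOT NULL,
    seq          INTEGER NOT NULL,
    payload      TEXT    NOT NULL,
    UNIQUE (aggregate_id, seq))""")

def append_event(aggregate_id: str, seq: int, payload: str) -> bool:
    """Try to write version `seq`; return False on a concurrency
    conflict, i.e. someone else already wrote that version."""
    try:
        with db:  # transaction opened only for the final write
            db.execute("INSERT INTO events VALUES (?, ?, ?)",
                       (aggregate_id, seq, payload))
        return True
    except sqlite3.IntegrityError:
        return False
```

The caller that gets False simply refetches the aggregate at its new version and retries, with no database locks held during processing.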
I have a comment on that: our database uses optimistic concurrency control, and it actually gives much better scaling than traditional locking methods, and it's much more robust and more secure, so we can have a separate discussion about this later. Yes, let's have it, that will be an interesting discussion.
How to create the universal operating system
Welcome everyone. My name is Erotra. I'm glad my somewhat pretentious title has lured you all inside. I hope not to disappoint; hence the disclaimer that I have way too many disclaimers to elaborate on. What is an operating system? I had to look it up on Wikipedia. I did have some idea of what it could be, having used one for a few years. But it turns out, and this is slightly paraphrased because their definition is too long for me, that it's a software platform providing access to resources and services to run computer programs. Okay, great. I knew that. That's what I use it for. Excellent. The title is about the universal operating system, and universal, to me, would imply more generalisation. I've always felt that the computer, or computing, should evolve, and I hope that we can move towards freely sharing, using, combining, and understanding whatever we do with the computer. From my personal perspective, from my day job, one aspect would be safety. And I've added security, though I'm definitely not a security expert, in all the automation we make computers do these days and hopefully in the near future. Because apparently we are dealing with a few crises here and there, and I believe we have ideas to address those using information technology. I hope to learn from Jonathan, at the level of representing information, how this could be used in the future, but I'm sticking to this bit. I'm more operationally oriented, imperatively you could say, and I know this is the declarative minimalist computing room, so I'll try to bridge that. The first ingredient that I hope the future universal operating system might incorporate is definitely the microkernel. Richard Stallman proposed, for the GNU system a few years back, that it could have a microkernel. I would still love to see that happen. Work is being done on that in the community, and I hope to start contributing to it.
From a software point of view, I believe everything should be modular: small pieces, because I'm just a human, my head is limited, and my understanding and time are short. Things should definitely be decentralized, and client/server would be a natural way to structure the interaction between things. But I want to focus on language semantics that might help us move towards such a universal operating system, because I think if we add all of these ingredients, we're going to incur enormous complexity, and I'm not really sure that if we go on developing software the way we do, it will actually scale to the level we need our information technology and our operating system to scale. I'm going to illustrate that with a very silly example. This is actually the control panel of the cruise controller in my car; the picture comes from the internet, of course. I'm guessing most of you have heard of a cruise controller. Basically it's an electronic device in a car, and it runs a bit of software. But I want to use it as a metaphor to talk about small modular things that work in a larger environment with other small modular things: if you add and combine enough of them, complexity goes up, and we need to figure out a way to deal with that. So what is a cruise controller? I'm just going to read this out, because I haven't memorized it. Basically, when it's not enabled, hence disabled, the throttle, which is the thing under the hood that is normally controlled with your gas pedal, is fully controlled by the driver if he or she pushes the gas pedal. When it is enabled and you press the set button, it captures the current velocity and uses that to maintain that velocity over the course of time. There are exceptions: in my car, if I go uphill and I've set it at the lower limit, it will just drop the cruise control. And there are other reasons for it to fall back to human control instead of doing things automatically.
One of them is if you press the brake pedal or the clutch pedal: it has to stop, because it would be very annoying if your car kept going when you don't want it to. And of course, as a human, you can cancel it. Okay, this is all really boring. But basically, if we put this declaratively, we just want the damn thing to control our velocity. Done. Very abstract. And I think that is the way, in the future, a declarative way to do automation. But I've just been listing all of these pointless details which are still very abstract; if you look at the car in greater detail, there's a lot of imperative, stateful stuff going on, and that's what we're trying to figure out a solution for. So we've been working on a language which we call, also pretentiously, Dezyne, "design" spelled incorrectly: if you search the internet for the normal word "design", you'll never find us, so at least the alternate spelling helps search engines. Our language consists of interfaces and components, basically. Interfaces are behavioral specifications: they record the protocol, and I'll show you an example in a minute. These protocols are contracts of interaction between two components. Our components are, of course, modular, so they are completely isolated from the world by interfaces, and they are composable: you can stick them together and know that, as long as they maintain the protocol, they can cooperate properly. We have a formal definition of our semantics, meaning we actually express it in a formal process algebra, and I'll get back to that. We can simulate behavior at the interface level and the component level, we can produce running code through code generation, and we can automatically verify a number of aspects of interfaces and components. I'll try to show that by example. So let's start with an interface. The picture... sorry. Something like this.
The picture of the buttons on my steering wheel that I showed you is captured in terms of syntax here. There's the enable/disable button, the set button to select the current velocity to be maintained, a resume button or resume function, a cancel, and on the dashboard there's an LED that indicates whether it's active or not. The human is expected to interact in a specific way with the rest of the car, and that has been captured in this behavior section. Our language takes an imperative approach, so we define state; I just scrolled past that. We have two pieces of state, maintained in the variables state and setpoint, and now we describe the interaction of the human with the cruise controller in the car behaviorally. In other words, if the initial state is disabled, we would accept an enable and become enabled. To dive into this further, I'll show you a picture of what this could look like. Let's see. Sorry. So this is the state diagram of the text I was just showing you, generated from that definition. This is slightly more human-readable, and we have sort of an intuition for it, I think. Now I'm going to make it slightly more complicated. Let's look at the component, the cruise controller component itself. It is specified similarly; almost there, yeah. We use the same language, the same concepts. We define the behavior of the component itself, but now it receives its messages through ports, and the cruise controller is supposed to interact with the different actors in the system: the human behind the human-machine interface, the pedals, the throttle, and a timer, which I will not go into. I won't go through this behavior in all its details; I just want to show you the following. Sorry, I have to give one more example. The thing that I really want to add, which we have recently done, is an extension. Let me start over.
This is the formal semantics of that behavior, which we can feed to a model checker to check properties. And let me feed it to the model checker. It checks all of our default properties and the user-defined properties, which I will now show you. So what we have just checked is that the component adheres to all of the interface contracts, and that it adheres to the invariant predicates. One of the invariant predicates: you may have heard that there are cruise controllers that accelerate unintentionally. I have tried to encode that in terms of the state of the environment which the cruise controller is trying to control. In this case: if the human has not activated the cruise controller, it should never actively control the throttle. That's recorded like this. And I can actually make that fail by commenting out a throttle reset, and then that property will help us find a sequence of events that leads to this illegal, unwanted behavior. Okay, this was very detailed; I'll try to wrap it up. Oops. So I have to make the link to the universal operating system ahead. I foresee that we will build a modular operating system, and because of the modularity and the distribution, the cooperative complexity goes up, and I think we've figured out a way to leverage model checking to help us there. In the near future I'm looking forward to adding that to our development. In the coming year we had already planned to extend the scope of verification to include data contracts. If you want to know more, just come and find us online; here are the details. Excellent. Thank you. Question: this system is GPL, it's out in the open? Yes, you can find us on Savannah. It runs on Guix: guix install dezyne. Can you tell us a little bit more about the automatic verification of the model? Right, that's the magic part. Yes.
We actually transform the model into mCRL2, a process algebra with a modal mu-calculus that allows you to specify formal properties and capture the formal behavior. So effectively, the execution semantics of the code that we generate is modeled in mCRL2. We verify the entire state space of that code, which is more efficient than trying to test all the code, and we have a completeness guarantee: when it finds nothing wrong, there is really nothing wrong. It's not a matter of not having had the time yet to find something. Exactly. But there are always aspects that you cannot represent, which are also important. You're welcome. More questions? When verifying the model, does it cover the whole solution space? At the component level, yes. You should repeat the question. Sorry: the question was whether verifying all of the properties expands the entire solution space. That is exactly what we do at the component level. The interfaces allow a certain behavior, and you want to expand that entire behavior, synthesize it, go through it, and figure out whether there are any problems hiding there. That's what we do. Final question: is it used in production? It's used in production, oh yes. Our biggest customer currently is Thermo Fisher Scientific. They make these huge electron microscopes, and I believe they've got about 1.2 million lines of our code running. Another question? Yes. Thank you. Is it also possible to create distributed systems with Dezyne? Currently, no. But I hope to integrate with what Christine will be talking about very soon, and that will solve that bit. Great. Thank you.
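As a rough illustration of the kind of protocol the Dezyne interface in this talk captures, here is the cruise-controller state machine in plain Python. The states, events, and error handling are simplified guesses at the behaviour described above; real Dezyne specifications look nothing like this and are verified by a model checker rather than at run time:

```python
class ProtocolError(Exception):
    """Raised when a caller violates the interface contract."""

class CruiseControl:
    def __init__(self):
        self.state = "disabled"   # disabled -> enabled -> active
        self.setpoint = None

    def enable(self):
        if self.state != "disabled":
            raise ProtocolError("enable only allowed when disabled")
        self.state = "enabled"

    def set(self, velocity):
        # Capture the current velocity and start maintaining it.
        if self.state not in ("enabled", "active"):
            raise ProtocolError("set requires the system to be enabled")
        self.setpoint = velocity
        self.state = "active"

    def brake(self):
        # Brake or clutch must always return control to the human.
        if self.state == "active":
            self.state = "enabled"

    def disable(self):
        self.state = "disabled"
        self.setpoint = None
```

The invariant from the talk ("if the human has not activated it, never actively control the throttle") would here correspond to: the throttle is only driven while state is "active".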
How much math can you fit in 700K?
So during these two minutes, I'm going to ask a few questions. I think the sound is better like this. Can you hear me? Yeah. So I heard a comment that color was not allowed, so I hope you won't mind if I use 3D instead; but the screen of the actual device I'm going to talk about is black and white. Who uses a calculator from time to time? Who uses the calculator on their smartphone? Yeah, it's the majority. Who uses HP-style calculators? Not that many. Who uses calculators for binary computations? Okay. Complex numbers? Matrices? Graphing? Okay, just checking. So I don't think the camera can zoom that far, right? So I can't show it that way, I suspect. Yeah, it'll be hard. But this is the device I'm talking about. (You're going to speak; I'll hold up a sign, 5 minutes, for question time. Yep. For me it is. I don't know what's wrong with my timer. It's Android.) Okay. So I'm Christophe de Dinechin. I work as a Senior Principal Software Engineer at Red Hat, on confidential computing; I'm giving a talk on that topic this afternoon. But today I'm talking about a pet project of mine called DB48X, which is an open source HP48-style calculator for modern ARM hardware. I talked about this last year, and I'm going to show how much progress we've made since then. I'll start with a reminder of what DB48X is. We'll review last year's future plans to see how well we did. I'm going to talk from one engineer to another; that's why I asked the questions at the beginning, to see why we need all this math in a calculator. I'm going to extol the virtues of 1980s-era efficiency, when there were only keyboards, no touchscreen, no fancy mouse, all that stuff. I'm going to explain how using much bigger numbers led to much less memory usage. And we are going to see a number of bells, whistles, and engineering units along the way. So I hope you enjoy it. Strap in. What is DB48X?
The idea is really to revive Hewlett-Packard's iconic Reverse Polish Lisp on modern ARM hardware. So that's what the original box looked like. A quick primer on the project: we want to, simply put, reinvent the best calculators in the world. Nothing more, nothing less. It's designed to run on existing hardware from a company in Switzerland called SwissMicros that makes these kinds of devices; you see the DM32 on the right and the DM42 on the left. The specs for the project are the HP manuals, and there are dozens of them. Unfortunately, they contradict one another, because the various calculators do not do exactly the same thing. It's implemented in a language called Reverse Polish Lisp, or RPL, which is a stack-based language, very powerful. It's based on a command line and on menus that you activate with the function keys below the screen. It has many data types and mathematical operations; I'm going to talk about this later. And there are many enhancements in the project compared to what HP did. Now, is this still minimalist? Well, you bet, because that machine has 70K of free RAM and 700K total for the program space, hence the title of the talk. It's a low-power Cortex-M4 at 80 MHz. The battery life is up to three years on this kind of battery, and one nice thing is that the screen is passive, so when you switch the calculator off, it displays a picture, and the picture stays there forever; that's why I have pictures of my wife on my calculator. The machine has only 96K of RAM, and if you remove the high-res bitmap and what the operating system needs, you get to the 70K I was talking about. So 96K is 1.5 times 64K, for the old-timers among us. It has only 2 megabytes of flash: there are 8 megs in the chip, but 6 are for a flash disk, and in the end about 700K remain for your program. That's less than a Macintosh floppy disk; they were 800K. The project did hit these limits quite hard.
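As an aside, the stack discipline at the heart of RPL can be shown with a toy evaluator (Python for illustration; DB48X itself is C++, and RPL is far richer than this sketch):

```python
def rpn_eval(tokens):
    """Minimal reverse-Polish evaluator: operands push onto a stack,
    operators pop their arguments and push the result, as when a user
    types 1.2 3.4 + on the calculator."""
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b}
    stack = []
    for t in tokens:
        if t in ops:
            b = stack.pop()      # second operand is on top
            a = stack.pop()
            stack.append(ops[t](a, b))
        else:
            stack.append(float(t))
    return stack[-1]
```

The absence of parentheses and precedence rules is what makes RPN so compact on a keyboard-only device.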
I'm going to explain how we worked around that. Last year I explained that I had to restart from scratch, from a project called newRPL, because we hit these limits. This year, around Christmas, I hit the limits again, so I had to restart from scratch once more, at least as far as the decimal computations are concerned; I'm going to explain that. So let's review last year's future plans. I think there is a problem with this one. Is this one okay, or is it... Yeah, okay. So back in 2023 I was young and naive, and I said a lot remains to be done. I was talking about adding complex numbers, vector and matrix arithmetic, about 1500 functions that were left to implement, and key features like plotting and graphing. So what did we do? Well, a lot of this was done. Complex numbers are available, and they are actually much better than the original: for instance, you have polar and rectangular forms, the usual notations, stuff like that. We have vector and matrix arithmetic fully implemented, with algebra, but also with exact computations, like fractions inside matrices, so you never get a rounding error, unlike on the HP calculators. That's the test suite. The test suite runs on a simulator on Linux or macOS, and it currently runs about 2,200 tests. Not everything is tested; that, for instance, is implemented but not tested yet. And we have plotting and graphing, at least the basic features, like drawing functions, with some nice enhancements compared to what HP did. For instance, we can have plots with various line sizes and plot patterns, which lets you draw multiple things on the same screen and see what the different pieces are. It just went by very fast on the screen here. So how did we get down to using only 70K? It's a story of ultimate over-engineering: C++ with garbage collection and ubiquitous bit-packing all over the place. Let me explain what I mean by that.
A C++ object typically looks like this. You have a class, and the way it is represented in memory is: you have a virtual table pointer, and then the value for the object; in this case, for an integer type, an integer value. And then there's some overhead for malloc bookkeeping, a linked list or a free list or something like that. So overall, for an object representing an integer value, you typically use 12 bytes, and that's on a 32-bit CPU. That lets you represent all values up to 4 billion, and it's fixed size: you can't move it in memory. Not good. Let's do better. The representation we use looks something like this. We use LEB128, an encoding that is used, for instance, in DWARF all over the place. It lets us encode the ID that identifies the type of the object in one byte for integers; we have 128 types that we can represent with one byte. And the value, if it's less than 128, is also one byte. That means I use only two bytes of memory, a 6x factor compared to the other representation, for all values below 128. And I can still go to infinity, because LEB128 is a variable-size encoding, so I can essentially have numbers that are as big as I want. It's now a variable-size object, and I can move it, so it's a vast improvement. That lets me have a memory organization where, at the bottom of memory, I have all the global variables, the global objects that I keep; it's essentially a name, a value, a name, a value, all packed together. On top of that, I have temporaries, with a temporary pointer that moves as you allocate objects, and then there is an editor scratch pad and the transient stuff on top of that. Because it's all contiguous, the way to reach the next object is to skip over the current one by reading the ID and computing the size.
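The LEB128 encoding mentioned here is easy to sketch; this is the standard unsigned variant as used in DWARF, written in Python rather than DB48X's actual C++:

```python
def uleb128_encode(n: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit set on every
    byte except the last. Values below 128 take a single byte, which is
    what lets a small integer object fit in two bytes (ID + value)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def uleb128_decode(data: bytes) -> int:
    """Inverse of uleb128_encode, reading until a byte with a clear
    high bit."""
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return result
```

Because the encoding is self-delimiting, walking the contiguous object arena is just "read the ID, compute the size, skip".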
So on top of memory, you have root pointers that point back to, like, the stack, the local variables, that kind of stuff, that point back to this memory area at the bottom. And the root pointers can point inside objects. That's a very important property for performance. For instance, if you follow the one link, you'll see that it points just behind, I think, the curly braces. It means it's part of a list, and I can put the value that is inside the list directly on the stack. So I can do the computations faster that way. And there is also a series of smart pointer classes, whose names end in underscore g in the source code, that let me have garbage-collected smart pointers. The allocation is super cheap, because it's essentially moving the pointer at the top of the scratch space, like this. So it's just one addition and one comparison, and the comparison is to see: okay, am I out of memory, do I need to garbage-collect? So, a very, very cheap allocation. The garbage collection itself: as your memory grows and you allocate more and more stuff, at some point memory gets low. The unreferenced temporaries, you no longer need them, so what you do is you copy the referenced objects down and you adjust the pointers, and then you move the editing part of the scratch pad down, and you reclaim your free space that way. So the good point of this approach is that there is no memory overhead at all. There is not a single byte used for metadata or linked lists or anything like that. The sub-objects, so pointers to objects inside a list, for instance, don't cost me anything extra either. If you know something about garbage collectors and you think of a mark-and-sweep garbage collector, for instance, it needs some metadata about sub-objects, and so that means you have extra costs for objects inside objects. And it's a single-pass garbage collector, so it's simple code, easy to maintain, but the downside is that it's slow.
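The "one addition and one comparison" allocation can be sketched as a pointer-bump arena. This is an illustration of the scheme described, under our own names, not the actual DB48X allocator:

```cpp
#include <cstddef>
#include <cstdint>

// A fixed arena; allocation bumps a pointer: exactly one addition
// and one comparison, as described in the talk.
struct BumpArena
{
    uint8_t *base;
    uint8_t *top;        // next free byte
    uint8_t *limit;

    BumpArena(uint8_t *buf, size_t size)
        : base(buf), top(buf), limit(buf + size) {}

    void *allocate(size_t bytes)
    {
        if (top + bytes > limit)     // the "one comparison"
            return nullptr;          // real code would garbage-collect here
        void *result = top;
        top += bytes;                // the "one addition"
        return result;
    }

    size_t used() const { return size_t(top - base); }
};
```

When `allocate` fails, a compacting collector copies the still-referenced objects down toward `base`, adjusts the root pointers, and resets `top`, reclaiming the unreferenced temporaries in one pass.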
It's essentially a quadratic behavior, number of stack objects times number of objects, instead of the linear or close-to-linear that you could get otherwise. So it's the usual trade-off of space versus speed. So why use C++ at all? Well, it's because of template metaprogramming, and let me explain why this matters. So the guy that you see in the photo there is a guy named David Vandevoorde, and he's a Belgian guy who initiated me to C++ metaprogramming back in 1998 when we were in the HP C++ compiler team. So the guys you see in the background are the HP compiler team back in 1998, and that guy is super, super smart and initiated me to template metaprogramming before it was even possible, so we were dreaming about doing these things. But now you can, and let me explain why it matters. I'm going to represent code as data using metaprogramming, not because we can, just for the sake of it, but because I have to. So let me talk about bug number 12 in our project. You compute 1.2 plus 3.4, and it hangs on battery power. So how do you reproduce this bug? You don't use the technique shown on the right. Instead, you simply type 1.2, then 3.4, then plus, and the calculator sits there, not doing the computation. And your users call you and say, did you even test the thing? So you scratch your head: how did I miss that? Well, the fact is it hangs only on battery power, and as soon as you plug in the USB cable, the computation resumes and you get the result. You can guess that I did my testing with the USB cable on. So what is this bug? This one was a bit hard to find. It turns out that the chip has an execute-in-place feature that is supposed to work on the external chip, something called the QSPI interface, except it just lacks juice when it's on battery. And so essentially it sits there waiting for the cycle to complete, and it completes it when you plug in the power.
Okay, so that means I have to move as much of my mathematics as I can into data that I can read from the QSPI, as opposed to code that I cannot put there. That's why I only have 700K; otherwise I'd have two megs. So how do I use C++ metaprogramming to do that? Let's see a description of an interesting math rule, and that's how you expand polynomials. So you know the rule; you see the first rule, for instance: X plus Y times Z, you turn that into X times Z plus Y times Z, and you see that's exactly what you see in the code. So the code contains essentially the mathematical formula as you're applying it. That's neat, right? Now, here's a guess. How many bytes of code does that generate? Give me a guess. Nobody wants to guess. Okay, that's the assembly code. 12 bytes. So that code generates 12 bytes of code, but it generates tons of read-only data, which is good, because I can move that to my QSPI. So the magic is this ugly metaprogramming code that generates constant arrays, and I taught the C++ compiler how to generate RPL objects from C++ expressions. Isn't that cool? And so that's how you get 12 bytes of code, tons of data that I don't care about, I have plenty of that data space free, and no execute-in-place needed. So in the end, how much math in 700K? Well, it turns out that for another reason, I'm now back under 500K, so I'm within the limit that we all heard about, the 640K that ought to be enough for everybody, right? So, from one engineer to another, what do we have? So we have base numbers for engineers in the computer field, that's really fancy. In any base: I can compute in base 17 or 34 if you want, or 3. With any size: you can compute on 13 bits or 512 bits if you want. We have complex numbers, useful for electrical engineering, and we have phases that are dealt with with exact results if we can, so like exact fractions and stuff like that. We have linear algebra, and here too, exact results when we can. Statistics, which is useful for field science.
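The idea of turning C++ expressions into constant, read-only data can be sketched with expression templates. Everything here, the opcode names, the byte layout, the `Expr` type, is invented for illustration; it is not the actual DB48X object format, only a demonstration of the compile-time technique:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical opcode IDs, loosely modeled on one-byte RPL object IDs.
enum : uint8_t { ID_X = 1, ID_Y, ID_Z, ID_ADD, ID_MUL };

// A compile-time "expression" is just a byte sequence in RPN order.
template <size_t N>
struct Expr
{
    std::array<uint8_t, N> bytes;
};

constexpr Expr<1> sym(uint8_t id) { return { { id } }; }

template <size_t A, size_t B>
constexpr Expr<A + B + 1> combine(Expr<A> l, Expr<B> r, uint8_t op)
{
    Expr<A + B + 1> e{};
    for (size_t i = 0; i < A; i++) e.bytes[i] = l.bytes[i];
    for (size_t i = 0; i < B; i++) e.bytes[A + i] = r.bytes[i];
    e.bytes[A + B] = op;
    return e;
}

template <size_t A, size_t B>
constexpr auto operator+(Expr<A> l, Expr<B> r) { return combine(l, r, ID_ADD); }

template <size_t A, size_t B>
constexpr auto operator*(Expr<A> l, Expr<B> r) { return combine(l, r, ID_MUL); }

// (X + Y) * Z becomes pure read-only data; no code runs at runtime.
constexpr auto rule = (sym(ID_X) + sym(ID_Y)) * sym(ID_Z);
static_assert(rule.bytes.size() == 5, "five opcodes in RPN order");
```

Because `rule` is `constexpr`, the compiler emits it as a constant array in a read-only section, which is exactly the property that lets the data live in QSPI flash while the generated code stays tiny.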
Degrees, minutes, seconds support, so if you're doing, you know, maritime navigation or stuff like that, that's really handy; you have a really nice shortcut for that. Unit conversions, if you want to land something on Mars without crashing it, because some guy in the US is using really ridiculous units. And symbolic processing, which is useful for math geeks. About 1980s-era efficiency: I have this magic menu, it's the key at the top, next to the A symbol, and essentially it selects the right menu depending on the type of the object on the stack. So, very few keystrokes to get exactly the functions that are most useful for what I'm working on. Equation data entry: I use a single key to enter the symbol that delimits expressions, the quotes in RPL, but once I'm inside an expression, I no longer need these quotes, so I hit the same key and I get parentheses instead. And same thing with the equal sign that you see at the bottom: it evaluates an expression, so it's the eval function of RPL, but if you're inside an equation, then it says, well, I'm inserting an equal sign because I'm writing an equation, and if I'm inside parentheses, it inserts a semicolon instead, to separate function arguments. Hexadecimal data entry, that's for you geeks. So when you type a hash sign, the cursor changes to a B, that's for base numbers, and now the ABCD keys, you don't need to shift them or anything, you just get ABCD. DMS data entry, dot dot dot, and yep, a one-key function. Okay. Just my conclusion: I cannot answer the question, because I still have 200K to go, so see you next year, guys. Thank you. So the next speaker can set up. Is there any time for questions? Yes, there's five minutes for questions. We'll leave one for the next speaker. Yeah. But I don't see the next speaker. No questions, seriously? Who wants to help with this project? I'll just give my laptop. Okay.
Does the calculator have a beeper? Yes. That's a good question. So let me... I'll use the voice. I think it will be a full row. So here we go. Okay.
RISC-V Bootstrapping in Guix and Live-Bootstrap
or, say, Guix integration or something like that. But in general, those APIs won't be used by Goblins programs for the most part. But we will provide the compatibility. Because we already started. All right, very good. Thank you very much. Hi, can you hear me? Yeah, right? Okay. How many people here are aware of the bootstrapping problem? Raise hands. Okay, that's good. That's better than I expected. That's fine. So, first of all, this is a disclaimer. I wrote everything I'm going to talk about in my blog. And I also gave a talk last year. So if you really want the nitty-gritty details about the bootstrapping process, go there. This is not going to be a very technical talk, okay? It's going to be just an explanation of what we did in the RISC-V world in the bootstrapping process on Guix and live-bootstrap. So, this is me, right? I'm a telecommunication engineer and a freelance programmer, and I work a lot on Guix. So maybe you remember me from last year; I gave this talk. There we explained the bootstrapping problem, if you have more interest in that. There are more slides on that and quite a long explanation of what we are doing and why. So this is the context. I work with NLnet. Last year they paid me, literally, to do some work on the bootstrapping process. I backported some RISC-V support to an older GCC, the 4.6. And I also backported RISC-V support to TinyCC boot, which is a fork we are maintaining in order to be able to bootstrap the compilers. I'm going to talk a little bit more about this later. So this was explained last year, so that's nice. So this year, I decided to continue with this project, but I was completely burnt out, and I needed help, because people always help, right? So I added more people to the project. These two are the ones that took on the most work in this port, and they literally gave me the energy to continue, right?
So Andrius is very interested in the project because he works on live-bootstrap and stage0, which are projects that are very related to this. We are going to see them later. And Janneke is the author of Mes and also the maintainer of TinyCC boot. We are going to talk about that just now. So let's see it in pictures, right? There are some colors, but I'm going to point, so if anyone has problems with the colors, no worries. So this is what we had before my project, right? We have stage0-posix, which is source code, right? Then with that, we build Mes, and with Mes, we try to build a bootstrappable TinyCC, which is a fork of TinyCC, but one that is easier, right? The C code it uses is simpler, to be able to build it. Then we try to build TinyCC, then we go for a very, very old GCC from the 90s, right? And then we go for a modern GCC, maybe with many steps in the middle in all of the parts, and then we try to compile the world with GCC. So now the colors. All this is the current bootstrapping process that is in live-bootstrap. We have it in Guix too. So this just works, but only on x86. So I'm working on the RISC-V port of all this. So the status of the RISC-V part was: these two parts at the top already had some RISC-V support. It was working pretty fine, okay? The bootstrappable TinyCC had zero RISC-V support. TinyCC was supposed to have some RISC-V support, but it was worse than we thought. GCC didn't have it, because these are very old GCCs. They were written before RISC-V was invented, so no support there. And the modern GCC that supports RISC-V is the 7.5 version. Then the world: some things support RISC-V, some others don't, but that's not my problem. I'm only working from here to the top, so don't worry about that.
So after my previous effort, I took the support from this GCC, which is kind of a modern GCC (well, 7.5 is not really, but anyway), and took that to GCC 4.6. There is a note here: this one is written in C, this one is written in C++, ha ha, I had a lot of fun there. And also, I took the support from here and I moved it to this one, right? So this was also, I think, like a 10-year difference between these two, so the APIs, the internal APIs, changed; many things are very difficult. GCC is horrible to read. Maybe the maintainer is here, I'm sorry, but it's really hard to read this project, I'm sorry. So at the time, we didn't know that this was orange, that this is not fully supported on RISC-V; we thought it was completely green, fantastic. No, it's not, so problems there. And this one, I finished this backport and I thought I was going to have issues with it, but it happened to be pretty much okay, so nowadays this is way greener than we thought at the beginning. So this is before what we did this year, right? Starting in June, we started working on this with the people I already mentioned, and now we got to this point, and this is already in live-bootstrap, and we have it in Guix, in the core-updates branch; this is already upstream in Guix. So up to here, everything works, so thank you very much. Good. So this part we already tested in a virtual machine; this part we tested, beyond the virtual machine, in real hardware, on a RISC-V board we have, and this also works: this GCC 4.6 compiling stuff for RISC-V. So a compiler that was written before this architecture was invented is compiling for it, so that's also very nice. We have it, yeah? So this is more or less what we did. There are problems, though: the arrows here are still red, and I don't like that. So why are they still red? So TinyCC requires some changes in the C library we have here, so we need to change those to make them work, right? Also, the old GCC requires make, which I managed to compile the other day.
And it requires some other stuff, right? It requires patch, we also need gzip, which I didn't have the time to compile, and some other things. Also, this jump is going to be kind of complex, because GCC really has a very complex build system. Maybe you've tried; it's a really complicated thing, right? So it should just work, but it probably won't. So, questions now, and I have some extra slides for later, but does anyone have any question? No? No, okay, extra slides. So we had some limitations in the backport we did, and this is what we have been playing with since June this year. So when I made the backports, I was working only using a cross compiler. So if you're working in a cross-compiler setup from x86, compiling stuff for RISC-V, you are going to have a lot of problems. Why? Well, first of all, you have the bootstrapping problem we're going to show in the next slide. And also, I was using glibc, which is a very powerful libc, and we don't have that in the bootstrapping process. There's no libc, so we need to play around with all the stuff we have, like the Mes libc, which is written by us, so it's probably not going to be great. We're not that good, after all. Also, there's the RISC-V assembly issue. In TinyCC, the RISC-V assembler they have doesn't use the same syntax as gas does. So our library was expecting gas syntax, and this doesn't provide that. And also, it doesn't support the extended assembler. So we can't really mix C code with assembly code very well, and we need to play around with all the variables, protect them, and do all those things by hand, and that's a problem. So this is how TCC is built. The graph I showed you before is just a lie, but it's a good lie. So this is how it works. We first build the Mes libc, we take some part of the code of TinyCC boot, and with that we build this one, and with that we build this one, and we change the flags of the code so we add more features. With that one, we build another one.
We take the code again, we build another one. We do this six times, and then, of course, all these steps need to work. There is a lot of bash glue code in the middle to make all this happen, and you have to fix that too. And fixing very old bash code written for this kind of thing is even harder than reading the compiler, but anyway. So then we check that this one and this one are the same. At the binary level, they have to be exactly the same. That means the compiler is not adding new stuff, so we have settled, and we can just continue with those. My colleague Andrius already tested that we actually settle at the fourth iteration, but we do six because we did six and we don't want to change. They did: in live-bootstrap, they only do four, right? So, problems with GCC. I only tested it, again, as a cross compiler last year, because I only wanted to see that it was able to compile things for RISC-V. And again, I wasn't doing the bootstrapping process of GCC. GCC internally, when you build GCC by hand, does a similar thing: they take the whole code base of GCC, they create a preliminary GCC, then they take the code and compile it again with the GCC they created, and then again, and then they compare. So I wasn't able to do that. And I didn't work on the C++ support either. So, the work we did: we started with TinyCC boot, and we started working on top of it. We had to read a lot; we spent many nights debugging crazy things, also because Andrius has a real job, not like me, so we needed to coordinate to do these kinds of things. It was really hard. Also, we don't have debug symbols, because our compiler is very simple, and implementing that takes a lot of time, and it's difficult. So we do all of this like with one hand; it's very hard to do with one hand, and also blindfolded. But we managed to do it. I wouldn't have had the energy to do this without Andrius, so thank you, Andrius.
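The rebuild-until-identical check can be abstracted as a fixed-point loop: keep rebuilding the compiler with itself until two successive binaries are byte-identical (the talk's "we have settled"). This is only a sketch of the idea; the real process is shell scripts driving actual builds:

```cpp
#include <functional>
#include <string>

// Iterate "build compiler with itself" until the output stops changing
// or we give up. The std::string stands in for a compiler binary.
std::string fixed_point(std::function<std::string(const std::string &)> build,
                        std::string seed, int max_iterations)
{
    for (int i = 0; i < max_iterations; i++)
    {
        std::string next = build(seed);
        if (next == seed)
            return seed;          // settled: the compiler reproduces itself
        seed = next;
    }
    return seed;                  // did not converge within the limit
}
```

In practice the chain settles after a few iterations (the fourth, per the talk), and comparing the binaries is what proves no new, uninspectable behavior is creeping in.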
Also, well, these are some errors. I explained them in my blog; you can come and ask me about them later. This one is a lot of fun, because the body was never executed, for any x; it didn't matter. This happens a lot in RISC-V, sorry, in TCC, and in our backend it exploded. Why? Because this is undefined behavior, and the whole compiler was based on it. So they used these shifts to clear bits, and we needed to check all the appearances of this and fix them all. So, funny stuff. Yeah, and we found many other things. You can read about them there; there's a very long explanation about all of that. Yeah, so we finally managed to build it; we have it, we have a recipe in live-bootstrap and in Guix. Yeah. So, about Mes. Mes was affected by our work on TinyCC boot, so we started fixing stuff there too. Why were there errors in Mes? Obviously because we are not perfect. Janneke almost is, but still. We had some issues because the bootstrapping process on i386 didn't use all the C constructs that appear on RISC-V, so we started fixing many things: the switch cases were wrong, structures were initialized to 22, I don't know why. These kinds of things. And I am almost there. Well, TCC is the same. We finally managed to compile it on a different machine, with C++ support, all of that. Okay, fantastic. So, last words. People are important. If you're alone, you don't work well. I had issues; I was completely depressed, burnt out. So bringing in people, giving me energy, the knowledge I lack, and emotional support: good stuff. Also, money is important. You all know this, but if you're getting paid, you work better, you don't feel stressed, you are not just trying to eat the next day; you just get paid, do your work, that's fine, that's good. You can focus. So, thank you to Andrius, to Janneke, and also to NLnet for the money. And you, for listening. Thank you. And now questions.
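The clear-bits bug described above is the classic oversized-shift undefined behavior. This is an illustration of the general issue, not the actual TinyCC code: shifting a 32-bit value by 32 or more bits is undefined in C and C++, and on x86 the hardware masks the shift count, so `x >> 32` silently behaves like `x >> 0` and the bug hides until another backend executes it differently.

```cpp
#include <cstdint>

// Code that relies on `x >> 32 == 0` to "clear" a value may appear to
// work on x86 (shift count masked to 5 bits) and break elsewhere.
// A well-defined alternative makes the out-of-range case explicit:
uint32_t shift_right_safe(uint32_t x, unsigned n)
{
    return n >= 32 ? 0 : x >> n;   // explicit check, no undefined behavior
}
```

Auditing a code base for every shift whose count can reach the type width is exactly the kind of tedious fix the talk describes.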
I don't know if we are in time. We have time for questions. Okay. Questions? Regarding both the people and the money, will you be continuing your work? Yeah, so regarding the money, the people and all that, will I continue with the project? We have funding and stuff to do still; I think the project finishes in one year. We started in June, so until June we're going to continue. I'm still working on it. Most of the budget is not spent, because we still need to finally climb up to GCC. So until June we go. Yeah. More? No? Yeah? I was listening to the Zig project. They used an interesting approach to this, where they use WebAssembly. So how is it working? They use the latest compiler to compile the compiler to WebAssembly. And then your problem on RISC-V is that you just need to bootstrap a WebAssembly runtime, which is very small, to run it on RISC-V. Do you think this kind of approach might work in your environment, or is that just very specific to Zig's problem? So the question is about how the people at Zig resolved their issue with the bootstrapping, and they are using a WebAssembly environment, and whether we can do the same, or whether that makes sense here. So our idea here is that we want to build everything from source on your machine. Why? Because if you get a Linux distribution, you download a Debian or whatever from the internet, you are getting a lot of binary blobs. So the idea is to start just from source. So that's not very compatible with the approach you are proposing, because you won't get sources. You will get some kind of WebAssembly thing. And that's not easy to inspect.
So what we have here is that you can inspect everything, starting from a very small binary that is written with comments, so you can read the comments on the binary that bootstraps everything. So the idea is philosophically different. And I'm a little bit upset about this problem with Zig, because I really like the language. And now, adding this WebAssembly thing in the middle makes it very difficult for us to add Zig to Guix, because we would have this kind of binary in there. And we don't really like that, because we want everything to be built from source. But yeah, the idea is good. It just doesn't match philosophically with what we are doing. Any time for another one? Pjotr? Yeah. No more? Okay. Right? Thank you, guys. Okay, you have one. Sorry. What about the ARM port? The ARM one? I don't know. I'm not sure. Maybe you should ask other people here, like Danny. But everything we are doing, the RISC-V port we are doing, is 64-bit. It's going to affect all the other 64-bit architectures. So we are making advances on x86_64, and ARM, and all things. So yeah. Yeah, shoot. All yours. Yeah. So for the ARM port, we got as far as compiling TinyCC, and that one compiles an old GCC. And that old GCC has a lot of problems that are well known. Nokia had a lot of fun back then with these bugs. So we are waiting for you to update GCC, and hopefully that fixes everything. So. Yeah.
Self-hosting and autonomy using guix-forge
So, good morning everyone. This is a talk about guix-forge. So, first let me explain what guix-forge is about. guix-forge is a Guix channel that has services that allow you to run a complete GitHub-like software forge, but fully on free software, and using existing free software components, like cgit and Git, of course, the Laminar continuous integration system, something like public-inbox, and so on. So, usually when we try to build GitHub alternatives, we have monolithic systems like GitLab or Gitea, Gogs and so on. What guix-forge tries to do differently is use old and very stable existing components like cgit, and assemble them all together into a system that resembles a software forge. And it is assembled together using Guix. So you have a nice declarative configuration that you can deploy practically anywhere. So, in a sense, it's like Mail-in-a-Box, if you have heard of that project; Mail-in-a-Box sets up a complete mail server on a system by integrating many different components. It's like that, but for software forges, and using Guix. So, first I'll start with a quick demo of Guix system containers. Guix is quite widely used as a package manager, but as a means to deploy a full operating system and operating-system containers, it's not so widely used. So I just want to quickly show you a demo of how that works. So, this is a really simple operating-system configuration. It just has an nginx service that listens on 8080 and serves a static directory. So, let me build that. The static directory has a simple HTML file that I just wrote up. So, first let's build the container. You build it using guix system container. And the hyphen capital N is to enable network access. And the container is completely stateless, not like Docker, where you have attached storage somehow. So you have to mount all storage, all state, into the container. And that's why we have the expose here. So, you get this script that has been returned.
So, if you open it, it's really just a Guile script that sets up the container and has all the dependencies built into the store itself. So, let me now run it. So, sudo... Yeah. It says that my Guix is too old, older than 30 days. So, I have started up the container. Let's just go to localhost 8080. And it works. So, this is just the static HTML page. Now, let's try to set up a container that actually uses the guix-forge channel. So, this is a more complicated operating-system configuration. Here, I want to show you the cgit service that guix-forge provides. So, it's really simple, and it just takes a server name, which is the domain name, and then the repository directory where all the Git repositories are stored. And then you have something called a forge nginx service, which is similar to the basic nginx service that you have in Guix upstream, but it automatically handles things like HTTPS: acquiring a TLS certificate, setting up a cron job to periodically renew the certificate, automatic redirection from HTTP to HTTPS, and so on. So, it does a lot of things in a very turnkey, fully automated way. You just push the button and you get it, essentially. And this is the ACME service configuration. So, ACME is the protocol behind Let's Encrypt. You have to register an email ID with that. So, that's my email ID. So, in this configuration, I'm currently using the staging URL. It's good for testing, because you won't run into any rate limits. But I'll actually take the risk and delete that, and we'll try to build with the real ACME server. So, here again, I'll build a container and run it. I'm mounting a couple of state directories: the ACME directory and the Git repositories directory. So, there it is. It started. So, I'll go to git.demo.systemreboot.net. So, initially, the container is set up with a self-signed certificate, so it doesn't work. So, let's actually get real certificates. So, find the shepherd. So, the PID of the container is 19262.
I drop into a shell, source the etc profile. So, guix-forge sets up a script under /usr/bin. Yeah, I'm inside the container. Yeah. So, I run the script. And the script has been automatically configured with all the domain names that need certificates. And now it is actually getting certificates from Let's Encrypt. You can see the logs; it's telling you what it's doing. Yeah, it has a certificate, and it has restarted the nginx service as well. Now, if I reload this, it should work with proper certificates. Let's try. Yeah, there you go. So, this is cgit. And you can browse some repositories that I put in there. So, cgit is really simple, but it doesn't come with all features properly enabled by default, and you have to do a lot of manual tinkering to get it to work. For example, by default, it only serves the dumb HTTP transport protocol for Git. But the cgit that guix-forge sets up has the smart HTTP protocol. That's one. And then you have things like... so, this cgit can render Org mode readme files, which the basic setup can't do. So, this is actually an Org mode readme file in this repo. Then you have things like syntax highlighting, which is automatically set up, again. So, let's just look at the makefile, maybe. Yeah, so, yeah, you see the syntax highlighting. For that it uses Python Pygments. So, my point is that guix-forge tries to do all this for you and doesn't expose all this complexity to the administrator. And all you're really saying here in this configuration is the domain name and the directory where the repositories are. So, it handles a lot of things with very sensible defaults behind your back. So, that's that. Yeah. How much time do I have? Okay. Okay. So, the philosophy behind guix-forge is that it has to be really minimalistic. I don't want to be running a full database server just to publish a few Git repositories and run a small project. And it should be as stateless as possible.
Of course, you need a little bit of state if you need a mailing list, or to back up your Git repos, of course. But it should not have hard-to-backup state like a database, which takes a lot of cognitive overhead to keep working successfully. It should be as turnkey as possible, but you should still be able to inspect it and fit it in your head. It should not be something so complex that you cannot hold it in your head. And effectively, what guix-forge and the guix-forge channel are doing is crowdsourcing server management, in some sense. So, on a regular server, for which you have to mutate configuration files, you are the only one who's in charge of the server. But when you have guix-forge doing a lot of things for you, you're essentially getting a community to help you with managing your server. And so, hopefully, that will reduce configuration errors and let you run a polished server setup without putting in too much work. So, that's it. Thank you. Nobody complains when the speaker is too quick, right? Is this a replacement for GitHub? Yeah, it's meant to be. What about the patch submission process and reviews and these things? Can we support them with guix-forge? Do you mean the email workflow? Yeah. Yeah, so I mean to support public-inbox based mailing lists instead of the pull request based model. I think that's easy to set up using existing tools, and personally I think it's better than the pull request based model. Questions? Yeah. So, I think you mentioned it's in a separate channel. Yeah. And are you planning to upstream it, and what would be needed for that? So, can we repeat the question? Yes. Yes. Yes. Sorry. So, am I planning to upstream it into Guix instead of having a separate channel? Certainly there are some parts that can be upstreamed. For example, the automatic HTTPS that I demoed certainly should be upstreamed.
But all the other services, I'm not really sure. So, I'm not sure how much of this fits into Guix upstream itself. We already have a cgit service in Guix upstream that doesn't do as much as the cgit service in guix-forge. So, upstreaming this would essentially break the old service. Maybe it should be called something else then. So, that's a difficult conversation to have. Could you have a meta service? Sorry? Do you have a service with all your special services? I do have a forge service. It's not fully integrated, but it aims to be a full meta service. Yeah. Can you show Laminar? Oh yes. I can show it in the browser. So, this is Laminar, which is a continuous integration system. So, this is a system that we are already running. It's not running on this laptop; it's running on a different server. And it's a really simple continuous integration system that is very easy to set up. Most continuous integration systems are so complex that they are really very enterprisey projects that are not meant for a single person to set up. But Laminar is really easy, and you should have a look at the documentation itself. It's just a single page of documentation, and you can set it up. So, we use that in guix-forge. And it fits in with the philosophy of using very minimal tools. We also have klaus in guix-forge. klaus is another Git viewer, which is written in Python. So, you even have a choice: if you don't like cgit, you can use klaus, and maybe we can support other Git viewers too. Sure. So, these are the Git logs. Maybe... Yeah, the makefile again. So, klaus is just a Git viewer. It doesn't do anything else. Yeah, it supports the smart HTTP protocol. Yeah. So, you mentioned that the TLS stuff is automated as well, but in the demo there was something that seemed kind of manual? Oh yeah. So, the manual step that I showed you is only for the first time. And after that, that same script is run as a cron job.
I need to get rid of the first manual step, but I think I need to patch something in Guix upstream for it to happen. So yeah. Question. Would it be easy to use this process to set up your own channel and then auto-build your packages and then deliver that as a substitute? Yeah. Like an end-to-end flow? Yeah. So we already do that in my guix-forge instance. And we also have... So there is the guix-bioinformatics channel which Pjotr runs. And we already do that for all the packages in guix-bioinformatics. For example, here you see names of many packages. Some of them build, some of them fail. And I think using Laminar and guix-forge is simpler than something as complicated as Guix's Cuirass CI. And I really don't want to be running Postgres just to provide substitutes for my channel. So we have a replacement here for many things, right? Yeah. Including GitHub CI. We don't use GitHub CI anymore. Yeah, we don't use GitHub CI anymore. Alright. Thank you.
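The "substitutes for my channel" setup can be sketched with stock Guix commands; the package name, host, and port below are placeholders, not details from the talk.

```shell
# One-time setup: generate the signing key pair that guix publish uses.
sudo guix archive --generate-key

# Build the packages you want to offer substitutes for (a CI job would
# typically redo this step on every commit to the channel).
guix build hello

# Serve the store contents over HTTP. Clients that authorize this
# server's public key can then fetch binaries with
#   --substitute-urls=http://this.host:8080   (hypothetical hostname)
guix publish --port=8080
```

This avoids any database: the store itself is the only state being served, in line with the talk's "dumb state" philosophy.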
Open source leadership at scale, how 1300+ people improved Drupal’s multilingual features
Keeping on, keeping on schedule and with one microphone per person. Gábor Hojtsy, a very old friend of mine and an open source original. Drupal core maintainer several times. And now talking about a really powerful contribution project in the Drupal community. Roughly right? Thank you, Jim. Yeah. Great. Hi everybody. Thanks for coming. I think it's going to be interesting for everyone, hopefully, to some degree. Because this is about open source leadership at scale. As Jim said, I'm Gábor Hojtsy. My own made-up title is Full Stack Community Organizer. Which means that I can put on an event for you, I can manage social media, design graphics, do a keynote, build developer tools, write marketing copy, and basically everything in between. So whatever is needed at the time. I've also been working with Drupal since 2003. Much like Mathias with TYPO3 since 2003. Just picked a different system. But it's around the same time, and I'm a Drupal core committer and did a bunch of stuff that helped in getting here where I am now. But I'm more interested right now in where you are coming from. Who's using Drupal for anything in the room? Alright. Some of you have no idea what Drupal is, just here because the title was nice. Okay, I won't explain Drupal that much. That's great. Who of you consider yourselves primarily developers? Okay, great. Nice. So those were the main questions that I wanted to ask so I can direct the talk properly for the audience. So I got into open source from open content. In the 1990s I went to high school. And the high school got dial-in modems and we got on the internet. And I was really interested in how we can publish stuff on the internet. How we can put something on the internet. And I decided to be the lazy teacher that reads five pages ahead and then teaches everybody else what they learned.
So I started looking at how this is done and started to go into documentation, and started to translate the W3C standards into Hungarian. And then the PHP documentation into Hungarian. And then distributing that, and then starting to look for news and articles, and summarize and translate those into Hungarian and publish them for the Hungarian community. And so it basically turned out to be a thing where I needed to set up a website to publish all of these things. And I got together with a person in Vienna, Shandor Shromkuti, who I've never met in person in my life, ever since. But we work together very well online. And we created this website called Weblabor that was hosting these things. And I went on a side quest with the PHP community: I became the lead of the PHP documentation and the lead of the PHP.net website in the beginning of the 2000s. And I was growing this Hungarian community website as well. But the Hungarian website grew so much that we needed some kind of system to manage the community, to publish these things, manage the forums that we had and the meetups that we had. And so we needed to have some system that managed this. And that's where I found Drupal. And Drupal was tiny and nice. And this was the whole Drupal conference in 2005, all the attendees. There's a certain person sitting there. So it was a tiny community that was very tight-knit. And we would get together. The software was managed through a mailing list where you would post a patch to the mailing list. And it was reviewed on the mailing list and then committed to CVS. So it was very tight-knit and everything was reviewed by these few people. And so it was very easy to join. And I needed it for a Hungarian website. So my main problems that I would go in and fix were usually about translatability into Hungarian. Or I wanted to have the path aliases in Hungarian. I wanted to have everything in Hungarian. And I always bumped into bugs, and I submitted bugs and they got fixed.
Or I fixed them, and then they got committed. So that was basically the natural way to get into the community. Small was easy to approach, and they would receive those fixes very well. Fast forward ten years later: this was the Drupal conference ten years later. There are people up there as well; they're hard to notice. So this is DrupalCon. So it's kind of hard to get started in this community. Like, who do you walk up to saying, I have this bug, please work with me on fixing this bug? It doesn't work. There's no way to do that. It's like walking up to people on the street and trying to convince them of something. When I got here, all of the buses... I think bus 71 was always full. Like, I waited for two or three buses. All of them were full. So I decided to call an Uber yesterday. And I was by myself in the Uber. So I walked up to the people waiting for the bus. Like, do you want to come with me? And they were like, no. Who are you? That's the kind of feeling: you walk up to someone and, like, no, who are you? Why are you approaching me? So I was alone in the Uber. But when you have this tight small community, it's much easier to work with them. So when we got to this point, it started to get very hard to manage what people are working on and organize that and motivate that. We went on pretty long without more structural organization. But around this time, the project lead, Dries Buytaert, decided to set up initiatives. And so the initiatives could get back to this tight-knit small feeling, and they could sit together and know each other and have a sense of community at this much smaller scale. And they could work together very well. And when this started, I was approached to work on the multilingual initiative, because even 10 years after I joined, multilingual was still a problem space that had a lot of problems to be solved. And so I was happy to accept that.
And so I started working on the multilingual initiative. And everything was rosy and happy and I started working on things. And then a bit later, something bad happened for me. At least I considered it super bad at the time: another initiative was announced. The Views in Core initiative was announced. And so multilingual was, especially in Europe, pretty important. But Views in Drupal is basically the... so if you don't know Drupal in that detail, then Views is basically a query builder based on Drupal's very rich structured data. And it's also an output generator based on the query. And you can choose how the output is generated. And it can generate APIs and REST endpoints and lists and sliders and all kinds of things. So basically it's a query builder and an output generator. And Views was four times more popular than any of the multilingual modules. And they got funding from Sony and other companies. So they had money. They were four times more popular. And they started to steal some of the people that were working on the multilingual initiative. So I felt totally betrayed and I was super angry. And I wrote this email to project leadership and core committers that this will jeopardize my initiative. It will make my work super hard to do because they're going to steal the thunder, and it will be very hard to do this going forward. And now I'm here talking about how successful it was. So we sort of resolved this. But I was super angry and very jealous and also felt betrayed. And I think what was interesting is I didn't get responses to my feelings there. I did get responses to the facts that I stated, and they were refuted. But my feelings were not contested. And I think what I realized after a while, after I had time to think about this, is that the problem I had is I was thinking of Drupal as this small pie that we are eating from.
And if everybody's eating from the same pie, then it's kind of over after a while and you don't really have more to eat. So if you steal my people, then I don't have people and I'm not going to have people. And so I think that was the key understanding that I had: I needed to think about how to grow this pie. And even though we had all of those thousands of people at the conference, I think we still didn't have a good grasp on how to involve new contributors very well and how to make them successful, which was even more important. And my other problem was that I didn't have money. And this realization that I need to grow the pie didn't make me have more money. Some of the companies that were involved in the multilingual initiative had money and they were investing into sponsoring some of their people. But I didn't have money on the scale of 1,300 people. Like, that was not possible to achieve. So I needed to figure out something else. And so what I started to look at is how to make people happy. Because they would come here if there's something in it for them. They would join us if there's something in it for them. And I read a bunch of stuff, and some of this clicked together afterwards and it provides a great structure for this talk. But some of this I basically figured out on the go. And so I think the best structure for this talk is these three words. This is from Dan Pink's book called Drive, which is one of the three books that I suggest you read on this topic. So, Dan Pink, Drive. And he highlights that people like working on things when they have autonomy. So they decide for themselves. They decide how they solve problems, who they solve problems with, how they move forward, etc. People thrive if they have mastery, so they can get better at things. They improve. They can try new things and improve in them and get challenged. And they thrive when there's a purpose to what they are working on.
And so if we can figure out... if we can crack the code on those three things, then it works really well. And I think we cracked the code in the multilingual initiative, and this is how we did it. So I think the purpose is sort of easy, at least for the people that were involved in my initiative. They were primarily in Europe but also somewhat in Canada and somewhat in the Northern US. And they had personal needs for multilingual. So obviously they had the purpose of solving their own problems. But there was also some higher purpose. Like, if you just look at where Drupal is used: UNESCO uses multilingual Drupal to help with education and children and refugees and stuff like that. CERN uses multilingual Drupal to advance science. Tesla is using multilingual Drupal to promote their technology, and you can configure your car through Drupal on the Tesla.com website. Rathetti is using Drupal extensively and they invest money back in open source as well. While we are in Brussels, it's hard to avoid the European Commission. It's using Drupal super extensively. This is in Hungarian, europa.eu. But they have 300 websites that are in Drupal. Most of them are multilingual, obviously; in Europe it's hard to do anything without that. And they have more than 100 people on staff, developers on staff, that are working on their Drupal websites. So it's super extensive. But I mean, these companies can pay their way to solve their problems. If you have 100 developers, even if multilingual is hard, you can solve that. If you're Tesla, even if multilingual is hard, you can solve that. So that's not really what gave me purpose. What gave me purpose is that my high school's website, where I started working on open content, is running on Drupal. Totally accidentally, as I was not involved. Totally randomly. So this is the high school I went to. So you can make a Hungarian Drupal website that's fully Hungarian and works very nicely. It's not multilingual, but it's Hungarian. So that gives me purpose.
If we can make it work in a way that the little websites can do it very easily, then we succeeded. So that was my purpose in here. The autonomy part, I think, is much harder to solve if you come from a traditional open source developer background. Because I think many people that start open source projects are great developers. They have this idea of what they want to do. They have this architecture in mind, they know how to get there and the steps to get there, and they are building it. And they want to have people along for the ride, but they don't want people to tell them what the architecture should be and what the steps should be to implement it, et cetera. So to give autonomy, you need to agree or understand that you need to agree on the high-level goals and get rid of your idea of micromanaging anything below that. So you need to be comfortable with the idea that you define these high-level goals and it's up to the team to figure out the rest. And maybe it's not the same architecture that you wanted. Maybe it's not going to be exactly on the timeline that you expected or the steps that you expected, but other people will implement it. If you share the same goal, there's going to be shared ownership and they will implement it. So I think this is hard. This is one of the things that I've been trying to mentor initiative leads on in the Drupal community ever since, because it's very hard to come from a developer background, have an idea of how this should be done, and then give up that idea and work on organizing the whole thing instead of implementing the whole thing. But to achieve some scale, you need to give that up. The next one, especially in the Drupal community, is to set up space. Because when you have these big thousands of people at the conference, there is no identity, there's no space, there's no feeling of community for the team that you have unless you set up the space.
So for example, what we did here is we have a chat room. This used to be on IRC; now it's a different chat system that is shared in the team. We use chat meetings and threads so that it's easy to get involved with multiple language backgrounds. It's much harder to follow live audio meetings and video meetings when it's not your native language and it's very fast. Chat meetings are much easier to follow. We had this identity that was created by one of the sponsor companies: the logo of the multilingual initiative. We had stickers of this, we had t-shirts of this, etc. When we went to events, we had tables where we set up a big sign that this is the multilingual initiative, so people came in, they would recognize us, they would join us. We were always there in the morning, by the way, that's a good trick, so that we were the default choice in the contribution room. When people came in in the morning, they were like, oh, multilingual initiative, great. So that allowed us to have this sense of small community that we need to achieve in this big community, to have a sense of belonging and a sense of connection, so that people stay and have those personal connections that otherwise are not possible in this big community. We also had our own website, which you may or may not need, but it was nice to have our goals set out there. And we basically pulled issues from the issue queue and used tagging and labeling on issues to prioritize them and then display them nicely, so we didn't need to do a bunch of work manually on the website itself. Now, once you have people, I think the next important thing is to set up buddies for things. At least in the Drupal community, there are always at least three people that you need for an issue to be committed. There's somebody that works on the fix, there's somebody that reviews the fix, and then somebody that commits the fix.
So if you need three people to work on an issue, you need to set up those three people to be successful. That's not going to happen accidentally. If you walk into this keynote room saying "I have an issue", nobody is going to listen to you. So what we did is, when new people came in and they were like, I want to help, we always assigned them to something that somebody else was already working on. Because then they had a buddy that was already invested in the issue that they came in to help with, so there was already a shared understanding between those two people that they want to solve this problem. And once we had that, we had these buddies so that if one of them went away, we still had a solution for how to move this along. There was still one person left who could introduce the problem to the next person, serving as a successor. So it was pretty useful for keeping things going, because stuff happens to people. Like, all the main people that I had in the initiative, something happened to them, and it was always useful to have buddies that shared the same goals. And that was basically the only way to get stuff done in the Drupal community anyway. So I think that was pretty much a key to our success. And the next thing that I realized is we need to praise the smallest of results. Because people don't really recognize that they are going towards a goal and they are achieving something towards the goal unless you point that out. And often people forget, after a week or so, that they did something great, so it's good to get back to that. And in the meetings, we always had a section where we were praising the results from the previous meeting and figuring out who did those things and calling them out as well. And the other thing that's super important, I think, is to praise the people that go away. Because when they go away, they probably already burnt out two or three months before; they just didn't realize it.
And it's good that they went away. It's good for them because they need the break. And it's good for you because they're not going to be here and, like, maybe have negative effects on the team. And it's good for you because if you are praising that they need this break now, it shows the team that they don't need to overwork here and they don't need to kill themselves for this project. We'll figure this out. And the person that you celebrated for taking the break may actually come back after they took the break, if you've been kind through this process. So there's no other option, I think; it's the win-win-win-win-win to praise people that go away, because it's the best for everybody. So if you do these things, you have those buddies, you have a small tight-knit community, even inside the bigger community. You have this space. You give them autonomy to work in their own ways. You just share the high-level goals. And then you have this shared ownership of things. And maybe it's not going to be implemented the same way, maybe it's not going to be implemented by the same people you started with or on the timeline you wanted, but it's going to have shared ownership. And that was kind of useful for me when I had a problem. So a couple of years into this initiative, it was a long initiative, I had breakfast with my wife and she started having very strong stomach pain that didn't end. And so we stopped our breakfast there and we went to the emergency room. And they figured out that her blood results were getting worse and worse, but there was no blood to be seen anywhere. So they figured out that it was internal bleeding and she was about to die in a couple of hours if not operated on immediately. And so they assembled a team of doctors that would operate on her that night, and they saved her life. But she lost one ovary on the way. But she now lives on, and we still remember this day. And at the same time, DrupalCon Austin was happening.
I was supposed to be there and do all of this magic with the multilingual initiative. And I was obviously not going to travel to DrupalCon Austin when my wife was recovering from a life-saving operation. So because we had this shared ownership and shared understanding of the initiative, all the stuff that we were planning for the multilingual initiative happened in Austin. They sent us flowers and cards and well wishes, and they sent us this photo of some of the people on the Contribution Day to wish us well. But this was because we built this initiative to do it together. And so mastery is the final one, which is probably the most interesting thing, I think. Because people want to get better, and you want to have people on your open source project. And so the question is, what is it that needs doing on your open source project that they may want to get better at? So that's what we are looking for. So one of the things that I've been doing at events is therapy sessions, because multilingual used to be very painful in Drupal and people had pain. And so I set up a multilingual therapy BoF, is what it was called, on the schedule. And I would sit back and I was like, do you want to talk about it? And they wanted to talk about it. And so what this was great for is, A, I got in the users that had pain about multilingual, so I could have a requirements list of what I want to solve in the multilingual initiative. They got to talk about their pain, so they felt heard. The people that were contributing on the initiative came in to the BoF and they felt like they were the experts, because they could give advice to the people that had the pain. So I was basically sitting there; I didn't do anything at this BoF. I said, do you want to talk about this? And the experts came in from the initiative naturally, and the people with the pain came in, and I just sat there and I enjoyed it. So that's the investment.
So the experts, basically the people working on the initiative, came in and they gave advice to the people with the pain. And we got to show at the BoF: this is what we're working on, this is how it's going to make your life easier. We feel your pain. Yes, it is something that's hard right now, but this is how we are solving it. And so we could build that feedback into what we were working on. We could review the solutions that we had with the people with the pain. Does this solve your pain or not? So it was very good to get direct feedback, it was very good to have them listened to, and it was very good to provide visibility to the people that were contributing and get professional recognition for them, sometimes business, because they were giving advice to clients that were showing up in the room and they might get a business relationship after the BoF. So it was great. So I think it was important to acknowledge that multilingual was a pain and to provide this space in person as well. The next thing we did was radical openness about how we organized this initiative. So we created an open source slideshow, for example, that anybody could present anywhere. And people translated this slideshow into multiple languages and presented it in Japan and Poland and France and brought it to companies and a lot of places. So we just gave this slideshow away and we didn't ask for anything in return. And this brought the news of the initiative far and wide across the globe, everywhere, that this is happening, and made people excited. And it also gave the people who had not delivered sessions before the opportunity to do so; they didn't need to build a compelling slide deck or anything. And this was useful for them as well. We made a Drupal distribution which had a demo of how this multilingual thing would work. It had demo content and demo menus and a bunch of features set up so that people could try out how they can do it. And they could try out how this would work and test this out, and we could get feedback.
We created a two-hour workshop with a 23-page handout that detailed the steps of how you get to build this distribution, basically. How you build out a multilingual menu, how you build out a multilingual content structure, etc. Super detailed. That's why it was 23 pages. It was like: click here, write this, click here, write this, in detail. So this was very useful for people to do these workshops and teach people how to use the multilingual Drupal system before it was even done. Like, we were already training people on multilingual Drupal before we were done. And with the help of Acquia we created a user testing script that could be crowdsourced, so that people could do user testing at their meetups, at their local events, and record them and publish their results, and we could aggregate the results and use that to inform how we are doing and where we need to improve the user interface or the flows or how all the things are connected. And so I've been doing a bunch of research and reading in the meantime and read a bunch of interesting tricks on how to involve more people. So this is one of them, car wash loyalty. So there was one interesting story about car wash companies. They want to have people come back to wash their cars. And so they did an experiment where they had a car wash loyalty card with eight slots that you could stamp in and then get a free wash at the end. And they did another card that had ten slots, but two were already stamped in. It's the same eight slots, but there were two more that were already stamped in. How much better did this one work for people? What do you think? This worked twice as well. So the ten slots with two already stamped in worked twice as well at getting people to ten stamps as the card with eight empty slots. They had the same exact number of empty slots. But this one told you that you are already on your way to achieving your goal. So it was like you had already started.
You had two stamps even though you didn't do any car wash. It was just two stamps, and the first stamp you earned was the third stamp on the card. And the people that had this card got there faster as well. Not just twice as many people got there, but they got there faster. So I had to translate this to open source contribution. So one thing that I did is I wrote blog posts about how Drupal multilingual was going. And I broke down the initiative into, I think, 18 posts or so. Like, this is what we do for multilingual installation, this is what we do for interface translation, etc. And at the end of each post I had a section of: by the way, this doesn't exactly work well yet, and these are the issues that you can get involved with. So people read about the exciting thing coming up. They got informed. And at the end they got roped into helping with solving the problems, because they already felt like we are getting this great solution and it's almost there, I just need to help with this one thing. That helped a lot. So at the end we got 1,300 people involved, and that included people from companies like NBC Universal and Pfizer and Carrefour and the University of Waterloo, University of Iowa, Biologist, Genetic Information Management, McGill University, Johnson & Johnson, Ticketmaster, Google Summer of Code, Google Code-in, and you name it. So all of those sources had people that were involved. So this is the list of people. Too fast. Wanted to spot yourself? So there's a lot of people. And so basically all it took was for me to understand that this is not a fixed pie, that we need to look at how we grow this pie. We need to figure out what's in it for people to come in here and grow and be involved. And for me to figure out that I need to give people the autonomy in this project to figure out how they're going to solve this problem.
So they have a shared ownership of solving this issue, for them to have ways to get better at things, for them to master their craft, for them to improve on their own terms, and for us to have a shared purpose on why we are doing this. So if you want to read a lot more about all of these things, these are the three top books that I would suggest in this area. So, David Marquet's Turn the Ship Around! This is great for handing off autonomy. He's a nuclear submarine captain that was training for one type of submarine for two years and then got reassigned within one week to another type of submarine that he had no idea how to operate. And so he needed to figure out how to give autonomy to the crew. It's a great book. Dan Pink's Drive is about this whole structure of autonomy, mastery, and purpose. And Switch from Chip and Dan Heath is a lot of great stories and solutions and tips about how you make people do things that they probably wanted, but you need to convince them. So the car wash story comes from there. There are no software stories in there, by the way, nothing. But there are a lot of stories about what people do about glove ordering in the hospital or kids' cancer treatment or a bunch of other things. There are a lot of great stories in there that you can apply in one way or another to open source as well. So that was my talk. Any questions? You've left us speechless. All right. When does your book come out? I don't have one of my own. Thank you. All right. Oh, there we go. Yes. So the question was that Drupal has this challenge in all kinds of other areas as well, and whether this is the 10x or 100x to apply to all kinds of other topics. I think so, yes. So I think we've seen some of the recent initiatives that people were really driven to implement that had similar approaches; the single directory components initiative, for example, had a very similar approach at a smaller scale.
So I think we could apply a lot of these to other initiatives. We've been trying to mentor initiative leads on these ideas. And we've been successful in some of these ways, like how we involve people from events in initiatives. That's been a track that we've been really successful in working through. But there's definitely a lot more that could be applied from here. Yes. Please submit a proposal to the TYPO3 Developer Days next. All right, please. I was invited to submit a proposal for the TYPO3 Developer Days. When and where is it? It's the first, second and third of August, and it's in Karlsruhe in Germany. The first three days of August in Karlsruhe, Germany. That's for the camera: TYPO3 Developer Days. For you? Yeah, me too, but for the listeners as well. Great. Yeah, thank you everyone. Have a nice day. Thank you.
Releasing a Linux based OS: an overview of Flatcar release cycle
All right, everyone. Welcome to the next session. Just the usual housekeeping: if you're leaving a little bit early, these rows are fairly long, so try not to do exactly that. There are going to be some good sessions here, and we'll have some time for questions at the end. So let's get started. Right. Thank you. Hi, everyone. I'm super excited to be here with you today to talk about Flatcar, to talk about releasing a Linux-based OS. And I hope you will learn new things. I hope you will discover things. And yeah, if you have any questions, I'll be around for the rest of the day. And I'll be available at the end of this presentation to answer your questions. Before going further, I will quickly introduce myself. So my name is Mathieu. I work as a software engineer at Microsoft. I'm mainly and principally involved in Flatcar development and every feature regarding Flatcar. So for example, I'm involved in the Cluster API support. I'm involved in testing the operating system, building the operating system. And, what matters today, releasing the operating system. If you are here at this talk, I assume it's because you maybe have some knowledge about Flatcar, you're already a user of Flatcar, or you just want to discover it and you're curious about this operating system. So let's have a quick look at what Flatcar is. So Flatcar is a Linux-based operating system. It's designed to run containers. So you only have the bare minimum in your operating system to run containers. The goal is: the fewer packages you ship in the operating system, the smaller the attack surface of your operating system. So that's the point of having this. This operating system benefits from automatic updates, which means once you've deployed your instance of Flatcar, it will get automatic updates from the release server, and a release is done approximately every two weeks. So you can be sure to have a new version of Flatcar every two weeks.
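The automatic updates come from update_engine, the updater Flatcar inherited from CoreOS. Assuming that stock client is present on the node, the update state can also be inspected and triggered by hand; these flags are from the CoreOS-lineage tooling, not quoted from the talk.

```shell
# Show the current state of the updater on a running node
# (idle, downloading, or waiting for a reboot into the new image).
update_engine_client -status

# Ask for an update check right now instead of waiting
# for the periodic poll against the release server.
update_engine_client -update
```

In normal operation none of this is needed: the node polls the release server on its own, which is the point of the automatic-update design.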
And finally, this system is immutable, which means /usr is mounted read-only. You can't write anything in /usr, and you can't install any package: there is no package manager, no apt or whatever. Those are a few differences from a day-to-day operating system. Flatcar is designed to run containers and nothing more. So, just to show you inside the box: I tried to write something in /usr, and it doesn't work because it's read-only. That's normal, even with sudo. And if you try to use a package manager, the command is not found, for each one of them, because that's the goal. The idea is that you have to trust the maintainers and what they ship inside the OS. If you need something more, you can ask the maintainers or the community, or you can try to find another way to install those packages. So how do we maintain the system? Because you can't update or install packages yourself, you have to trust the maintainers and the community. On GitHub, this is the QR code to the GitHub repository, which leads to the list of packages. Basically, we are security-driven, which means each time there is a new CVE, a new issue with one of the packages shipped by Flatcar, we track it in this repository and we update the package. For example, last week we got the runc and Docker CVEs that were made public; they're already tracked, and when we release the next Flatcar, hopefully this week, you should get those issues closed. So the packages are updated on a security-driven basis and also on a community-driven basis, which means if one of you wants a new package in Flatcar, you can just open an issue: hey, I'd like to have this package in Flatcar, is that possible? And if it's relevant for the community, if people are OK with having this new package in Flatcar, there is some chance that it gets included in the next Flatcar release.
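The "inside the box" demonstration boils down to checking mount options. A small Python sketch of that check, parsing /proc/mounts-style text (the sample lines below are illustrative, not captured from a real Flatcar node):

```python
def mount_options(mounts_text: str, mount_point: str) -> set:
    """Return the mount options for a mount point.

    `mounts_text` has the /proc/mounts format: each line is
    "device mountpoint fstype options dump pass".
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            return set(fields[3].split(","))
    return set()

# Illustrative excerpt as it might look on an immutable system:
sample = """\
/dev/mapper/usr /usr ext4 ro,relatime 0 0
/dev/sda9 / ext4 rw,relatime 0 0
"""

assert "ro" in mount_options(sample, "/usr")  # /usr is read-only
assert "rw" in mount_options(sample, "/")     # / stays writable
```

On a real node you would read the text from `/proc/mounts` instead of the sample string.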
Most of the time we try to challenge people: can you use a Docker image instead of this package, or can you just download the binary at instance boot to get your software? We always challenge with the same goal of having fewer packages in the operating system, because the fewer packages you have, the fewer vulnerabilities you have in your operating system. If you want to know what's going on in the next Flatcar release, you can join the office hours, which are held publicly every month; the next one is in February. During the office hours, we go through the Flatcar release board and check which new packages will be included in the next release, so you can give your opinion and your input on which packages should be prioritized or not. It's always a great time for maintainers and the community to discuss the content of the next release. The release board is available publicly on GitHub. And of course we ship new packages and package updates, but also bug fixes, changes, and new features in the operating system. Now we are ready to release. But before releasing, I would like to demystify the Flatcar release number a bit, because we've seen quite a few times that people get confused by the Flatcar version number. So this is a Flatcar version number, and the idea is like semver versioning, but not really. The first number, for example 3760, is the number of days since the first CoreOS release, because Flatcar was initially a friendly fork of CoreOS; 3,760 days is a bit more than 10 years since that first release. The second digit is the promotion level: are we talking about an alpha release, a beta release, a stable release, or an LTS? And finally we have the patch, or maintenance level, which is the last digit.
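The numbering scheme just described can be sketched in a few lines. Note the CoreOS epoch date of 2013-07-01 is an assumption here; the talk only says the major number counts days since the first CoreOS release, and the channel mapping is inferred from the examples:

```python
from datetime import date, timedelta

# Promotion levels as described in the talk (mapping inferred
# from the examples: 2 = stable, 0 = alpha, 3 = LTS).
CHANNELS = {0: "alpha", 1: "beta", 2: "stable", 3: "lts"}

# Assumed epoch of the first CoreOS release.
COREOS_EPOCH = date(2013, 7, 1)

def describe(version: str) -> dict:
    """Decode a Flatcar-style version string like '3760.2.0'."""
    major, promo, patch = (int(x) for x in version.split("."))
    return {
        "channel": CHANNELS[promo],
        "is_new_major": patch == 0,  # zero patch means a new major release
        "approx_build_date": COREOS_EPOCH + timedelta(days=major),
    }

assert describe("3760.2.0")["channel"] == "stable"
assert describe("3760.2.0")["is_new_major"]
assert describe("3033.3.18")["channel"] == "lts"
assert not describe("3033.3.18")["is_new_major"]
```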
So if you have a zero at the end, it means it's a new major release, because there is no patch yet for this release number. Based on this, we can play a small game and try to identify who is who. For example, the first one, 3760.2.0, is a new major stable: the zero at the end means it's a new major release, the two means stable, and the first number just shows how many days since the first CoreOS release. But based on this, who can tell what the next one, 3850.0.0, is? Is it an alpha release? Yeah. Is it a patch release? No. It's a new major alpha. And the last one, 3033.3.18, what does it mean? LTS. LTS, yeah. And it's a fairly old LTS, because there are a bunch of patch releases. Patch releases for the LTS basically mean kernel updates: each time there is a kernel update for the LTS, for example, we update the kernel, the CA certificates, and critical security issues like OpenSSL, but that's it. Most of the time it's just a kernel patch release, which is why you see a big last number on the LTS. So I mentioned alpha, beta, stable. What does that look like over time? We have a new major alpha every two weeks. Then, from time to time, we decide to promote an alpha to a beta; that's what happens once in this example. After a while, that beta version becomes a stable one, and eventually it may become an LTS one. That's quite interesting because, as a user, if you run a stable Flatcar release, it has already been in beta for a few months before landing in stable. That's also why we encourage people to run beta nodes in their workloads, so they can identify any issues with their workload before it gets into stable. So that's the release cycle. Now, what's the release process and how does it work? Most of the time it's done in four days.
We never release on a Friday, because that's a well-known rule that we don't want to break at Flatcar either. But basically, on Monday we start building the new Flatcar releases: we kick off the builds for the new alpha, new beta, new stable, and, normally, the new LTS. That's done on Monday. On Tuesday, we check the status of the builds: is the CI OK? Have the images been built successfully? We have a checklist of things to check. And we start drafting and preparing the release notes, because that's quite important when you have a new release: you want to communicate to people that there is something new and what's inside this new release. So on Tuesday we start drafting the release notes, and on Wednesday we have the go/no-go meeting. This is a meeting held publicly on a Matrix channel where we discuss whether we should actually go forward with the release: are we in good shape for a release, and can we move forward? We basically check that everything is green in the CI and the release notes are correctly prepared, and we decide to go or not to go with the release. Then we have the actual release, which means we take the new images, publish them on the Flatcar release servers, and generate the new update payloads, because, as I said, Flatcar gets automatic updates: we need to generate the payloads that get downloaded by currently running instances. Then we have the announcement; as I said, it's important to communicate that there is a new Flatcar release. And on Thursday we have the marketplace releases, because Flatcar is supported on multiple vendors, AWS, GCP, Azure, and we want to publish the Flatcar images on those marketplaces. Looking at the release process for Monday: one of the Flatcar engineers starts the builds and publishes the links.
Then on Tuesday, we start preparing the release notes. This is, for example, for the last stable, and there are some notes: for example, a flaky test with the Calico CNI on DigitalOcean. We try to identify: is it our fault, because of the test framework, or is it something really critical? Sometimes we have to stop the release because of an issue with the new kernel that was identified by the test framework. That's the kind of note we take during the release process. After that, we have the go/no-go meeting. As I said, it's done on Matrix, and everyone, contributors and maintainers, is invited to say go or no-go for the release. It's a ping in the channel to all members, and people can give feedback on the release status before we decide to move forward with the release. And when it's done, we actually have the release: it's available on the public website, and we communicate on Slack, on Matrix, on Mastodon that there is a new release available, so please update and give feedback on the release. Finally, we have the marketplace updates; this is an example of an AWS update on the marketplace. What's interesting with this process is that the community is involved at every point, always. Nothing is done in secret or whatever. At any time you can give your input, at any time you can see the status of the release: are we close to being done, or still far away? For example, the checklist of all the release items is in public GitHub issues, so you can easily see where we are in the release process. The release notes are drafted in a HackMD document, so you can browse them and send comments. And the public discussions are always on Matrix, for the Flatcar release process but also for Flatcar development in general.
So every decision regarding Flatcar is made publicly on Matrix; there are, as I said, no secret discussions. The only thing that is still private is the build, for now, because we still have credentials for the various cloud providers. Ideally, we would like to have it run in the open rather than on Jenkins, so people can just see the build logs and see how things are going, but that's not done yet. What we have done is that now, if you open a pull request against a Flatcar repository, it will start a build as a GitHub action, so you can see the logs and whether something goes wrong or the CI is OK. It just builds a QEMU image and runs the tests on that QEMU image. For the release itself, it still relies on Jenkins, but eventually we'd go public using GitHub actions. And I think that closes the talk. If you have any questions, I'll be around with some Flatcar team members for the rest of the day. And thanks for your attention for this Sunday afternoon session. All right. First question. Great. What's the elevator pitch for using Flatcar over Fedora's offering or MicroOS from SUSE? Well, MicroOS and other such operating systems are quite similar, but Flatcar is multi-vendor, for example: you can use it on-premises, on bare metal, on different cloud providers. Also, there are new features we try to merge into the Flatcar operating system, for example systemd-sysext and other things we try to leverage. And we try to do things upstream first; we had this talk about upstream versus downstream before this one, and that's the idea. Each time there is a new feature, we try to implement it upstream first before trying to solve it downstream. We try to be on the community side and fix things upstream, not downstream.
And then, speaking of fundamental differences with MicroOS, for example: you don't have the same mechanism to provision the instance. With Flatcar, we use Ignition and Afterburn intensively, which are not yet available, or experimental, on MicroOS. That's the kind of difference you can see. And if I recall correctly, I'm not sure if MicroOS is using REST 3, but that's the kind of functional feature where you can see the difference. In the end, though, it's the same purpose of operating system: to give the user an operating system to run containers. That's it. And as it's open source, you have the choice of which solution you want to use. Thank you. Feel free to comment on this. How much has changed, or has it been noticeable, since Microsoft took over with the acquisition? Thanks for asking the question. The short story: Flatcar was initially developed by Kinvolk, a company that was acquired by Microsoft two or three years ago. And I'd say it hasn't changed a thing for the development so far. The governance has always been on the community side, community-driven. I'd say it's even better in a way, because now we can be totally dedicated to this operating system and its support. And recently, a few months ago, six months or so, we started to look into CNCF incubation: basically, we would like Flatcar to find a new home at the CNCF. There is an open issue on the CNCF tracker, so you can see the status of the incubation proposal. But in terms of governance, nothing has changed, and we're still dedicated to giving users the best Flatcar experience on any cloud provider. Thank you. Other questions? Yeah. Mathieu, thanks for the talk and for the distribution, the idea, everything. I'm not familiar with the project, so I'm attending just to understand what's going on. So everything is a container, right? All tools and everything are running as containers.
But I'm curious how the kernel is booted, or how the initrd is done. Is that part a container or not? I don't think so. So: Flatcar is not running inside a container. Flatcar is an operating system, like Ubuntu, like Debian, like whatever. It's designed to run container workloads, so you have the bare minimum to run container workloads: you have a container runtime, you have the kernel modules, and so on. In the end, it's like any other Linux distribution: you have your kernel, you have the boot process, you have the initramfs, and then you have the user space. Yeah, so if I understand correctly, the stuff that was previously managed as traditional packages is now containers, right? But if a new version of a kernel is released, how is that distributed, let's say? So, if there's a new version of some software, how is it distributed to the operating system; that's the question. Yeah, you just wait for the new Flatcar release, because it's immutable. If there is a new OpenSSL version, for example, you have to wait for the next Flatcar release to ship that new OpenSSL version; that's how you get the update. So it's pulled from the Internet, not in the format of a package, right? Sorry, come again? Is it in the format of a package, or pulled straight from the Internet? Well, Flatcar is based on Gentoo Linux. When you build Flatcar, you take the sources from the various repositories using the Gentoo mechanisms, then you build the packages. Once the packages are built, they are included in the image, which is the new Flatcar. Then the new Flatcar is released, and this is how you benefit from the software update. Okay. So my question is also, let's say, not only technical but more on the political side. The history is that this is a fork of CoreOS, where Kinvolk started this, right? Then it was bought by Microsoft, but Microsoft has its own, CBL-Mariner.
So how does this fit, and is the essential point that this OS has to be used in the cloud, right? Yeah, thanks for the question. CBL-Mariner is dedicated to running on Azure, while Flatcar is dedicated to running everywhere, and it's not mandatory to run Flatcar on a cloud provider. As I said, you can run your own Flatcar image on a Raspberry Pi, on ARM64 at home if you want to. Or if you have your own, I don't know, Proxmox; we have some people who use Proxmox to run Flatcar. So Flatcar is really multi-vendor and multi-architecture, while CBL-Mariner is dedicated to Azure and nothing else at the moment. Hi. In my previous role, we used Flatcar quite a bit for a while, but then we ran into some trouble with AI, especially around things like trying to use InfiniBand; we ran into problems getting everything set up with Flatcar. Are you working more towards AI workloads and making those easier to run on Flatcar? So, I'm not at all an AI expert. Maybe Remy, behind you, who is a Flatcar member... I'm also a Flatcar maintainer, and I've been looking at NVIDIA and GPU support in the past. We want to get better at that. It would be great if someone from the community would also help, because I have limited cycles, but it's something I, and we, care about. Just one last question, if no one else has any: do you support different container runtimes? At the moment, we only ship containerd. But basically, in a non-official way, you can use Podman using systemd-sysext, which is a systemd feature that allows you to mount overlay images on top of the base system. So we ship a Podman systemd-sysext in an unofficial way: you can just pull this Podman extension, load it on the system, and have Podman up and running. There is a tracking issue to have this out of the box, of course,
so that you don't need to provision and pull Podman's system extension yourself. Ideally, we should be able to say: if you want containerd and Docker, use this configuration; if you just want Podman, use that one. But yeah, you can use Podman; I actually did some experiments with it and it works. Cool, thank you. All right, I think we have time for one final question if someone's up for it. All right, looks like there's no question. Thank you very much.
DNS for I2P: Distributed Network without Central Authority
Okay, let's do it like this. Thank you very much, Peter, for all the efforts for the I2P devroom. And by the way, do you hear me back there? Yes, lovely. Okay, right. I hope the sound check is good now and we're not muted anymore. I'm one of the I2P guys, and I'm talking about fully distributed networks and their specific problems. Fully distributed means truly fully distributed, so today we're also talking about systems without any trust involved, at least in theory. All right, hands up please: who's familiar with and using I2P, or who's familiar with I2P? Yes. Oh, I love you guys, that's really awesome. One third of you are familiar with I2P, so I'll really rush through the I2P part. But then I'd like to talk about my depressing last 12 months, which gave me a really hard time implementing a persistent storage layer based on I2P, and I will tell you why: I will tell you about all my failures and problems, and yeah, I will complain a lot. And we're talking a bit about Byzantine fault tolerance and the good and the bad of the past year. Right. Diva: I'm working for Diva Exchange, but it's only an association based in Switzerland. I'm sometimes a lecturer at the Lucerne University of Applied Sciences, where I talk about microservices, fully distributed trustless systems, and stuff like that. But I'm singing nobody's song, so I'm really totally, completely independent, and so is Diva Exchange. We're not some coin guys or token guys, which doesn't mean that's bad; we're just not like that. So, hello I2P network. I2P is well known as a darknet, because the media talk about it as a darknet, which means, and we'll come back to this later, that it has something to do with confidentiality and anonymity. But in the end it's an overlay network.
So we have the existing internet, and on top of that we place software routers that pack the traffic into packages, repackage them, encrypt them, and send them over several hops and routers through the network. Like this we get confidential and anonymous message transport. I2P is no storage layer. Whatever you hear about the darknet, that there is content stored there and so on: that's not true. I2P is not able to store content by itself. There are storage mechanisms like the InterPlanetary File System, which is linked to Filecoin, and these are storage layers, but these storage layers do not necessarily feature confidential and anonymous transport; often they even fail at implementing such a layer. Six or seven months ago we made a study at Diva Exchange, and we were obviously interested in how big the latency of UDP package transport on the I2P network is. As you can see, it's slow, really slow. And that's the price for privacy: anonymity and confidentiality are not for free. There is a price attached, and this price tag in the I2P network is time. It's slow. Maybe, with a strongly increased number of routers, we could increase the bandwidth, but that's just a maybe, a theory we'd need to look into at the university; I don't know. We have to do scientific research on that. This is the current state. Now, a darknet, an overlay network like I2P, has cryptographic addresses. They are public keys, and often it's a hash of a public key. And these B32 addresses, long cryptographic strings like the one up here, are not human-friendly. I will not talk about the so-called triangle; at 6:30 this evening, in this room, you will have a presentation about this topic, which is for sure also highly interesting. But we have these hashes, and we need to map them to human-friendly names. That's the job we have to do in such a network, and that's the motivation: we need a DNS. But the only thing I2P really has is a local address book.
So each router, each node — there is nothing like a central authority. Each private node has its own lookup key-value store, called the address book. There you have a friendly name like diva.i2p linked to a hash, or to a B32 address, to simplify things. And if I'm loading somebody else's address book, that's a problem, because it's a trust operation, and within I2P we usually say we trust no one; it's trustless. So obviously we cannot just load address books from somewhere. Additionally, within the I2P network, if you look at the specifications and at how the network works today, we do have jump services, we do have kinds of registries, but all these services are again a delegation of trust, nothing we really want. And as you can see, ladies and gentlemen, I'm really critical towards the I2P network. I see the central components we have within this network and I criticize them — not in a negative manner; I'm rather trying to make myself, as a developer, and the other developers aware of these central components. Right. Now Goethe, in German: des Pudels Kern, the core of it all. Why are we doing this at Diva? Why am I doing this? I want to have a service, a storage service and hence a DNS service, which is (a) fully anonymous, (b) immutable, and (c) really barrier-free. And barrier-free is an interesting concept if you start to think about it. A coin, whatever it is, Filecoin, Namecoin, Monero, Bitcoin, Ethereum, I don't care, is not by definition barrier-free, because, well, you have to acquire it somehow. So there is a barrier, and barrier-free in the Diva Exchange sense means you have a very low hardware requirement to enter the network just to drop a name: a Raspberry Pi or any other low-power device, and ta-da, you're a member of the network and you can store stuff. And if the barrier is that low, by definition, the spam will be high.
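The local address book is essentially a key-value store mapping friendly names to destinations. A minimal sketch of such a lookup, using a hosts.txt-style "name=destination" line format (the destination strings are shortened placeholders, not real destinations):

```python
def parse_address_book(hosts_txt: str) -> dict:
    """Parse an address book: one "name=destination" entry per line."""
    book = {}
    for line in hosts_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, _, dest = line.partition("=")
        book[name] = dest
    return book

# Example local address book (destinations shortened for the example):
sample = """\
# my local, trusted-by-me-only address book
diva.i2p=shortened-destination-1
stats.i2p=shortened-destination-2
"""

book = parse_address_book(sample)
assert "diva.i2p" in book
assert book.get("unknown.i2p") is None  # no central fallback: unknown stays unknown
```

The point of the trustless model is that this file is local: nothing in the sketch fetches entries from anyone else.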
So we have to think about a cost function, but the question is what this cost function is going to look like; we'll discuss that in a minute. And trustless: again, I2P has been built, architected, engineered over the last 20 years as a trustless system. Trustless means I really only need to look at my own node, and either my node is right or it's wrong. I don't need to care whom I'm connecting to, because every piece of incoming data I have to verify myself. If I'm not doing the local math, I'm trusting somebody else, and that's a bad idea in the context of I2P. Trust: I can tell you, trust me, the earth is flat. We all know that the concept of trust means I may be believing in a wrong, made-up set of root data. And if I start to invent root data, I can prove anything, because the root data is made up. Now, if you build your system on trust, your system will grow, and we know in IT, at least from my specific scientific point of view, that the larger systems grow, the more problems we have in these systems, because we need to introduce regulation so that the trust is not abused. More regulation means later even more regulation, and so it gets more and more complicated over time. One of the typical solutions, at least what I lecture about, is: keep your system small. Base your decisions on math, base your system on math, keep it lean, and in the end add a cost function to prevent spam, or abuse, to be a bit more generic. I2P is, at least from my point of view, a network which enables small and lean systems. Right, where am I? 15:40. Some history: building a DNS on a fully distributed network is not a new approach. One of the older approaches are systems based on the hashcash function, which was properly described in the 1990s and then led to proof-of-work systems, and these proof-of-work systems created currencies, like, as we all know, Bitcoin.
Namecoin then came in, and other things which are proof of work. What I can guarantee you is that proof of work works. Proof of work as a cost function is mathematically, at least as far as we know today, perfectly working, but it's extremely inefficient, because it's a race. It's a race for the fastest solution. This is a bit trivial, but at the end of the day it's a race, and this race is inefficient. Now, I always resisted implementing yet another proof-of-work function; not because it doesn't work, I just didn't want it. What I also didn't want was the Filecoin / InterPlanetary File System solution, which is a validator approach, and Filecoin did just that: they used drand to select validators. But they're just shifting the problem from their own system to another system, and then they say it's solved, but that's not true. In the end you just move the attack vector away from your own system and open up another attack vector. And for me, just for me, currency-based, proof-of-work-based, or validator-based is not really an approach; and as an economist, which is what I studied, I feel very, very uneasy about non-fungible currencies. There aren't many that really are fungible; a few are. Do your own research and you will find out which are really fungible; the others are difficult, let me put it like this. Then there are naive concepts which are very nice and highly performant, but in the end, in the area of DNS and in the area of Diva, which is what we're talking about, you need immutability and integrity. Right, I want to say a few words about the CAP theorem, because consistency, availability, and partition tolerance form a triangle in this CAP theorem, and it's said that you have to choose two out of these three. Now some blockchain guys said: hey, we solved it, now we have all three.
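The hashcash idea behind all proof-of-work cost functions fits in a few lines: finding a nonce is a race (expensive), checking it is a single hash (cheap). This is a toy sketch with a tiny difficulty, not any production parameterization:

```python
import hashlib

def pow_solve(payload: bytes, difficulty_bits: int) -> int:
    """Brute-force a nonce so that sha256(payload || nonce) starts
    with `difficulty_bits` zero bits. This loop is the inefficient
    'race' the talk criticizes."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        h = hashlib.sha256(payload + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(h, "big") < target:
            return nonce
        nonce += 1

def pow_verify(payload: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Verification is one hash: cheap for every node to do locally."""
    h = hashlib.sha256(payload + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(h, "big") < (1 << (256 - difficulty_bits))

nonce = pow_solve(b"diva.i2p", 12)  # small difficulty so the demo is fast
assert pow_verify(b"diva.i2p", nonce, 12)
assert not pow_verify(b"diva.i2p", nonce + 1, 12) or True  # a wrong nonce almost surely fails
```

Doubling `difficulty_bits` squares nothing: each extra bit doubles the expected work for the prover while verification stays one hash, which is exactly why it works as a spam cost and why it wastes energy as a consensus race.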
At least with Byzantine fault tolerance I have my doubts, and honestly I do not see any concept out in the wild which really solves that problem, except proof of work, and we don't want that. So this year, and that was part of my biggest struggle, we had to leave what I talked about in 2023, exactly here at this place: Democratic Byzantine Fault Tolerance, which was developed by the university in Lausanne in Switzerland and also in Sydney, Australia. And sorry guys, with I2P this concept is not working, because, and we're talking about the fallacies right afterwards, about the problems with such networks, Democratic Byzantine Fault Tolerance was a fail. So with Diva chain we went into eventual consistency, because the big fallacies of distributed computing, known since the 90s, are things like: we have zero network latency, wrong; we have unlimited bandwidth, wrong; we have a secure network, wrong. And we all know that as developers, but sometimes in the lab we live in a perfect world, dream of something, create something, and then in the real world it doesn't work. That's why my biggest tip for every blockchain developer in the universe is: test it on I2P. If it's still working, you've probably done a good job, and that's exactly one of the core messages. I2P has so many network transport problems, which are the price for the privacy we want, that it's a very good test case, a very good transport layer, for all the blockchain developers out there, including myself. So, what we did in the last 12 months with Diva chain, and obviously you'll find it on GitHub: we created a transaction-based system which is barrier-free, immutable, trustless, and based on I2P, so fully anonymous. It's working now; it's been working for about three weeks.
Over the last three weeks, the students at the University of Applied Sciences in Lucerne wrote a little prototype with I2P, but they had a lot of API troubles and struggles, because I made mistakes; it was my mistake, and in the end I couldn't present the final prototype here — because of me, not because of the students; they did a good job. And what we're thinking about today is how to implement the cost function, because, as I already said, a barrier-free system will attract a lot of spam: a lot of DNS spam, a lot of content spam, a lot of whatever-we-can-use-this-system-for spam, and as one of the developers, that's not my intention. So probably it will be a function of availability and a function of cooperation. And when you read this now and think this is new: no, it's not. Filecoin has implemented this since 2014. The only problem they had was their validator selection; they made the mistake of using a validator function to implement their consensus. But they call this proof of storage, the function of availability, and the other one they call proof of window consistency, or something like that. You have to prove two things: first, you have to prove to the network, prove meaning mathematical proof, that your content is stored, and second, that your content is continuously stored. So these concepts are not new; I would just like to think about them a bit more and then implement them. I already talked about my core failures, or our core failures, in our very little team: Democratic Byzantine Fault Tolerance, a very nice concept, a very nice book, I learned a lot, but it didn't work.
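The two properties to prove — content is stored, and stored continuously — can be caricatured with a simple challenge-response sketch. Real proofs of storage are far more involved and, unlike this toy, do not require the verifier to keep the full content itself:

```python
import hashlib
import os

def storage_challenge() -> bytes:
    """A fresh random nonce, so old answers can't be replayed."""
    return os.urandom(16)

def storage_response(content: bytes, nonce: bytes) -> bytes:
    """The prover can only compute this if it still holds the content."""
    return hashlib.sha256(nonce + content).digest()

def storage_verify(expected: bytes, nonce: bytes, answer: bytes) -> bool:
    return answer == hashlib.sha256(nonce + expected).digest()

content = b"diva.i2p -> <b32 destination>"
nonce = storage_challenge()
assert storage_verify(content, nonce, storage_response(content, nonce))
# Repeating fresh challenges over time approximates the second property:
# the content is not just stored once, but *continuously* stored.
```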
The eventual consistency approach has been working for a few weeks; the API is highly unstable, I have a lot of coding work ahead of me, and I'm very much looking for feedback, so if anybody is interested in hacking on it, I'm always happy if somebody wants to contribute. The feedback from academia was also very positive, so I could show a few interesting things in the past months. Please, in the last minute, the takeaway: we believe that an eventually consistent DNS, or blockchain-like system, used for this DNS challenge is a reasonable approach. Eventually consistent, so we drop blockchain consensus and replace it with eventual consistency, transaction-based. The core challenges as we know them today: we need to implement a cost function which is reasonable; "decisions", in our wording, are nothing else than a global state where all peers in the I2P network agree on a specific state of data; and participation is very welcome. In the presentation on the web, which you'll find on this devroom's FOSDEM page, you find all the sources and some more stuff. So if you have questions, please shoot. Yes, please. Could you explain what you meant by immutability? The question was: could you please explain what you mean by immutability? The answer is: once written, never changed again. Yes, please.
Right, he's asking: in our system we're going to have a lot of traffic, we're going to have a lot of records stored, did I summarize that correctly? And that's a problem in terms of storage, right? First, compared to other approaches: for Diva, this blockchain DNS is a side project. Unlike Handshake or other projects, we never intended to replace the current domain name system of the clearnet; we always wanted to match I2P names, like diva.i2p, because nobody is going to give us a domain to map b32 addresses. So no, we don't have much traffic there, and so the storage problem is nothing I'm currently thinking about, but yes, there will sooner or later be scalability questions, you're absolutely right; in this baby state I don't really care yet. Yes, please. The question was: if it's immutable, how can I change things? In the blockchain world you never change a record, you just let it live in a block, or let's call it in a transaction, and then you just create a new transaction on top, and this new transaction is the new state, because in a blockchain you always look from the top, and the last state is the thing you believe in, because it's properly proved using math. Does that answer your question? Okay. Maybe a question from the phone? No other questions? Thank you.
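The "never change a record, create a new transaction on top" answer can be sketched in a few lines; the class, names and values below are invented for illustration, not the project's actual data model:

```python
# Minimal sketch of "immutable records, new state on top": records are
# never rewritten; a name's current value is simply the newest
# transaction for that name, looking "from the top".

class NameChain:
    def __init__(self):
        self._txs = []  # append-only list of (name, value) transactions

    def submit(self, name: str, value: str) -> None:
        self._txs.append((name, value))  # never mutate earlier entries

    def resolve(self, name: str):
        # the last transaction for a name wins
        for n, v in reversed(self._txs):
            if n == name:
                return v
        return None

chain = NameChain()
chain.submit("diva.i2p", "b32-address-1")
chain.submit("diva.i2p", "b32-address-2")   # "update" = new tx on top
assert chain.resolve("diva.i2p") == "b32-address-2"
assert len(chain._txs) == 2  # history preserved, nothing was changed
```

This is why immutability and updatability are not in conflict: the history is immutable, while the derived "current state" changes with each new transaction.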
Algo-rollover for .nl
Hello everybody, welcome to the DNS devroom, if you just came in. Our next speaker is Stefan, who will be telling us about the DNSSEC KSK algorithm rollover for .nl, which normally is a very exciting thing, but I trust Stefan to have made it a boring situation that's still fun to talk about. Yeah, thank you Peter. Welcome, my name is Stefan Ubbink, I work for SIDN, the .nl registry, and I'm talking about the KSK algorithm rollover we did in July last year for .nl. So why did we do this? What preparations did we make for this change? What was the planning like? How did we execute it? And what did we measure on the internet about our change? So why would we want to change the algorithm? The algorithm we used before was algorithm 8, which is an RSA algorithm, and we wanted to use a safer algorithm to keep up with the new standards, because since June 2019 the recommendation from the RFCs is to use an ECDSA algorithm for DNSSEC, and there's currently enough support in resolvers to do this: as you can see in the graphs, both RSA and ECDSA are supported equally by most resolvers. A plus side is also that the DNSSEC answers we are giving are smaller than the RSA answers, which gives us less impact when we are hit by reflection attacks. So it's better for the internet. On the way to the algorithm rollover we first replaced the HSMs we used for signing the zone with new HSMs from Thales, which could do 20,000 signatures per second, a big increase from our previous HSMs. And we started with a test run on our test environment without any changes, to see how this works and how much time it takes, because there are a lot of things you can change which would change the time used for some steps in the rollover process. A normal run took about three weeks without any changes.
To be able to do it efficiently, we also made a test lab policy for OpenDNSSEC, which rolled very fast, to be able to see what changes were done and to create some scripts to follow everything happening in the environment. And we also used a local DNSViz installation to see if a resolver, for our setup it was Unbound, could indeed resolve the new situation. For that we also created a fake root, so we could play root operator, change everything, and validate everything to see if it all worked without any issue. That went quite well. Then we went to our acceptance environment, in which we used a daily copy of the public .nl zone, which has 6.4 million domain names in it, and many more records. And then we had a memory issue. We used all 128 gigabytes of memory, no swap usage, but still the system halted on something. After we added swap to the system, it ran again, so it continued; it was not broken, everything continued where it left off. It was strange, but it helped us, so we could prevent this issue in production. Another thing was that normally we generate a full zone every half an hour, and in a normal run it took about 24 minutes to generate the zone, sign it and publish it, including validation. After adding the ECDSA keys we did a run and it took 45 minutes. That's not what we wanted, because if you want to publish every half an hour, you cannot take 45 minutes to publish something. So we had to find a way to make it less than 30 minutes with both RSA and ECDSA. We saw that mainly the validation part cost a lot more time, because ECDSA is harder on the validation side than on the generation side. So we ran some things in parallel: we compiled the zone with BIND to raw format, we validated with validns, and all those things we did in parallel, and we added parallelization to validns; at least it was already available, but we used the switch to enable it as well.
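The parallelization idea described above can be sketched roughly as follows; the chunking, the worker count and the toy "validation" check are invented stand-ins for what validns actually does:

```python
# Rough sketch of parallel zone validation: split the signed zone into
# chunks and validate them concurrently instead of in one sequential
# pass. The "RRSIG present" check is a toy stand-in for real validation.

from concurrent.futures import ThreadPoolExecutor

def validate_chunk(records):
    # stand-in for real DNSSEC validation of a slice of the zone
    return all("RRSIG" in r for r in records)

def validate_zone(records, workers=4, chunk_size=2):
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return all(pool.map(validate_chunk, chunks))

zone = [f"example{i}.nl A 192.0.2.{i} RRSIG ..." for i in range(8)]
assert validate_zone(zone) is True
assert validate_zone(zone + ["unsigned.nl A 192.0.2.99"]) is False
```

The same pattern, applied with all cores, is what brought the 45-minute run back under the half-hour publication budget.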
So we are using all cores on our systems to do the validation, and then we got to about 27 minutes of generation. That's under 30 minutes, a very good job for us, and so we were able to continue with the new zone generation. So how did we plan this thing? We were in June and we knew it would take some time. We saw that we might have a ZSK rollover coming up, and we didn't want to do this during a ZSK rollover, because then the zone would increase even more because of the extra signatures. So we had to plan around that, and we also had some dates we could use for the validation, and people in the organization had to approve the request at IANA to change the DS in the root. We expected that the IANA change would take three days, and so we came up with this plan, and with all the holidays for people, et cetera, we were able to do this plan. As you can see, we have some asterisks next to some dates, and that's because these are dependent on the IANA change: if IANA took more time than we expected, those dates would change. That is something we couldn't predict, but we thought three days should be normal and should be okay. Luckily for us, we did a blog post about this change, telling people we were going to do it: so if something breaks, you know we are doing this, and you have these dates to see if everything is going according to plan, and we will update this if there's some issue or the dates change. But it was all good, and we planned it very well, because all the dates mentioned here were the dates that were used. So we did it according to plan. On executing a plan: it's good to have written-out commands, just to copy and paste them when you need them. You only have to check: yes, I'm doing this on the correct system; yes, it's all written correctly; but you don't have to think about it anymore.
So during the execution, we did continuous checking with the script we wrote, and we did some DNSViz runs on the public DNSViz site to show people that we are changing, and to have some records that we can show; I will show the DNSViz pictures later. As I mentioned before, there would be an increase in file size for the zone: before, it was 4.5 gigabytes in size, during the rollover 4.6, and afterwards, with the smaller signatures, only about 2.3 gigabytes, and that's very nice. Of course we had a go/no-go moment, because while we have double signatures we can still go back without any disruptions, et cetera, but once we go forward we aren't able to go back as easily. So we had to do a bit of a check, and that went well. So, some pictures: the algorithm 8 situation; the policy change, where you see the addition of the ECDSA keys; then we added the algorithm 13 DS to the root and removed the algorithm 8 DS from the root; afterwards we stopped using algorithm 8, and then this is the new situation with only ECDSA. During all this time, we also did some measurements; a colleague of mine, Moritz Müller, did most of the measurements. He wrote a rollover monitor quite a few years back and used it again. We measured the items mentioned on the slide. I want to mention that we only measured two root servers, because all root servers should give the same answers, and we didn't want to measure all 13. What might be interesting is that you see a lot of numbers in this graph, and that was a bug in the rollover monitoring software. You also might notice that there are multiple lines at the top and at the bottom, and that's a measurement issue caused by the use of a small buffer size while still trying to get key IDs from the answer. That's why you see a lot of changes, and we saw what we were seeing and thought: this is not correct, what's happening? Because if I do manual checking, everything is fine. What's happening here?
Finally, we were able to find the issue and fix it. Another interesting thing is that during the change we looked at the response sizes we sent. In this table it's only ns1.dns.nl; other systems give similar, but not the same, answers, because the sizes may differ based on the nameserver implementation that's used, because of name compression. Another interesting thing here is that the NXDOMAIN and DNSKEY responses increase during the rollover, but the NS-set response does not increase; it's less, and that's because the A and AAAA records are in the additional section, and during the rollover that section grows a lot, so only the A and AAAA records for ns1.dns.nl are in the answer, but not for all the nameservers that are in our zone. If we look at traffic: normally we have about one percent TCP traffic, during the rollover we had about five percent TCP traffic, and after the rollover it was back to normal again. Here you see a graph with a logarithmic y-axis, and you see that TCP increased a lot, about eight times more TCP traffic, and after the rollover it levels off again, so we are back at a normal level. Globally we measured no impact at all, as far as we know; I don't know of any trust issues people had. You can see in the left picture the adding of the ECDSA key and afterwards the removal of the RSA key, and the right picture is the trust chain, which stayed constant for the resolvers. And that's my talk already. Are there any questions? I've got two questions. The first one is on slide 17: you mentioned that during the rollover the NS response size becomes smaller. Yes. I'll let you ask the complete question.
So you said the NS set is getting smaller, yes, and the question is? The question is: there is an RFC out there that says glue is mandatory. If the size of the response is getting smaller because you're not including glue, you have to set the TC bit. Did you measure for that? I'll repeat the question: there is an RFC that says glue is mandatory, and did we measure anything about this? What I know about this situation is that we looked at the DNSViz information we got, and in the measurement for ns1.dns.nl the glue is available, but only for that name, so not the glue for the other NS records. And I don't know if we looked at whether the TC flag was set and whether it was acted on. And the second question? Yeah, my second question is: I noticed you switched to Thales. With regards to support from the Thales company, did you test that you would get proper support if you needed it? I will repeat the question: he said we switched to Thales, and did we test the support we had with Thales before doing this transition? No, we did not test the support beforehand, and technically we did not switch to Thales: we used to have Luna HSMs, and that product line was taken over by Thales, so we continued with the same Luna HSM products as before. We had contact with Thales before we replaced the HSMs, but we did not, before the rollover, try again to contact them to see how support would handle questions from us. Which might be a very good idea as well. Thank you for that; I'm asking for a friend. Yes. A related question to that: did you have any rollback plans in case something went bad? The question is: did we have any rollback plans? As I mentioned, we had a go/no-go moment: if everything is okay, we go forward; if things start to fail, we go backwards.
After going forward we had some thoughts about how to continue, but that might have had impact, so the decision of what to do when depended on the situation at that moment. We didn't write out every possible scenario, because that would be too much; based on our testing in our acceptance and test environments, we had confidence that it would all go correctly, and we would look at the situation in the moment to see what the next step would be if something went wrong. Does that answer your question? Yes. If you had the choice to redo your procedure, do you think it's worth it to have an HSM at all, regarding the added complexity and the risk of losing your key in case backups are not there? Rather than having a hidden signer that is air-gapped, in your words, for example. If I understand your question correctly, it is about backups? Are you happy with having an HSM, versus having an air-gapped Linux machine that has the KSK on disk and does the signing, with just the DNS updates going out into the world? I hope I understand the question, but if I try to answer it: we do not have an air-gapped system. We do have regular backups of all the HSM keys, so in that way we do have an HSM that is air-gapped, because the backup unit is an HSM, and we can use it to restore keys if necessary. Do you think it's worth it? Worth it? Whether it's worth it to have an air-gapped HSM depends on your risk assessment, whether you want to have an air-gapped system; and if you are going to do this in a public cloud, for instance, you might want a situation where you have an offline KSK. So that might be a setup. Did you conduct a penetration test on the HSM beforehand, and what are your procedures in case a security issue becomes known in these HSMs? Did we do a pen test on the HSMs, and the next question was? What would you do if a vulnerability becomes known?
No, we did not do a pen test, and if a vulnerability became known, we would have to investigate what happens and how we can react to it: which information has leaked, and how can we recover from that? Those are not worked-out scenarios, at least not known to me at the moment. Why no pen test? Why no pen test? I have no idea. Yes? I noticed that the NXDOMAIN response goes up to 1402 bytes. I'm curious, what is your setting for the maximum UDP size? So your question is: what is our setting for the maximum UDP size? We have 1232 as the size of the UDP packets, as is recommended; other parties that also provide anycast for .nl have slightly different settings for that. That's why we focused here on ns1.dns.nl, because that's something we operate ourselves. The second question. The second question was: you added the algorithm 13 DS records to the root zone, so you ran a dual DS. Was that to allow removing the algorithm 13 DS records if you had to in a hurry, or just as an additional acceptance step before you removed the algorithm 8 DS records? Because during some fairly recent transitions of other zones, they basically just did a swap. The question was: why did you not remove the algorithm 8 DS at the same moment you were adding the algorithm 13 DS? Correct? Yes. We did that because we wanted to have a solid path and the possibility to go back without any issue. So rather than taking one big step, we took two small steps, to ensure more stability, at least from our point of view, and a good night's rest for us. Any other questions? Maybe not so much a question as a statement, if that's allowed. Yeah. One. I think it's incredibly brave for a national top-level domain to take a risk, right? And I mean that as a compliment. Because changing an algorithm is different from changing a key; changing an algorithm is fundamentally hard.
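The 1232-byte UDP limit discussed in the question boils down to a simple rule; this sketch only illustrates the truncation trade-off, and the response sizes used are invented examples:

```python
# Illustrative sketch: responses larger than the advertised maximum UDP
# payload (1232 bytes, the value .nl uses) must be truncated with the
# TC bit set, after which the client retries over TCP. This is why
# bigger signatures during a rollover drive TCP traffic up.

MAX_UDP = 1232

def plan_response(size_bytes: int):
    """Return (transport_of_first_answer, tc_bit_set)."""
    if size_bytes <= MAX_UDP:
        return ("udp", False)   # fits: plain UDP answer
    return ("udp", True)        # truncated, TC=1 -> client retries on TCP

assert plan_response(800) == ("udp", False)
assert plan_response(1402) == ("udp", True)  # the large NXDOMAIN case
```

This also explains the measurement earlier in the talk: during the rollover, with double signatures, more responses crossed the limit and TCP traffic rose from about one percent to about five percent.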
And for SIDN to do this as one of the early adopters, not the first one but one of the early adopters, I think is very commendable, and I think you set an example for the rest of the industry, for all the other top-level domains, including ICANN; we're looking at you. We're looking at you to see what you're doing well, and of course we hope nothing goes wrong, but we also need to have that information. And one of your colleagues is working with ICANN to make sure that if we ever do something in the root, that goes well too; he's part of that group as well. So yeah, we're looking at this, we're hoping all the top-level domains follow the same example, and all my credit goes to you guys. You're welcome. Thank you. I want to repeat that for the online audience, because if you get a compliment like that: the person, Roy Arends from ICANN, said that it is very brave for .nl, as a registry, to do this algorithm change at the forefront of the people who are doing the change, and that registries should follow and do this change as well. And we have shown that it's possible, and without any incident. So any other TLDs, please follow us. Good summary. Thank you. Thank you.
Bootstrapping time on OpenBSD
Welcome to the DNS devroom. Our next speaker is Otto, who is an OpenBSD developer. He's going to talk about a fateful intersection of DNSSEC, NTP, and maybe two other terrible things. Yeah. Okay. So I'm going to talk about bootstrapping time, specifically how we implemented that on OpenBSD, but I think the approach could be used in other systems as well. A small introduction: OpenBSD is a BSD derivative. We focus on security, and we do that in several ways. For example, privilege-separated daemons, in which we separate the various tasks a daemon has to do into separate processes. Each of those processes has minimal capabilities, and they communicate with each other through pipes, exchanging messages. There are also a lot of other techniques, from memory management, which I'm also pretty involved in, to new APIs that are, let's say, less easy to misuse, things like that. Apart from that, we also try to make a useful system, so we like to have sane defaults and focus on a system that is, out of the box, a nice system to work with. By default we do not have a lot of services active, but if we consider a certain functionality to be included in a default configuration, the configuration you get when you install the system, we are quite strict about that, in the sense that it has to be functionality which is useful for a very, very large fraction of our users. But also, the actual implementation is then scrutinized even more, because it's a higher risk, so we focus extra on the security aspects of it, including the architecture of the software itself and the specific implementation. I'm now going to talk about time, and we'll see a bit later how that also involves DNS. Originally, when OpenBSD starts, it gets the time from a battery-backed real-time clock, if your hardware has one, because not all hardware has it, and even if you have hardware that has it, it's not always functioning properly.
If you think of older hardware, the case of "my CMOS battery ran out" is pretty well known, and most battery-backed real-time clocks then give some default value way back in the past. The booting system tries to read the clock if it's available; if that fails or there's no clock, the time is set based on a timestamp that is stored in the root file system, which says: this was the last time the file system was modified. Basically, if you unmount the root file system, which happens on an ordinary reboot or shutdown, that timestamp gets set as well. So if you reboot the machine, you probably have a timestamp which is a little bit in the past, but reasonably okay. It's a bit behind, probably; especially if you shut down your machine, go on vacation, and you don't have a real-time clock, because then you come back from vacation and your clock is two weeks behind or so. So that's the problem. We have an NTP daemon implementation, which I'm going to talk about a bit more in a second, but originally that implementation did not bump the clock; it would only gradually speed up or slow down the clock to adjust it, to make sure that the time corresponds to the NTP-derived time. You could enable bumping, but it was not the default, because we said: we are not going to make it a default, because we don't really have enough confidence that it will do the right thing. Why not? Because NTP in itself is not a secure protocol; that's one issue. And also, we would like to have more than one source of time, not only NTP, even if you talk to multiple peers; we would like to have an independent way of validating the time we see. So we formulated some goals a few years back, and we'd like to be pretty sure that if you boot up an OpenBSD system, you have the proper time, if you have network connectivity.
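The fallback order described here, RTC first, then the root file system timestamp, can be sketched as follows; the sanity floor and the concrete numbers are invented assumptions, not OpenBSD's actual values:

```python
# Sketch of the boot-time source fallback: use the battery-backed RTC if
# it exists and looks sane, otherwise fall back to the root file
# system's last-modified timestamp (a little behind, but reasonable).

def boot_time(rtc_time, fs_timestamp, sanity_floor=1_600_000_000):
    """rtc_time is None when there is no RTC; RTCs with a dead battery
    typically report a default value far in the past."""
    if rtc_time is not None and rtc_time >= sanity_floor:
        return rtc_time, "rtc"
    return fs_timestamp, "root-fs timestamp"

assert boot_time(1_700_000_000, 1_690_000_000) == (1_700_000_000, "rtc")
# dead CMOS battery: clock reset to an ancient default value
assert boot_time(0, 1_690_000_000)[1] == "root-fs timestamp"
assert boot_time(None, 1_690_000_000)[1] == "root-fs timestamp"
```

Either way the result is only a starting point; the rest of the talk is about correcting it safely over the network.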
So that's a nice goal, but we made things a bit harder for ourselves by stating: we do not fully trust NTP replies. Like I said, by default NTP is an insecure protocol, and the design of the protocol is also a bit dated; you can compare it a bit to the original DNS implementations: security was not a big thing at that time. We'll talk about that a bit more later. But the goal is still to get the correct time on boot with a high level of trust; not necessarily a very high level of trust in the sense that you have a cryptographic proof of it, that's maybe a goal for the coming years, but at least a high level of trust. If there's no battery-backed clock available, or it is not functioning properly, we'd still like to end up with the proper time. For example, cheap boards like the Raspberry Pi and other boards do not have a battery-backed clock at all by default, and you can also have cases where very expensive servers forget about time when you switch them off. So the setting is: if we can solve the problem in this quite difficult situation, where we lack hardware support and things like that, then the easier cases, where you do have a proper RTC clock or other facilities, become easy too. Now, if we say we need to be able to do DNS to resolve NTP peers, it might be that the resolver we are using is DNSSEC enabled. If that resolver is running on another system, it's quite easy; probably that other system already has the proper time. But if we are running our own resolver on the same system and we do not have the proper time, then DNSSEC is going to complicate matters. So we do want to consider, at least, what we should do in that case. Now a few words about the NTP protocol. It's pretty old, let's say the same era as the DNS protocol, and there are some design similarities between them. For example, in DNS, a request and an answer have basically exactly the same format.
NTP is the same. There's also the focus on UDP, of course, and also the property that in the reply that comes back, a lot of the information, maybe even all of the information that you sent out, is coming back. So you, as a client, have a reasonably easy task: you only have to consider the answer, because the answer contains all the information you sent out earlier. You only have to consider what's in the reply packet, do some processing, and you can continue. But that comes with having to trust the reply packet even more than you maybe would want to. Later there were additions to the NTP protocol. Shared keys were introduced: if you had an NTP peer with which you had some form of relationship, you would exchange a key, share a key with that other party, and then you could secure the NTP packets, so you had pretty good confidence that you were receiving replies from a trusted source. Later on there were even more extensions: NTS was invented, which is Network Time Security, and that includes a key establishment protocol which is pretty complex. So far we have not wanted to implement that yet, but it might come at some point in time, because of course that will give you something more cryptographically sound. And there's a process handling constraints; constraints are a thing I will talk about later. So in our implementation we do not have any cryptographic proofs of the validity of the data, but we do have basic spoofing protection. In the NTP protocol there's a field called the transmit time, and according to the protocol, the server which answers the question just has to echo that field. That's 64 bits, and the server is not looking at that field for any reason other than to echo it.
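The cookie trick described next can be sketched in a few lines; the in-memory dictionary and function names are invented for illustration, this is not OpenNTPD's actual code:

```python
# Sketch of the spoofing protection: put a random 64-bit cookie in the
# transmit-timestamp field, remember it together with the real send
# time, and only accept replies that echo the cookie back.

import os
import time

pending = {}  # cookie -> real transmit time

def make_request() -> bytes:
    cookie = os.urandom(8)           # random 64-bit "transmit timestamp"
    pending[cookie] = time.time()    # remember when we really sent it
    return cookie                    # goes on the wire instead of the time

def accept_reply(origin_field: bytes):
    # the server must echo our cookie; an off-path spoofer can't guess it
    sent_at = pending.pop(origin_field, None)
    if sent_at is None:
        return None                  # bad cookie: drop the reply
    return sent_at                   # use the *real* send time for NTP math

c = make_request()
assert accept_reply(b"wrong cookie!") is None   # spoofed reply rejected
assert accept_reply(c) is not None              # genuine echo accepted
```

Note how the last line matches the talk: the random value defeats spoofing, but the time computation must use the stored real transmit time, not the cookie.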
So if we fill in a random cookie there, we can at least protect ourselves against an attacker who is trying to spoof us but is not able to read our outgoing packets. Of course that comes with storing some state in the client, because you have to remember which cookie you sent out, but the protocol allows for this without any changes. When you are actually computing the time, and there's an algorithm in the NTP protocol which allows you to, let's say, filter out the round-trip times and get a good idea of the server's time, you have to use the original send-out time, and of course not the random thing you filled in. The trust model in the original NTP protocol is a pretty complex statistical analysis of all the replies you have seen from different peers. We take a simpler approach: we send queries to several peers, we collect results, we filter out things we consider bad, marking bad or unreliable servers, servers that do not reply, servers that reply with a bad cookie, and we select a median time. And we use constraints, which are a completely different source of time information, by doing HTTPS requests to certain servers. The nice thing about an HTTPS request is that the reply header also contains a timestamp. That is a rough timestamp, one-second granularity, so low resolution, but we use it to filter out bad NTP replies: if an NTP reply is outside our rough, low-resolution constraints, we say, skip that. There is a small complication there, because the certificate check, without any idea of the real time, needs a timestamp to answer the question: is this certificate valid now? But we do not know what "now" is. So what we do is use the timestamp from the reply itself and say: well, it is at least consistent with what they're saying, so the HTTPS request is valid.
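The filtering and selection just described, drop NTP-derived times that fall outside the coarse HTTPS constraint window, then take the median of the survivors, can be sketched as follows; the tolerance value and all times are invented examples:

```python
# Sketch of peer selection: reject NTP times outside the rough HTTPS
# "constraint" window (Date header, 1-second granularity), then take
# the median of the plausible ones.

from statistics import median

def pick_time(ntp_times, constraint, tolerance=30.0):
    """constraint: coarse time from an HTTPS Date header; tolerance is an
    invented window width for illustration."""
    plausible = [t for t in ntp_times if abs(t - constraint) <= tolerance]
    if not plausible:
        return None          # everything failed the sanity check
    return median(plausible)

# three honest peers around t=1000, one lying peer far in the future
assert pick_time([998.0, 1000.0, 1002.0, 99999.0], constraint=1000.0) == 1000.0
assert pick_time([99999.0], constraint=1000.0) is None
```

The appeal of the median is that a minority of lying or broken peers cannot drag the chosen time far away, even without any cryptography.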
And we'll come back to that later. Okay, but there is also a DNS dependency, and that is because we want to be able to select NTP peers by name. We have things like pool.ntp.org, which are very dynamic and change all the time, and also location based, so depending on your query particulars you get a different answer. And you want to have DNSSEC validation. Now, DNSSEC signatures contain a validity period, with the same problem as with certificates. So here we have the hardest case: if we run a DNSSEC-enabled validating resolver on the same host as we are trying to boot, we have a bootstrap issue. Luckily there's a way around that, and that is the Checking Disabled (CD) flag in the DNS request header. You can say to a DNS resolver: I want to resolve this address, but do not do any DNSSEC validation. So that's easy, at least from the protocol point of view: you can just set that flag and have at least some form of DNS resolution. But in the current API, or the API of that time, which also dates from the 80s or 90s, there was no way to enable that. Now we come to another point: because OpenBSD is a complete system, where we build the C library, the APIs, the applications and the daemons that come with it, we could just add that API and then assume in our application that it is available. So this is a part of resolv.h, the source code; we introduced a new flag to set and use CD. That enables us to use the DNS resolution APIs, which also use a bit of an ugly mechanism that stems from the 80s: a global variable, a struct called _res, which allows you to tweak the way DNS requests are done in libc. These days this would be designed completely differently: you would probably have some local object which you pass each time to that code, or have some context or something like that.
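For concreteness, the CD flag mentioned above is just one bit in the 16-bit flags word of the DNS message header; this stdlib-only sketch builds such a header by hand (the query id is an arbitrary example):

```python
# Sketch of what "set the CD (Checking Disabled) bit" means on the
# wire: bit 4 of the DNS header flags word (RFC 4035), built here with
# the stdlib only.

import struct

CD = 0x0010   # Checking Disabled
RD = 0x0100   # Recursion Desired

def dns_header(query_id: int, want_cd: bool) -> bytes:
    flags = RD | (CD if want_cd else 0)
    # id, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    return struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)

hdr = dns_header(0x1234, want_cd=True)
flags = struct.unpack("!H", hdr[2:4])[0]
assert flags & CD                                  # validation disabled
assert not (struct.unpack("!H", dns_header(1, False)[2:4])[0] & CD)
```

Setting this bit is what lets the booting host get an (unvalidated) answer out of its own resolver before it knows the time; the new resolv.h flag simply exposes it through the old API.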
But this is from the old days, where a global variable or global struct contains the flags to be used. So what we do is: if we know that the time is not synced yet, and resolution without the CD bit fails, we retry with the CD bit set and hope that it will get better. That way we have an answer; of course it's not DNSSEC validated, so we are closer, maybe, but still not there. So the revamped mechanism is now: we get the time from the RTC; if that fails, the timestamp from the root file system, exactly the same as before, so the kernel is doing exactly the same as it did before. When OpenNTPD starts, it will get constraints, so that's a new thing: it will try to get a rough idea of the time. And it will also send out NTP requests based on the DNS requests it has done, and those NTP replies will be validated using the constraints derived from the HTTPS requests. And we will bump the time if it's going forward, and otherwise do a gradual adjustment. We will bump only forward, because we do not like to have logs with time going back; monotonically increasing time is pretty important. If we would have to set the time backwards, and that is probably an indication that something is really wrong, we don't do that, and we scream in the logs and things like that. After that, the regular NTP things just happen: gradual adjustment using several peers, et cetera. So then we have some idea of the time, and we'll do it one more time: once we are synced, in the sense that the NTP time and the system time agree, which can take several minutes of course, because you have to adjust slowly in many cases, we'll do it again. But then we say: well, we know we are synced, so we do real DNSSEC validation, we do not have to fall back to setting the CD bit, and we use the constraints to check the actual time.
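The "bump only forward" policy can be sketched as a small decision function; the threshold and slew rates below are invented for illustration and are not OpenNTPD's actual tuning:

```python
# Sketch of forward-only clock adjustment: a large positive offset is
# stepped immediately; a negative offset is never stepped, only slewed
# slowly (and complained about in the logs), so time stays monotonic.

def adjust(clock: float, offset: float, step_threshold: float = 1.0):
    """Return (new_clock, action) for an NTP-measured offset in seconds."""
    if offset >= step_threshold:
        return clock + offset, "step-forward"
    if offset < 0:
        # never jump back: slew gently, capped per interval, and log loudly
        return clock + max(offset * 0.01, -0.5), "slew-back (logged)"
    return clock + offset * 0.5, "slew-forward"

t, action = adjust(1000.0, 120.0)          # boot: clock two minutes behind
assert action == "step-forward" and t == 1120.0
t, action = adjust(1000.0, -300.0)         # something is really wrong
assert action.startswith("slew-back") and t >= 999.5
```

The asymmetry is the whole point: a forward step never makes log timestamps go backwards, while a backwards correction is treated as suspicious and handled gradually.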
If at that point things are not okay, then we will of course scream in the logs that we cannot validate your NTP peers, but then it's a system operator decision what to do; in a local LAN, of course, that might be a very suitable case. The default config uses several NTP sources, like pool.ntp.org, but also Cloudflare, which offers an NTP server on all their PoPs, so with a single anycast IP you get a local, or at least close-by, time source; that's the idea. And also an assorted set of constraints from, let's say, well-known HTTPS servers, like from Google, and we also use Quad9 servers for that; they have, let's say, a stamp of approval. So this is the default configuration. And we deliberately mix: we are not using quad 8 for DNS, because then there would be a tie between the constraints and the DNS, both from Google; quad 9 is a completely different set of systems from Google, so that, let's say, diversifies the different sources we're getting time from. A little detail: "servers" means that if the DNS request produces multiple IP addresses, we query all of them, while "server" is a single source. And "sensor" is for when we run on a system which has hardware clocks, for example GPS based, or you have Meinberg hardware, a PCI card you can insert in your system which gets the time from the DCF77 clock in Germany, or other sources. We also use those, of course, as trusted sources; that's the thing we call time sensors. So that is my talk. I'd like to thank the other OpenBSD developers who cooperated with me on this. I'm reachable on Mastodon, and of course via OpenBSD.org. And I'd like to ask if there are any questions. Yeah. So you mentioned that NTP never sets the time back. But what happens, for example, if you have a hardware RTC clock that's misconfigured, for example set one year in the future for some bizarre reason, and then you're a bit back?
Yeah. So the question is: our NTP implementation never bumps, never hard-sets, the clock backwards, so what happens then? If for some reason your RTC clock is misconfigured or set to the wrong time, then we require operator intervention; it's a human decision to do that. Of course, you can still do it with the date command, or with rdate, where you say: get the time from a different system. But that is not a thing which happens automatically. We scream and say, well, this is not right, but we require operator intervention for that case.

Next question. How tolerant is this if you don't have network during boot, because it's a laptop that is going to join a wireless network, and that takes 10 seconds? Yeah. If we do not have a working network configuration when ntpd starts, it waits about 10 seconds, and if no actual traffic was seen by then, it says: well, sorry, cannot do it, I'm just going to continue booting. Because at that point in time the boot script stops; we'd like to have as many daemons as possible starting with the correct time already set, so this is very early in the boot process. Of course, if you have a complex configuration, with WLANs and whatever, then that's not going to work. But ntpd tries its best, and then says: well, sorry, I cannot do it, I'm going to do my background tasks like I always do, but I'm not setting or bumping the time. So there.

Sorry, you're out of time. Oh. Okay, thank you.
dnsconfd: system-integrated DNS cache
My next speakers are Tomas and Petr, who will tell us about dnsconfd, which is new to me, so I'm quite curious.

So, hi everyone, my name is Tomas Korbar, this is my colleague Petr Mensik, we work at Red Hat, and today we've come to talk to you about our new project, which is called dnsconfd. So let's start with the motivation behind this project. Last year we received a request from a user that required us to make it possible for Unbound to be used as a local DNS cache and to consume configuration from NetworkManager. In the past we had the dnssec-trigger package for this, but we dropped that in RHEL 9. So, should we reintroduce it? We thought about implementing a D-Bus API in Unbound, just as dnsmasq has, and then implementing a NetworkManager plugin, just as dnsmasq has. But then we realized that if a similar request came in the future for a different service, we would be doing the same thing over again. So we thought about creating a new project that would serve as a conduit between NetworkManager and local DNS caching services. This project is dnsconfd.

Our requirements for it: to be able to easily exchange the underlying DNS cache, and to add more services in the future without too much work. We need to be able to support split-DNS configuration, and we need to be able to auto-configure without manual interaction from the user. Also, we would like it to use the already present system configuration, defaults, and security features that are already built in and that we maintain inside our distribution. And the behavior needs to be configurable enough that you can change the handling of corner cases and are not caught off guard by behavior you would not expect.

Okay. Let's get back in the past a bit and say something about why Fedora 33 introduced a DNS cache. What it brought us was the possibility of multiple simultaneous VPN connections at the same time. And that's great.
It also made it possible to configure global servers but still reach names which are accessible only on the local network, which is nice for DNS over TLS, but that was not enabled yet, and still isn't. And it brought us an excellent configuration presentation with the resolvectl command, compared to what we had before; that was clearly better. And it also introduced a well-documented D-Bus interface for configuration changes, for configuration display, and also for name resolution. They have a nice article about it, but that's not our job here.

So what do we mean by split DNS here? When you connect to a VPN without some smart solution like this, you send all name queries to just that single VPN, and use your primary connectivity only to deliver traffic to the VPN server, which consumes everything you use. At that time, you cannot use any other connection interfaces you have on your laptop or mobile phone or something else, because you use just the one DNS server, or set of DNS servers, that the VPN knows. With split DNS, you can send different name queries to different sets of servers provided by the different networks you are connected to at the same time, and most current devices today are capable of connecting to different networks at the same time, including multiple VPNs. All you need are non-conflicting names for them. So, for example here, the names are different, and if names in those domains provide some useful services, you can access them at the same time.

And we could end here and thank the systemd guys if everything worked great, but sadly that was not entirely the case. I have listed a few issues I think are important and still aren't fixed sufficiently, but there were more bugs in the meantime; some were fixed, some are still not. For example, it prevents any usage of DNSSEC on the host where it is enabled by default configuration, both on Ubuntu and on our Fedora, because it just doesn't forward the DNSSEC bit set in the queries it receives.
So any library which is quite capable of using DNSSEC cannot use it, even if your network's infrastructure provides the capability for it. Also, at least on Fedora and Ubuntu desktop, I think, you would be quite surprised that top-level domains often "do not exist", because it sends names without a dot just to the local interface over a multicast protocol, and if it doesn't find something, which it usually doesn't, it just returns: no, that does not exist. So the com domain does not exist, but the github.com domain, surprise, does. And this happens even on the server edition, where I think it is really unwanted. Also strange is that when a response fails because of a DNSSEC validation failure, it still might contain a valid answer in the response, which is unexpected, and no other implementation I know of does it this way. So "dig +short dnssec-failed.org", even with DNSSEC enabled in systemd-resolved, gives you a very nice address. And I've listed just a few issue numbers.

So the lessons we take from this: we want split-DNS functionality auto-configured, we want the possibility of DNS over TLS, and we want a nicer front end than we had. The systemd people have very good expertise in system integration and they are quite good engineers, I know it, but they lack expertise in the DNS protocol area, and I am afraid it is visible. At the same time, DNS resolver people are excellent in the DNS protocol area, but their integration into the system is often very limited. We think only the integration is missing, and that is what we are trying to provide. So we want to reuse existing functionality, provide some common interface to set forwarding to different servers so that it doesn't change much, and provide a nicer front end for showing what is configured, regardless of which DNS cache is used in the end.
So what do we need for split DNS? We need some local address which receives queries from applications, usually localhost; we need the ability to configure different domains to be forwarded to different sets of servers, and of course some default for the root to be forwarded to the global default; and we also want the ability to reconfigure the service without stopping it and flushing the entire cache by restarting it. Here is the list of servers we have in Fedora, and I think all of them are able to provide split-DNS functionality; most of them are also able to provide DNS-over-TLS functionality. But only dnsmasq has some D-Bus capability, and that is quite limited, and dnsmasq has its own issues.

So our approach is: use what already exists, provide just a front end and component coordination, do not reinvent the wheel. We do not want to handle DNS queries ourselves in our service; we want proper services to do it, and we just provide configuration for them. As I have already shown, almost every open source resolver has that ability. And because we are not handling queries, a single-threaded application is enough, and we wrote our prototype in Python to verify this would work. What we also want is to rewrite /etc/resolv.conf only when we have verified the basics, that the service is running, and to restore it when our service is stopped; I really hate it when you uninstall something and then have to fix resolv.conf by hand.

And we want a standalone daemon, because we think not all primary configuration should be done in NetworkManager, so there is some unified way to configure it: whether systemd-resolved or our daemon is used should not change anything, it should be just an implementation detail. And we think the common part is the biggest one, and just a very small cache-specific module is required to implement different caches. What we plan to support is what we have in RHEL, that is primarily Unbound, and also BIND and dnsmasq.
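As a rough illustration of the split-DNS forwarding decision described above — different domains forwarded to different sets of servers, with the root as global default — here is a longest-suffix match in Python. Zone names and addresses are invented, and this is a sketch of the general technique, not dnsconfd's code.

```python
# Hypothetical zone -> forwarder mapping, as might be pushed by
# NetworkManager for two VPN connections plus a global default (".").
ZONES = {
    ".": ["192.0.2.53"],
    "corp.example.": ["10.0.0.2"],
    "lab.corp.example.": ["10.9.9.2"],
}

def pick_servers(qname: str, zones=ZONES):
    """Return the forwarders of the most specific zone matching qname."""
    labels = qname.rstrip(".").split(".")
    best, best_len = zones["."], 0
    for zone, servers in zones.items():
        if zone == ".":
            continue
        zlabels = zone.rstrip(".").split(".")
        # A zone matches if its labels are a suffix of the query's labels.
        if len(zlabels) <= len(labels) and labels[-len(zlabels):] == zlabels:
            if len(zlabels) > best_len:
                best, best_len = servers, len(zlabels)
    return best

print(pick_servers("github.com"))             # falls through to the default
print(pick_servers("www.corp.example"))       # routed to the VPN's server
print(pick_servers("host.lab.corp.example"))  # the more specific zone wins
```

The longest-match rule is what lets two simultaneous VPNs coexist, as the talk describes: each only receives queries for its own (non-conflicting) namespace.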
And we want to provide basic compatibility for services using the systemd-resolved D-Bus API directly, because some things already use that, but we do not want to implement every aspect of what they implemented, because we do not think that is necessary.

So how does the flow of configuration look right now? NetworkManager receives its list of DNS servers from either DHCP or the connection profile, and then it pushes the configuration through the D-Bus API into dnsconfd. dnsconfd then translates this configuration into an internal representation that we think is general enough for most underlying DNS caches, and then we use the specified module to transform this into the specific configuration that is used by the specific underlying service. For Unbound, for example, it is a list of forwarders.

How does the system integration look now? dnsconfd uses the already existing Unbound service that we ship and support, so it respects its defaults, security features, and the configuration that we ship. We inherit the systemd-resolved D-Bus API, so we work as a drop-in replacement as of now. It uses the default system configuration that is provided, and then we watch the underlying DNS cache for changes, so you are not caught off guard by a sudden inability to resolve domain names.

Here is the life cycle of our program that I talked about. dnsconfd itself is implemented as a systemd service, so you can inspect it as you would inspect a normal systemd service, and it is started either on boot, when it is enabled, or when configuration is pushed to it, because it is D-Bus activated and systemd triggers us upon the configuration push. After we start, we start the underlying DNS cache, we check whether it is ready or not, because some polling is needed right now, and we wait for the configuration that is provided by NetworkManager. After that we watch for status changes and perform actions as needed. Here are some memorable issues that we've encountered.
The first one is a war for /etc/resolv.conf, because NetworkManager finds out whether systemd-resolved is running or not by checking the existence of certain symbolic links in the system, and we cannot own them, because they are owned by the systemd-resolved package; and if they are not present on the system, then NetworkManager always tried to override our modifications of resolv.conf. We got around that by implementing a command that pushes lines into NetworkManager's configuration, and we stop it from touching resolv.conf.

We argued about whether it is better to execute the underlying service as a subprocess or as a systemd service. The subprocess approach provides an easier way to monitor whether it is running or not, but then I was persuaded by Petr that the systemd service is better, because we use things that we already have in place. There is the issue of whether Unbound is truly up or not, because the start job has finished but the command channel is not open yet, so we faced some instability during testing; we got around that by polling a few times. And we need to update only the zones that were changed in the configuration, so we hold the current state that is set in Unbound and update only the zones that require it. And we thought that implementing this over D-Bus would be easier than it really proved to be.

We've also created a way of testing this: we are using the TMT test management tool with containers, which allows us to simulate network behavior in a way that verifies the actions of dnsconfd. If you ever want to contribute, this set of tests will verify that you don't change behavior that is already in place, or you will be able to show us where we are wrong and what you want us to change.

Okay, so what is working already?
I admit we wanted much more to present here, but it proved not so simple. Split-DNS configuration from NetworkManager already works; /etc/resolv.conf is changed only while our daemon is running and is restored when it is stopped. Unbound support is the only one we have at this moment, and the implementation uses only the D-Bus interfaces of systemd-resolved, and at this moment also only its D-Bus name, so either dnsconfd or resolved can be running, but not both. And we reused NetworkManager's systemd-resolved DNS plugin for now, because it pushes configuration over D-Bus, but in the future we want to get rid of it and make our own, or use more parameters than just the IP address. That is what we would like to use, unlike the opportunistic way which systemd-resolved used, because these RFCs were not defined at that time, and we think this is the correct way. Support for multiple caches running at the same time is not usually necessary, but it would be very helpful for some kinds of testing.

We would like to have the ability to forward over DNS over HTTPS, but there is a problem: no DNS cache we have in RHEL supports that, and in Fedora there are only a few; similar with DNS over QUIC. And auto-configuration of DNSSEC would be nice; we would like to have a successor to, and a better implementation of, what was once attempted with dnssec-trigger, but maybe better accepted. And maybe, if there is time, sometime in the future, a rewrite into Rust, and reducing the memory required for our interfaces. That would be all from us, so if there are questions, now is the time, and if we can't answer them, please use these mails or file an issue on the project.

Definitely stick around for the next speaker, who will talk about the Rust domain crate. And thanks for the talk. Questions?

Would it be helpful for Unbound to have a D-Bus connection where it says when it's ready? No, I don't think it needs to be a D-Bus connection. I think we need the correct libsystemd notify event, which it kind of supports, but I think the last
time we tried to enable it in Fedora, it started crashing, so it's not built in, but some kind of support is there. We just need support in Unbound to tell us: I'm the service, I think I'm ready. There is a systemd API for that, and we need to use that whenever possible; it doesn't have to be D-Bus.

Another question: if you only want to communicate with local DNS servers, and you want to drop the nss-resolve bridge, how do you want to overcome this? That is part of the question. The second part of the comment is that we talk about D-Bus, but actually it is a series of D-Bus calls, which means we can need name resolution early in boot, before the D-Bus server is up, which is why a private interface is always useful. So we had a plan to add a private interface. The second question was: do we plan to add a resolution interface? No, I don't think we want that. The first question was about the getaddrinfo API: how can you send additional information, for example about multiple interfaces? How can I say from which interface the query comes, or request a query just for selected interfaces? We don't want to, because in what cases is this needed? I think NetworkManager needs that just to verify the connection works. We might have a different service which you can ask: please tell me the address resolved on this interface, and it will send the query just to the correct addresses, because we know which addresses are used for that interface; but that would not be served by the local cache, because that is not yet configured for it.

Might it make more sense to take this separately, after the session? Because it seems quite specific. Yes, yes, it might. Any other questions? No, thank you again.
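To make the configuration flow described earlier concrete — an internal zone-to-servers representation rendered into the underlying cache's native format — here is a sketch of what the Unbound-specific module step might produce. This is illustrative only; the real dnsconfd module is more involved, and the zones and addresses shown are invented.

```python
def to_unbound_config(zones: dict) -> str:
    """Render a zone -> [servers] mapping as Unbound forward-zone stanzas."""
    out = []
    for zone, servers in sorted(zones.items()):
        out.append("forward-zone:")
        out.append(f'    name: "{zone}"')       # "." is the root (default)
        for addr in servers:
            out.append(f"    forward-addr: {addr}")
    return "\n".join(out) + "\n"

# Example: a global default plus one VPN-provided zone.
cfg = to_unbound_config({
    ".": ["192.0.2.53"],
    "corp.example.": ["10.0.0.2", "10.0.0.3"],
})
print(cfg)
```

Keeping the internal representation generic and pushing format-specific rendering into a small module like this is what the speakers describe as making it cheap to add BIND or dnsmasq backends later.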
The first 13 years of blockchain name systems
What do I call you, Neiman? I don't mind, nickname or real name, whatever is easier for you to pronounce, because Eyal is a bit difficult sometimes. Eyal? Yeah, I mean, Eyal, that's the name. I use Neiman because I live abroad and no one can pronounce it. Neiman seems a bit easier sometimes. Unmuted? Right. Your controls are here. You have 30 minutes. Welcome to the DNS developer room. This is our final speaker for the day, Neiman, and he will talk to us about the history of blockchain naming systems.

Okay. Thank you. Thanks, everyone who stayed till the end. I imagine it was great; at least it was for me. I'm going to talk today about the history of blockchain name systems. I'm Neiman, or Eyal. I'm from Israel, but I live in Poland. And, oh, that's fast. I'm a mathematician, but I have worked on peer-to-peer websites for the last few years. If you don't know what that is, don't worry, because the main thing that matters for this talk is that those websites use blockchain name systems. I had a chance to talk with the developers of the main ones, even being engaged a bit in some of them, and that's why I'm giving this talk. These are some projects I did which use blockchain name systems; don't focus on that, because the talk is not about me.

I know that blockchain has a bad connotation, especially, I guess, in this setting. I'm not here to change your mind. I'm here to tell a story. And the story begins in 2001, when a guy called Zooko Wilcox sent a draft to his friends of an article he wrote, and it began with the words: please do not propagate this information widely yet, I'm still working on it. Did they respect it? Absolutely not. It was propagated so hard that by now there is a Wikipedia page on it, called Zooko's Triangle.
Zooko's Triangle basically says that there are three properties a name system can have or not have. One of them is secure: secure means that two people cannot register the same name. The other one is human-meaningful, which basically means you can choose which name you register, out of the ones which are available; and hopefully, because you're human, the name will have some meaning. And the last one is decentralized, which means that in order to register a name or to verify a name, you can do it yourself, without needing someone else, like a third party. And Zooko's Triangle says that any specific name system can have at most two of those properties; you cannot have all three.

Here are some examples. A name system that I guess everyone here knows: DNS. DNS is human-meaningful, for sure. It's also secure. It's not decentralized, by the definition in Zooko's Triangle. Public/private keys: secure, yes. Decentralized, yes; you can generate one yourself, and you can verify someone else's on your own. But it is human-meaningless; most public keys are a monster. And my favorite one, the state ID, which is secure, but otherwise neither decentralized nor human-meaningful, which I think is a shame; I would love to be able to choose my state ID, but, well, states.

Zooko's Triangle was considered to be true for the first decade of this millennium. It was well known within the name systems community: you can only have two, and you shouldn't try to build a system that has all three. 2009, Bitcoin is invented. Shortly afterwards, a year later, in some of the Bitcoin IRC chats, people started to say: hey, can we put names on a blockchain? This continued in the chats; there was the Bitcoin Talk forum. At some point, the legendary Aaron Swartz heard about it.
And he wrote an article, Squaring the Triangle, which basically says: if we put names on a blockchain, we can actually go around Zooko's Triangle and have a name system that has all three properties. You can argue whether a blockchain is really decentralized or not, in the sense that the requirement was that you can register and verify yourself, not register and verify with a blockchain. But for the sake of this talk, we think about the blockchain as a Big Dumb Object. It's a tool; it does what you want. I know it's not. I know that each blockchain has its own pros and cons. I'll be happy to argue about each of them afterwards over a beer, but not right now. Big Dumb Object, by the way: I'm a science fiction fan, and it's a term from science fiction, a subgenre of books that have a big dumb object that does something. That's Ringworld by Larry Niven, a classic science fiction book. I read it as a kid; I hope it's still fun now, but I really loved it.

So, 2011, Namecoin was launched. Namecoin did exactly that: putting names on a blockchain. Here are some interesting trivia details. The names it put on the blockchain were not actually names; it was just some 250 bytes on a chain, so you can put any sequence of 0s and 1s. Whether it's a name or not, or how you interpret it, as ASCII or Unicode or whatever, is up to you; no one verifies anything besides the fact that the same bytes were not put there before. No subdomains, because all you put is bytes; it's just names that you register. They did have something called namespaces. It was in the software layer, not on the blockchain. I want to point it out because the developers were basically promoting two of them. One was d/, which was for domain names for websites. But the other one was id/.
And that's important, because it already shows that the thinking was that those names are not necessarily for domains of computers; they can be used to identify people. The cost was 0.1 NMC, the currency of Namecoin. Adjusting it was very difficult: you can raise it in a soft fork, but to reduce it you need a hard fork. Also, how much it really costs in fiat money depends on the moment you buy it. And this money didn't go to the developers or to finance anything; it was just burnt, because in blockchain economic thinking, burning money is how you make money, as a lecture here a few days ago put it.

These are the last blocks of Namecoin. One transaction basically means just the miner's coinbase, which means no one did anything there. As you see, at the moment I think it's a project which is still being maintained, but not really being used. And there's a question: why did it fail? Or at least, I think it failed; Namecoin people here, I apologize. I think there are two things that they did, maybe, wrong.

First, they really copied Bitcoin's playbook one by one. But names are not money; it's a different animal. You can believe that two coins both have value and it's okay, it's not contradicting: you go to one store that accepts dollars, you pay in dollars; another store, euros, you pay in euros; another one wants Bitcoin, okay, you pay some Bitcoin. It's not contradicting. But no one wants to think that the same object has two names. This is not how it goes. Historically, if I thought some god had one name, and you thought the same god had another name, there's a good chance we'd go to war; we would not accept each other's belief.

The other reason, which is maybe deeper, is that the Namecoin developers had a huge challenge building it. It was the second blockchain. It was the first NFT blockchain. It was the first side chain; they had to invent merged mining.
And also, after it was launched, it was definitely not scalable, and I don't think it's very scalable right now either. They spent lots of their time improving the protocol and handling all those technical details, and they didn't have time to also think: how do I make it useful? What is it good for? And, you know, pushing it to users.

So, 2016, as I said, I entered the blockchain ecosystem. I asked people about Namecoin. I even bought a name, I think, but I'm not 100% sure what I intended it for. The general feeling was that all the good names were squatted, there was nothing to do with it, and names on a blockchain were nice for playing, but not really a useful use case.

In the same year, ENS was announced. ENS is a very different animal from Namecoin, because it is built on top of Ethereum. And if you don't know anything about blockchain, you should know that writing an application on top of Ethereum is much easier than building a blockchain. Which means ENS, which is really well written and a nice engineering feat, was still easier to write than Namecoin. So they actually had time for long discussions about how to get people to use it. And they did two things. One of them: they said names are going to have an auction, so it won't be the fastest person who takes a name, but the one who agrees to pay the most. It's not necessarily the best solution, but at least they tried something. The other thing, which again I see as very crucial, is that they had updates. They could update their system relatively easily, and they were very open about it, because when they launched, on May the 4th, 2017, they called it the ENS Temporary Registrar. And in some of the messages they even said: we are not sure how to do it right, that's why it's temporary; at some point it will be changed, be prepared for it. At the time, May 2017, it was before the DAO hack, so it was not really common in blockchain to say that you are going to change things.
This was still the time of immutable programs, and code-is-law, and such. How did it go? Well, it went the same way as Namecoin: quite successful commercially in the beginning. I think someone put a bid of $2.6 million on the name exchange.eth, so that went quite well. Like Namecoin, the money did not go into the pockets of the developers; instead it was locked, as a deposit, and the moment the name expired you got it back. Which, if you want to fight squatters and speculators, is not necessarily the best idea, because they have nothing to lose.

A year passed, and another blockchain name system was announced: Handshake. I like to say that Handshake took one step backwards and three steps forward; I think that kind of represents it. The step backward was that, while ENS was built on top of a blockchain, which could be very flexible, Handshake said: well, we are actually going to build our own blockchain. Already in 2018, having your own proof-of-work blockchain without updates was outdated, and I remember hearing about it and thinking: okay, that's at least two years too late. But this gave them the ability to do something the other name systems didn't do, and I don't think anyone else does at the moment. Because I said that decentralized means registering a name and verifying a name by yourself; but actually, verifying something on a blockchain is very difficult. In the worst case, you need to have the whole blockchain, which is huge; in the better case, you only need something like 30 gigabytes of a proof, and that's not very practical for a name system. Handshake really made an effort; the whole white paper is about how to have short proofs, of a few kilobytes, that this is the name owner and this is the data attached to it. The other thing that they did is a gift economy.
I know this term from Cory Doctorow's books, but at the time it was very popular among the burners. Handshake is actually the first one that said: we want to replace ICANN, we want to be the new root of DNS. And people were buying it. Namecheap bought a Handshake domain for 750K. There were people participating in auctions, and I checked: people still participate in auctions now, not for these amounts, but it seems to be a thing. There were some other funny stories: SiHab joined Handshake and then left two days later, because they thought they were getting a domain on the blockchain, but actually they got a subdomain under someone who has a domain on the blockchain, so there was nothing decentralized about it; it was a misunderstanding. But besides those things, I don't think there was significant usage of Handshake, definitely not at the time. We'll get back to it towards the end, when we speak about what happens nowadays. But at the time, it was mostly, like the other blockchain name systems, buying and selling.

So things were a bit grim at this point, 2020. But don't worry: new decade, things are going to get happier soon. Before that, one year later, the ENS permanent registrar was launched. They took two years of lessons learned and actually modified things. The first thing they did: auctions out. For the first few weeks, people actually participated in auctions for some specific names like exchange.eth, but by the time I wanted to buy my ENS domain, which was neiman, no one participated in the auction besides me, and it was just an annoying process for the user. So they said: auctions are good for the beginning; afterwards, you don't need them. Which I think makes a lot of sense. The other thing they did: by this point, they were almost broke. I mean, they had started with a million dollars in grants from a few foundations.
I'm not sure, I don't remember, if they got anything else along the way. But time passes, you have to pay people's salaries, and they were almost broke. Their idea was to be a non-profit that lives from donations and grants, but in 2019 blockchain had a winter; no one gave them any money. And then they figured out: well, there is all this locked money, and why do we actually lock it? It's not good for anything. It's not protecting against squatters, because they can try to squat anyway, and if they don't manage, they just get the money back. So they took the next step: the money goes to the ENS organization, which is an NGO, which means it's supposed to be fed into development. And overnight they went from an organization which was almost broke to an organization with millions of dollars.

This was important. I was already developing for ENS before, but it was a side project. When this happened, you start to think, as a developer: well, maybe I should take it more seriously, because now they have money that they have to give to someone; they are an NGO, they are supposed, by their declaration, legally, to give it to the ecosystem. They hadn't given it to anyone yet, but it sits there.

Another thing they did is that they defined, or redefined, what their names are for. They said: this is a Web3 identity, or more specifically, because Web3 is a marketing term and very annoying, an identity to be used in the Ethereum ecosystem. And I think they actually managed to do it quite well. Their director, Brantley Millegan, did, in my opinion, magic. He has an infinite amount of energy. I wrote some message in the ENS forum, and immediately he said: hey, let's set up a call and meet. And he asked: did I want to build more for them? He had ideas. He started doing all those things where he asked people on Twitter to change their display names to their ENS names, to show that people actually use it as an identity.
In conferences, people started to use it. In Ethereum conferences it is their identity, their name, neiman.eth. He was really pushing it well. And I got to see it all from the front seat, because at the time I was working on this project: a search engine for the decentralized web, for ENS plus IPFS websites. So I got to see how every month more and more people got an ENS name. There was more buzz, and people actually used it as an ID. I'm not saying it was a huge thing, but it was a thing. There was a use case for this thing, and before, there was none. But still, when people asked me, hey, are you going to do something professional with it, are you going to build a serious big project or business on top of it, I was saying that I'm not sure, because the root of ENS at the time was held by a multisig of, I think, seven people, which is quite risky. Forget the centralization; it's just quite risky. If I do a project into which I put a lot of effort and investment on top of ENS, and then something is hacked with a multisig of seven people, which is very easy to imagine, then what do I do next? So I was telling everyone. I also told it to the ENS people, I'm not sure if directly or implied, and I'm pretty sure I'm not the only one who mentioned it. And then we reached November 2021, when a very significant thing happened: the ENS DAO was announced. A DAO is a decentralized organization; if you're not from the blockchain ecosystem, it's OK if you don't know it. The idea of a DAO is that the organization is governed by its community. The news went way out of crypto Twitter. I mean, my mom's neighbor, who has nothing to do with blockchain, told me that he bought an ENS name. And I was like, oh, I'm working with it. That's nice. I think that lots of people who are now active in ENS joined at this stage, not because there is money, just because they heard about it. It made an impact. It's a big project that gave control to the community.
It's also a good fit if you want to work on blockchain but you don't want to get into all the protocols, and you're not interested in money: a name system is something which is a bit easier to understand and clearer. The ENS DAO is very active nowadays. I was a member of the ENS DAO for the first year; I was managing a subgroup on decentralized peer-to-peer websites, which is what I did at the time. I don't do it anymore, but I still follow it a bit, and I know lots of people there. It's super active: the forum is active, there are calls every day, there are votings. For good or for bad, it's really an active community. And at some point, I don't remember right now exactly when, they actually transferred the root key ownership to the ENS DAO, which means it is now owned by the DAO. There is one problem, or maybe two. The first one is that ENS voting goes with the ENS token. You can buy the token, which basically means that someone who is rich enough or motivated enough can kind of take over the organization. And if you want it to be critical infrastructure of the internet, that's very risky. If at some point it becomes that, then someone will take over; I mean, if someone can, then they will. The DAO can at some point decide to assign voting by reputation, but at the moment this is the situation. The other thing: while Handshake has short proofs, ENS has no such thing. To verify anything on the Ethereum blockchain, you need quite a long proof. It's not very practical for anyone to do unless it's really your passion, like me, and even then it's super difficult. I don't know what's the technical way to solve it, if any. Right now, everyone compromises on that, and they actually verify things with other services. 2023, which for me is today, because we are at the beginning of 2024. The state at the moment is that once the ENS DAO went live, they had a huge market cap. Even during the crypto winter of the blockchain, they had quite a buzz.
People started to make clubs, like the 10k Club of people who own the names 1.eth to 9999.eth. There was a website for clubs and stuff like that. It made an impact, and as a result, every blockchain now has its own name system, because it's just easy to make, and they see that there are people who will pay for it. I know people who buy a few names in each of those blockchain systems, because normally they are quite cheap, and they are like, well, we don't know which is a good investment. There are articles like "the top 10 blockchain domain name systems". I admit that for a while I was trying to follow that, but I didn't find any that has technical innovation, which is what I care about. And I reached the point of saying, well, if something happens which is technically innovative, someone will tell me. ENS itself at the moment is focusing on a few things. One of them is subdomains. They want subdomains to be kind of like domains: you give someone a subdomain, and then they own it. It's not dependent on who owns the parent name; it's completely independent. For that they have something which is called the Name Wrapper; it was developed for many years and launched last year. CCIP is basically cross-chain interoperability, which means how ENS can communicate with other blockchains. I am not a huge fan of that; I think it centralizes a decentralized technology, but lots of people like it. And they really want to join ICANN. They really want to get control of the .eth TLD, but the problem is that .eth is reserved for Ethiopia. Nick Johnson, the founder of ENS, had a long thread about it recently, like a month or two ago. So if you want to read the details, and where they got to in this discussion with ICANN about getting it or not, you can see it there.
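The subdomain independence discussed here builds on how ENS names are hashed: each name's node is derived recursively from its parent's node, per EIP-137. Here is a sketch of that recursion; the real algorithm uses keccak-256, which Python's standard library does not provide, so SHA-256 stands in. The recursion, not the digests, is the point.

```python
import hashlib

def namehash(name: str) -> bytes:
    """ENS-style recursive namehash (EIP-137 structure), with
    SHA-256 standing in for the real keccak-256."""
    node = b"\x00" * 32  # the root node is 32 zero bytes
    if name:
        # process labels right to left: eth, then neiman, then blog...
        for label in reversed(name.split(".")):
            labelhash = hashlib.sha256(label.encode()).digest()
            node = hashlib.sha256(node + labelhash).digest()
    return node

# A subdomain's node is derived from its parent's node, which is why
# whoever controls "neiman.eth" can create "blog.neiman.eth":
parent = namehash("neiman.eth")
child = hashlib.sha256(parent + hashlib.sha256(b"blog").digest()).digest()
assert child == namehash("blog.neiman.eth")
```

With the real keccak-256 the same structure yields the node identifiers that the ENS registry contract stores on-chain.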
For the other projects: Handshake. I went and checked just before the lecture what's going on there, and I got the feeling that not much has changed since the launch, only that there's less enthusiasm now. People still participate in auctions, with less money. I didn't find any real use case besides that; if anyone knows one and I missed it, let me know. Another story that happened, and I'm going to wrap it up with this, is Unstoppable Domains, another of these blockchain name systems. They tried to patent some names, and now there's a legal battle with ENS. And I thought of maybe speaking about what I think happens in the future, but time is up, so I will not. Thank you, everyone. Thank you. Thank you.
Embedded Security 2023
last year. Hello, everyone. Last year, for the first time, I was talking about errors in embedded development, and I would like to repeat a part of the experience that we had last year. Please think about an embedded project you are working on, or you have been working on recently. Lock it in your memory. No cheating: you locked a project. Now, how many OpenSSL versions are there in that project? Raise your hand if that's zero. Like 10 people. Raise your hand if there's one. Like 20 people. Raise your hand if you are sure there are two or more. Fewer. And raise your hand if you do not know. That's the majority of the room. I think there are a little fewer people who do not know, but still the majority. Why is the question important? You will see later. And a bonus question for the people who knew how many versions of OpenSSL they had: who of you has a full list of dependencies of that project? Okay, around 20 people. Congratulations to you. Now, who is Marta, and why is she talking about such things and asking such intimate questions? I'm a security researcher. And then, what to expect from 2024? Now, let's start with regulations. "Regulations": that plural is a little bit too much here. One regulation, because this is the 25-minute version of the talk. So, the regulation is the CRA. Now, one slight simplification of the CRA. Talk to your lawyers; I am simplifying. The CRA is adding mandatory security requirements to all products that will be put on the market in the European Union, through the requirements of the CE mark. The CE mark, you know it: on all electronics you have the CE mark. It's extending the CE mark to add mandatory security requirements. Examples of the things that are mandatory: no release with known vulnerabilities. SBOMs. Secure configuration by default. Updates by default for all users. And so on and so on; there are two pages of those requirements. In the final version, it doesn't apply to open source projects themselves.
In most cases, it applies to products that integrate open source. All products, in fact. It will require paperwork, mainly risk analysis and a vulnerability management process. And what this paperwork will be, I cannot tell you right now, because it's going to be defined further. As for most of the CE-related things, you have self-assessment by default. But there are certain classes of products that will require more, including an external security audit. That's an expensive thing if you haven't done one. And that's hot news, because we have a final version. It's expected to be voted next month, and from next month there will be three years until the final implementation. Now, the current version excludes non-monetized open source projects. That's a big simplification also. So if you are contributing to an open source project, it doesn't apply to you. But it applies to all integrators, and embedded people are integrating open source in their products. So basically, it applies to the whole embedded world. There will be risk analysis to do for all components that you include, and that's why the question of what you have as components in your project is important. And now the big question for the whole embedded open source community: is everyone going to do this paperwork alone? Or are we going to do the paperwork the open source way, and share the documentation prepared for each single dependency? That's a big question for 2024, for all of us. If you want to know more, if I scared you enough: I've written an article published at LWN last year, so it covers the first version. And for your trip back from FOSDEM, there's a nice read, the final version of the regulation, just 189 pages. But it's not boring. I didn't fall asleep; it's not boring at all. Now, let's go to trends, apart from the regulation. CVE numbers. What is a CVE? A CVE is a way to name vulnerabilities, public ones. It stands for Common Vulnerabilities and Exposures.
And the number of registered public vulnerabilities is growing. In 2023, it went up yet again: we have yet another year of a record-high number of CVEs. I haven't been splitting embedded and non-embedded, but for embedded, it's the same statistics. The number of vulnerabilities is rising in a very important way. Now, a complex problem: funding of security work. In the recent two, three years, and a big part of this process happened in 2023, there are external funds paying for security work in open source projects. Two main examples of that, and I've chosen examples from the embedded field: the OpenSSF Alpha-Omega project, which funded Rust, Python and the Eclipse Foundation, and the Sovereign Tech Fund, which has paid for part of the work on the Yocto Project, and on other projects too, but that's the one in the embedded field. Because of this funding, and because of the pressure of the regulations that are happening not only in Europe (in the US there's also pressure, different, but in the same direction), we are seeing the update of processes in different projects. An example of that: the Yocto Project now has a security team and a working security process. In relation to all that, we also have tools that are either being implemented or being used more and more frequently. For example, SBOM generation, either in the CycloneDX or in the SPDX format, is becoming a more and more common option in embedded projects. Yet another example from our field: SPDX is now generated by default in the Poky reference distribution of the Yocto Project. And similar tooling exists on the dependency-checking and CVE side: you have that in platforms like Dependabot on GitHub, and in standalone tools too. Tools are happening, and the pressure to use them is happening too. And another big question for all of us: all that work requires someone to do it.
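Once an SBOM exists in one of these formats, questions like the one that opened the talk become easy to answer mechanically. A minimal sketch, assuming a hypothetical CycloneDX-style component list (real SBOMs carry far more metadata per component):

```python
import json

# Hypothetical, trimmed-down CycloneDX-style SBOM for illustration.
sbom = json.loads("""{
  "components": [
    {"name": "openssl", "version": "3.0.12"},
    {"name": "openssl", "version": "1.1.1w"},
    {"name": "zlib",    "version": "1.3"}
  ]
}""")

# "How many OpenSSL versions are in your project?" as a set query:
versions = {c["version"] for c in sbom["components"] if c["name"] == "openssl"}
print(len(versions), sorted(versions))  # 2 ['1.1.1w', '3.0.12']
```

The same component list is what a dependency checker walks when matching entries against published CVEs.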
To do the security work, to run the processes, to look at the results of tooling: even if the tools are in the CI, you have to have someone looking at the results. How can we do it long term, and especially, how can we fund it long term? Those external funds may disappear one day. Big question for 2024. Now, for the events, vulnerabilities and incidents: I had to cut things, because I want to have time for questions and it's only 25 minutes. This is what I have chosen for this year. HTTP/2 Rapid Reset, also known as CVE-2023-44487. This one was actually exploited in practice between August and October of last year. It's a vulnerability in HTTP/2 implementations, or a little bit in the specification itself too. HTTP/2 allows parallel streams on the same connection; if a client creates a parallel stream, and immediately after sends a message to close that parallel stream, this generates a high load on the server, because the creation of a stream is pretty expensive. As a result, you get a denial of service. Most HTTP servers were affected, and there was a big number of releases happening in October 2023. What is interesting in the whole story is that the servers aimed more at the embedded market, with careful resource allocation, with limitations on the number of clients or on streams per client, did better: they were less vulnerable to this issue. For example, lighttpd clearly state that they are not vulnerable to that issue. I'm providing a link to the NVD entry for that problem, with dozens of links to different projects with information on what they did, or what they expect users to set as configuration options to prevent such things in the future. And then a little bit of fun. It's either funny or frightening, depending on how you read it. The whole thing happened in 2022, but it was published in 2023, so we can say it belongs to 2023.
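The careful resource accounting that protected the embedded-oriented servers against Rapid Reset can be sketched as a per-connection stream budget. This is an illustrative toy with made-up names, not any server's real implementation; the key design point is that cancelling a stream does not refund the budget, which is exactly what the open-then-immediately-reset attack pattern exploits in naive accounting.

```python
class ConnectionGuard:
    """Toy per-connection limiter in the spirit of Rapid Reset
    mitigations: cap how many streams a client may open, whether
    or not it resets them immediately afterwards."""

    def __init__(self, max_streams: int = 100):
        self.max_streams = max_streams
        self.opened = 0  # total streams ever opened on this connection

    def on_stream_open(self) -> bool:
        """Return False once the budget is spent (drop the connection).
        A reset does NOT decrement `opened`, so open+cancel loops
        exhaust the budget instead of the server's CPU."""
        self.opened += 1
        return self.opened <= self.max_streams

guard = ConnectionGuard(max_streams=5)
results = [guard.on_stream_open() for _ in range(6)]
assert results == [True] * 5 + [False]  # sixth open gets the connection dropped
```

Real servers refine this with time windows and reset-rate thresholds, but the budget idea is the core of why strict resource limits made some implementations immune.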
This was a long story, but in short: some trains in Poland weren't starting after maintenance. The maintenance company brought in a reverse-engineering team, and what they figured out is that there were things like: the train was locking up with a vague error message after staying in one place for a long time, or the train was reporting errors after staying at certain GPS positions, which by coincidence turned out to be the GPS positions of the workshops of the competitors of the manufacturer. Or in some trains there was a lock based on a date. Now, something related to the CRA, but also related to all the things happening on the market. Until now, embedded developers were choosing their dependencies like: well, it does the job, and if the license matches, I can take it. In the future, it may be that the license won't be the only condition. There may also be conditions like: does this project have a security policy, is this project providing regular security updates for five years or more. And there may be the need to do triage in your dependency list, in some surprising places too. On the SBOM side: last year we had SBOMs being generated in more and more places. Generating SBOMs is cool, but it's even more cool to actually use them for something, so I think that's going to happen this year. Then, on the pure vulnerability side, we are still seeing products being developed to sit in an internal network, not connected to the internet, and then someone puts a GSM modem in there. I am expecting a few funny vulnerabilities like that. Then the hardware stories are going to continue, not only chips but also firmware. Have a look at the size of the firmware of your network card, or your graphics card, or your GPU, or your phone chipset. That amount of software means there are bugs. And if there are bugs, there are also likely security bugs.
I expect that, maybe not this year, but sometime in the future, we will have a big issue related to firmware in one of those categories. My personal pick is network cards, to make things funny. Then there may also be issues in places you do not expect them. Quite many open source projects have never issued a CVE before. If they have never issued a CVE, users have a tendency to not update them. But not having a CVE does not mean that there are no bugs; in fact, quite the contrary. I expect that we may have a very serious problem happening in one of those projects nobody has been looking into before. Then everyone will be trying to figure out how many copies of that project they have. To sum up: that is going to be an interesting year. Do you have questions? Thank you for the interesting talk. I have a question about the legislation. Are there different regulations for real security bugs and denial-of-service bugs? If you have some wormable hole in your software which is network-connected, or something which is a denial of service, for me it is a different class. In one case... you probably get my point. There are two parts to the answer to your question. The CRA is not the only regulation that is currently in progress. You know that there are European elections coming; things are being rushed. There is the CRA, but there is also the PLD. There is the regulation related to machinery, there is the regulation related to AI, and all of them have certain things. On a typical vulnerability: if it is exploitable, like in the case of that HTTP/2 Rapid Reset, it is a vulnerability; I classify it as a typical vulnerability. If it were to happen in a network device, that quite probably also enters into other regulations. There may be things that apply in different places, depending on the actual use of the same software. Thank you very much for this talk.
I think this is probably the most important talk to me, as I am a designer and manufacturer of embedded hardware for startups and SMEs, and I am desperately concerned about the situation. The timeline you lay out is scary enough, but you will know that we in the UK have the IoT connected-device law coming into force at the end of April. We have three months to be compliant with this. There is a £10 million penalty, potentially, to us, or a percentage of global revenue. I will say broadly that not one of the startups or SMEs we work with, and indeed ourselves, is in a position to deliver on this stuff, which scares the heck out of me. I would love to know who we need to be talking to, to work together to try to look at this. I haven't shared the scary part about the penalties, but in all cases, you are not able to pay them, so... That is another example: in different places, there are different regulations being brought to light. For me, as an open source community, the only way to solve it is all together, preparing the whole paperwork all together. Otherwise, the big ones will be able to pay for the whole paperwork, but the small ones, well, not really. I think we are out of time, unfortunately. Thank you.
The Small Device C Compiler (SDCC)
So, welcome to the Small Device C Compiler. The talk slots are so short, I'll try to fit in just the basic stuff. I'll start with a quick introduction on what the Small Device C Compiler is, then I'll talk about the architectures we target, and then a little bit about what the future hopefully brings for the Small Device C Compiler. Okay, so SDCC is, as the name says, a C compiler. It tries to support the C standards, in particular ISO C90, C99, C11 and C23. It's nearly always used as a freestanding implementation. The only exception I know of is that FUZIX, an operating system for some 8-bit systems, uses it as part of a hosted implementation. Now, those familiar with the C standard know that in a freestanding implementation you are more restricted, in particular in which features of the standard library you can use. Of course, when your device has no file system, there's no point in having standard library functions for opening, reading or writing files. There are some supporting tools apart from the compiler itself, in particular an assembler, a linker and simulators. The simulators are usually kind of cycle-accurate. We mostly use them for our regression testing internally, but they are also usable by end users who want to run their programs on a simulator rather than on real hardware. It works on many host systems. The most popular would be Linux and Windows, but it works fine on FreeBSD and so on. We target various 8-bit architectures, probably more than any other compiler does, and we have some unusual optimizations that make sense on these targets, where you really have very little memory and where both optimizing for code size and for memory use are very important, and often more important than optimizing for speed. Our user base consists mostly of developers targeting embedded systems. I guess they make up about two-thirds of SDCC users, and the rest are retro gaming and retro computing enthusiasts, because we also support various older 8-bit architectures.
They're similar enough to modern 8-bit microcontrollers that it makes sense to have them all in the same compiler, and many high-level optimizations can be shared. And I believe that the user base in the end benefits from having both these groups represented, because sometimes one group or the other is more eager to try some new feature, which of course helps us find all the bugs in corner cases and iron everything out, while the more conservative users who want to wait longer get it in a more polished state. Our latest release was at the end of January, which is very recent; typically we do one release per year. The project is hosted at SourceForge. We have our issue trackers there. We have mailing lists for communication. We have a version repository, and we use a wiki for some documentation outside the manual. And we have a compile farm for nightly regression testing, which means that every night, on many different host systems, both in terms of operating system and underlying architecture, the latest SDCC from trunk is built and then runs all the regression tests, meaning compiling a lot of tests and running them on the simulators to see if the results are what they should be. There's something between 10,000 and 20,000 tests that are executed that way, and it also incorporates a large part of the GCC test suite. A quick comparison to better-known compilers. We don't see ourselves as a competitor to GCC or LLVM, so the "versus" up there is just for comparison. Now, we specialize in targets that are hard to support in GCC and LLVM. For GCC or LLVM, you typically want some RISC-like architecture: many registers, a uniform instruction set. Then you can use a Chaitin-style register allocator, and that's efficient and everything is nice. The typical 8-bit architecture is not like that. If you want to get into the compiler as a compiler developer, our learning curve tends to be less steep than GCC's.
Our internal interfaces tend to be more stable than LLVM's, which for some people is also a nice feature. Talking about the recent release: our main improvements in the last two years were definitely in standards compliance, in particular ISO C23 support. This was partially funded as a project by the Prototype Fund from the German Ministry of Education and Research. And improvements in optimizations, in particular generalized constant propagation, which allows us to narrow variables. If people use an int as a loop counter, that's typically a waste of memory on an 8-bit target if that loop doesn't really need the 16 bits that an int has on those targets. The work on optimizations was partially funded by NLnet via the NGI0 initiative. We also got two new ports, namely one for the WDC 65C02 and one for the R800. One is a MOS 6502 derivative and the other is a Z80 derivative. Let's get to the ports. The STM8 port is our best one, because we generate really good code for the STM8. It's currently the most advanced port. It has all the bells, whistles and great features, and we do very well compared to the non-free compilers. Unfortunately, this architecture has recently become not recommended for new designs; the manufacturer is trying to move their customers to ARM. But just to illustrate how we do versus three other compilers, which are all non-free, in terms of benchmark scores: we generate essentially the fastest code, except for Whetstone, which is a floating-point benchmark on which we didn't put as much emphasis. And we also generate reasonably small code for all of these benchmarks here. This is with the current release from January versus the current versions of these non-free compilers. Now, our oldest port is for the 8051 and its derivatives. That's an ancient microcontroller architecture that Intel introduced long, long ago and abandoned long, long ago. And there are still many dozens of manufacturers that make compatible devices.
It's a very, very popular, common microcontroller architecture. It's not as nice as the STM8. It was the first supported architecture in SDCC, but in recent years it has fallen a bit behind: new features that got added for other architectures didn't always get added for the 8051. Also, devices made by different manufacturers are often slightly different, in particular in newer features like additional data pointer registers, which are used in different ways. We have support for the HC08 and S08; that's a current microcontroller architecture by NXP. The problem is that there's not really much of a free, open source community around this architecture. There are individual bits here and there where someone wrote some free software for it. But in general, the typical sentiment among developers of S08 programs seems to be: we get the development environment from the manufacturer at no monetary cost, why should we try something else? And sometimes they complain a bit if the manufacturer drops support for an older device. Then there's Padauk, a Taiwanese company that makes billions of microcontrollers each year that are not that expensive. They were not really meant to be programmed in C, but we still managed to support them; at least three of the four subarchitectures that exist we already support. The largest one, the pdk16, is not yet supported. One interesting thing about these is that they have hardware multithreading support, which we currently don't support. What we can do is write a C program, run it on one core, and then the other cores run assembler software. There's Microchip PIC. Those used to be very popular because they were cheap. The ports are currently unmaintained, but we still sometimes get contributions from users with patches; it's not like they're completely abandoned. Maybe sometime a maintainer will step up out of these user contributions. Okay, now we get to the architectures relevant to the retro computing people.
These are a large number of Z80-derived architectures. The SM83 might be known to most people here as the CPU from the Game Boy, even though it's also found in some other Japanese appliances and TV remotes. And then we have the MOS 6502 and its derivatives, which don't even fit on the line anymore. They're found in old embedded systems. Especially the R2K and R3K, those are the Rabbits: they were very early IoT devices, because they are kind of enhanced Z80s with Ethernet or Wi-Fi support on the chip. These architectures are relevant to the retro computing community, which often doesn't use SDCC directly, but instead via downstream projects. They package SDCC together with libraries for certain devices that use these chips, things like video game consoles or historic computer systems. Now, what will the future look like for SDCC? We're definitely facing a problem at the moment, because the STM8, the architecture for which we're doing really great, and those Rabbit devices that I mentioned on the retro computing side, are both not recommended for new designs anymore. Meaning that the architectures where we really do great as a compiler are about to be phased out. We will keep supporting them, probably unlike many of those commercial compilers; I mean, two of the three commercial compilers for the STM8 haven't even seen any update in the last two years. But to stay relevant for current embedded systems, we need to try something else. And basically this is the idea. The main thing is putting the focus on the MCS-51, the 8051, again. It's an ancient architecture. It's not exactly the nicest architecture. But due to the large number of hardware vendors, it's not likely to die any time soon. And looking at the reasons why users choose non-free compilers over SDCC for the 8051, the main reason is definitely that the main non-free compiler for this architecture can optimize better for code size.
So this slide about the future is basically a very rough outline of the plans for the next two years. Generating better code in the MCS-51 port is definitely something that we want to do. We will look a little bit into the STM8, but due to the lack of community behind it, there's probably not that much that can be done. We still try to keep the STM8 up to the other ports feature-wise, even if maybe not optimization-wise and code-generation-wise. For the Padauk things, it would be nice to be able to support the multithreading better and also to support the one remaining subarchitecture. And then there's this F8 thing, which is basically a very early project to maybe come up with our own architecture. I've worked on the compiler for a long, long time, and very often there was a feeling that this could have been done a little bit better in this architecture, or that could have been done a bit better there, and it would have made it a much better target for C compilers. The STM8, for example, is a really good architecture. It has things like stack-pointer-relative addressing modes. That's something that you really want for local variables in C, because you want them on the stack, so you have full reentrancy, C standard compliance, everything. But it has very few registers. The Z80 has more registers, but the stack access is a little bit less efficient, because you have to set up a frame pointer, it goes through index registers and so on. The Padauk things have great multithreading, but they don't have the necessary instructions to support good C standard atomics to communicate between the cores. And out of all those lessons learned from other architectures, the F8 is a project to come up with an architecture that should become, if it succeeds, something for the 8-bit world like RISC-V is for the rest of the world. And I see that the time is up. Questions? Thanks for the talk.
Can you maybe give some hints about the internals of the compiler? The internals of the compiler, okay. We have a classic lex/yacc front-end. Yeah, I just wanted to ask if you are using an intermediate representation, and maybe also about the simulator: since it has to support many architectures, does it use an intermediate representation? I would be curious about that. Okay, so the front-end is a classic lex/yacc parser. We have an abstract syntax tree that gets converted into the iCode, which is basically a three-address code. This then gets annotated with some extra information, such as the register allocation, and then in the individual back-ends this iCode gets transformed into assembler code. The assembler code then goes through a peephole optimizer, and then gets written out for the assembler and linker. The simulators, well, that's not my area of expertise; Daniel Drotos is definitely doing most of the work on that part. They're written in C++. They're using classes and stuff to abstract things away, but I don't think there's any intermediate representation in the simulators, because they need to be fast. We want to run tens of thousands of tests for every architecture that we support every night, so performance is definitely a goal for the simulators. You mentioned code size as one of the areas where SDCC lags behind the proprietary compilers from the vendors. What kind of factor are we talking about, and are you doing regular statistics on the code size of SDCC, in terms of different versions and so on? Yes, we are tracking this over time. We have graphs, and we are not lagging in code size in general compared to other compilers. I mean, we're doing okay for the STM8. Raisonance can generate smaller code, but Raisonance is in every other way the worst compiler for the STM8 around these days. I mean, they don't even support C90, and the code is very slow. It's specifically for the 8051 back-end that Keil generates more compact code.
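The "three-address code" mentioned in the answer above can be illustrated with a toy lowering pass; this is illustrative Python, not SDCC's actual iCode data structures. Each operation gets at most two operands and one result, with fresh temporaries introduced for nested expressions.

```python
import itertools

def lower(expr, code, temps):
    """Recursively lower a nested tuple AST like ('+', 'a', ('*', 'b', 'c'))
    into three-address tuples of the form (result, op, lhs, rhs)."""
    if isinstance(expr, str):   # leaf: a plain variable name
        return expr
    op, lhs, rhs = expr
    l = lower(lhs, code, temps)
    r = lower(rhs, code, temps)
    t = f"t{next(temps)}"       # fresh temporary holds this result
    code.append((t, op, l, r))
    return t

# a + b * c becomes two flat instructions:
code = []
result = lower(('+', 'a', ('*', 'b', 'c')), code, itertools.count())
assert code == [('t0', '*', 'b', 'c'), ('t1', '+', 'a', 't0')]
assert result == 't1'
```

A flat list like this is what later passes (register allocation, the per-backend code generators) can annotate and walk instruction by instruction.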
I need to just preface my question by saying that I have only experienced SDCC through downstream projects, and I began actually using it in great part thanks to your talk a couple of years ago. But I have noticed that the compilation step takes a lot longer than with other compilers. I suppose it's optimizing and evaluating. Why so? And what would help it: a faster disk, more RAM, a faster processor? What would help the compilation time a bit? This depends on the backend. Most backends use what we call the new register allocator, which definitely was the key to being able to compete this well with other compilers in generating faster code, and also to being competitive in code size. The 8051 does not yet, but for the Z80, this register allocator is used. It has a parameter, --max-allocs-per-node, that you can set to tell the register allocator how many different possibilities to consider at each node of an internal representation. The default value is 3000. If you set it lower, you get less optimization, lower RAM usage and faster compilation; but there are people that set the thing to a million and let their program, which in the end fits into 8 kilobytes, compile for half an hour, because they really want it optimized as well as possible. So yes, most of the compilation time is spent in the register allocator and the peephole optimizer, and for the ports that have the new register allocator, definitely the register allocator, typically more than the peephole optimizer. And one interesting thing is this can become provably optimal. If you also add --fverbose-asm, you get comments in the assembler output that tell you if the register allocator found a provably optimal assignment. Per function. Okay, I think that's all we have time for for the questions. So I just wanted to say thank you very much for the fascinating talk.
Vehicle Abstraction in Automotive Grade Linux with Eclipse Kuksa
All right. While the last people join the room, let me ask a few questions to get an idea of the audience that we have here. So, quick show of hands: who of you knows AGL, Automotive Grade Linux? That's quite a lot. Awesome. Another question: who of you knows Kuksa? Okay, let's change that, because there were fewer hands than for AGL. But I think that's a good thing. Last and final question: who is here still from the beer talk? Okay, I'm glad we actually drew people out of these talks. So, as you can already see on the introduction slide, we will talk about vehicle abstraction. We talk about Automotive Grade Linux and we talk about Kuksa. But before that, maybe a bit of context: who am I? So, I'm not the super automotive developer who has been doing CAN and AUTOSAR for the last 20 years of my career, also due to my age. I really come from the cloud side; I used to work on different projects on GitHub. And I thought, how can we actually make application development for vehicles more fun and efficient? And one really large, essential challenge here is that there are no standardized signals. You can develop an app for one car and it won't run on another vehicle, maybe even one from the same vendor. So, what we often see in the industry is this kind of high end-to-end complexity: every application is developed for one specific model, one specific car, and we have a huge pain point there, because you cannot port your applications. You cannot scale; if a developer is developing an app for one brand, it won't work on another brand, and maintenance is also just a nightmare, because you build it for one car and then you completely forget it. So, as always in computer science, one solution to that is abstraction. That's why we put a lot of effort into the topic of vehicle abstraction here. So, how can we make a world like this happen?
So, a world where we have tons of applications that are developed against the same API, against the same data model, and that just work on different cars. Well, different models, different brands and so on; I'm talking a bit too much about cars. So, basically, how do we get to a world where you write an application once and run it everywhere, and also attract third-party developers? Because this is how you grow the ecosystem, make it more attractive to develop, and realize synergies. So, for this abstraction, I would say we basically need two things. One is a data model to operate on, and the other is the APIs to interact with that data model. Coming to the first thing, here we go. When it comes to the data model, or you might also call it a taxonomy, we decided on the COVESA Vehicle Signal Specification. It's done at an organization called COVESA, formerly known as GENIVI; maybe that rings a bell for some. And what it basically does is define a tree structure for all kinds of data that might be available in the vehicle. So, for instance, to get the tire pressure, you follow the branch of vehicle, chassis, axle, row one, wheel, tire, and then you get to the pressure signal. The same way you have sensor values in here, you can also have actuator values. So, for instance, when we have a seat position, we could just change this value of the seat position, and eventually the seat in the car would move to that position. That's the idea of this whole data model. If you want to play a bit with that, there's also a really cool website called digital.auto that makes nice visualizations of it and also shows some example applications of how you interact with VSS. Okay, that was the first piece. How about the second? And this is where Kuksa, or more specifically KUKSA.val, comes into play.
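The tree-walk described above ("vehicle, chassis, axle, row one, wheel, tire, pressure") can be sketched with a Python dict standing in for a tiny slice of the VSS tree. The real specification is expressed in YAML/JSON with far more branches and metadata; the exact leaf metadata shown here is illustrative.

```python
# A tiny, hypothetical slice of a VSS-style tree; the real VSS is a
# YAML/JSON specification with many more branches and attributes.
vss = {
    "Vehicle": {
        "Chassis": {
            "Axle": {
                "Row1": {
                    "Wheel": {
                        "Left": {
                            "Tire": {
                                # leaf node: a sensor with a unit
                                "Pressure": {"type": "sensor",
                                             "datatype": "uint16",
                                             "unit": "kPa"},
                            }
                        }
                    }
                }
            }
        }
    }
}

def resolve(tree, path):
    """Walk a dotted VSS-style path down the branch tree to a node."""
    node = tree
    for part in path.split("."):
        node = node[part]
    return node

leaf = resolve(vss, "Vehicle.Chassis.Axle.Row1.Wheel.Left.Tire.Pressure")
```

Sensor and actuator signals live in the same tree; an actuator leaf would simply carry `"type": "actuator"` and be writable, as in the seat-position example from the talk.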
So, VAL in this case stands for vehicle abstraction layer; we talk about abstraction, and the idea is to have Kuksa running on the vehicle computer, some kind of computer which might run Linux or something similar. And we also assume this is the place where we decouple the hardware from the software in the vehicle. So, the underlying assumption is what you can see on the left: we have a lot of deeply embedded layers, CAN, AUTOSAR, LIN, SOME/IP, whatever you like or maybe don't like, which is maybe really proprietary in some cases, and the signals and the bits are really specific to the car. So, then people would write something that we call a provider, or also a feeder, to translate between these really specific embedded systems and VSS, using the Kuksa API. This is where the API comes in, because here we use Kuksa. If you like it more on the abstract side, we can also say that in the deeply embedded layers we mostly have raw data, really the ones and zeros, the bits, and we need to interpret those. So, we translate it to VSS, get some information out of that, and then, by combining this information in different applications, we actually create knowledge. And here Kuksa is a nice building block for that. So, what is Kuksa in general? Since we are at an open source conference, obviously it is open source, fully licensed under the Apache 2.0 license, and, as I just mentioned on the previous slide, it is some kind of digital twin based on VSS. So, it holds the current and the target values of your vehicle's signals. I don't want to go into the definition of digital twins, but I guess you get what I am getting at here. So, you not only have the current value, which is quite nice, but you also have the target value. Coming back to our seat example: when you, as an application, would change the current value of a seat, this doesn't mean the seat is actually where I want it to be.
So, I actually set the target value, and then it is up to the deeply embedded layers, so the actual vehicle, to move the position of the seat over time. That is why you can change both values, and hopefully, at some point, the current value will match the target value, because that is the whole idea. So much about the concepts. Let's get to the code. Or, well, I won't show code here, but rather what it is actually written in. So, we wrote this in Rust. If you statically compile it, it is less than 4 megabytes; large or small depending on which world you are coming from, I guess. If you come from the cloud world, it is small; from the automotive world, it may be large to you. And it is quite language agnostic, because the interaction with it happens through a gRPC interface with some basic functions like get, set and subscribe, and there are also a number of client libraries using this. And with that, those are actually the basics of Kuksa, and I have to be honest with you: if you were in this devroom last year, you would say, where is the news? Because this has been shown there as well. So, let's get to the news. What has happened in the previous year? First and foremost, it is usable in AGL, so Scott will talk a lot about that in the next minutes. But we also have some other news. For instance, we now have a Kuksa Android SDK, we have a mock service, and we also did some work with Eclipse Leda from our side. The Kuksa Android SDK, I mean, it is kind of straightforward: it is an SDK that is now available on Maven Central, and you can interact with the data broker from an Android application, be it Android Automotive or maybe your own app on your smartphone. So, assuming you have some kind of Kuksa abstraction in your vehicle, you can use a companion app, for instance, which we are about to release to the F-Droid store. So the release should show up there any moment now.
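The current-versus-target split and the get/set/subscribe surface described above can be sketched with a toy in-memory broker. This is an illustration of the concept only, not the real KUKSA.val databroker, which speaks gRPC and validates writes against a VSS model; the class, method names and signal path below are all made up for the sketch.

```python
class ToyDataBroker:
    """Toy stand-in for a VSS data broker: each signal path has a
    current value (what the vehicle reports) and a target value
    (what an application asked for)."""

    def __init__(self):
        self._current = {}
        self._target = {}
        self._subscribers = {}  # path -> list of callbacks

    def get_current(self, path):
        return self._current.get(path)

    def get_target(self, path):
        return self._target.get(path)

    def set_target(self, path, value):
        # Applications set the *target*; the vehicle side is
        # responsible for eventually making current match target.
        self._target[path] = value
        for cb in self._subscribers.get(path, []):
            cb(path, value)

    def feed_current(self, path, value):
        # A provider/feeder reports what the vehicle actually did.
        self._current[path] = value

    def subscribe(self, path, callback):
        self._subscribers.setdefault(path, []).append(callback)

broker = ToyDataBroker()
path = "Vehicle.Cabin.Seat.Row1.Pos1.Position"  # illustrative path
seen = []
broker.subscribe(path, lambda p, v: seen.append(v))
broker.set_target(path, 1000)    # app requests a seat position
broker.feed_current(path, 1000)  # the vehicle eventually gets there
```

The seat example from the talk maps directly onto this: setting the target does not move the seat; the embedded side closes the gap over time by feeding current values back.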
We submitted the request at the beginning of the week, but we are still waiting for F-Droid to actually show this app in their repository. So, stay with me till Monday, then it might be there, hopefully. Another thing is the mock service. The folks in the previous presentation had their robot here; we cannot always have a car in our lab to test an application, but we kind of depend on the behavior of the vehicle. So, we need a way to mock this. So, the community came up with a behavior definition. For instance, whenever the target value of a seat signal is changed to a certain value, like 1000, then the current value should also change to that value. And this is what you can basically mock or emulate with the mock service. To show you just an example: here, whenever the driver's seat position target changes, we create an animation to move the current value to that position, which makes it quite easy and flexible to test whatever you desire with your car. And last but not least, this is just a sneak preview into the lab. Kuksa is part of the larger community in the Eclipse Foundation. There's an Eclipse Software Defined Vehicle working group, or, for short, Eclipse SDV. And there's another distribution called Eclipse Leda, which tries to combine some of the major pieces of the ecosystem there. And what we managed to do is actually run the Leda Yocto layer on top of AGL, so that you get these pieces, especially Kuksa, but also some other projects like Kanto, running on the AGL stack. And I think this is a really good opportunity to learn a bit more about AGL here. Oh, okay. I'll take over then. All right. Thank you, Sven Erik. So, I have done a lot of stuff around AGL, so people might recognize me. I'm Scott Murray. I've done Linux for a long time, and I've been doing embedded Linux for a reasonably long time as well.
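The mocked behavior described above, animating the current value toward a new target, can be sketched like this. It is illustrative only; the real Kuksa mock service reads declarative behavior definitions rather than a hand-written generator, and the step size here is invented.

```python
def animate_to_target(current, target, step=100):
    """Yield intermediate current values moving toward the target,
    emulating a seat that physically takes time to move."""
    while current != target:
        if abs(target - current) <= step:
            current = target
        else:
            current += step if target > current else -step
        yield current

# Target jumps to 1000; the mocked "seat" moves there gradually.
frames = list(animate_to_target(0, 1000, step=250))
```

Each yielded frame would be fed back into the broker as a current-value update, so a subscribed application sees the same gradual movement it would see against a real vehicle.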
I've been working on AGL on contract for pretty much eight years at this point, doing all kinds of different things for the project around keeping the Yocto stuff up to date, and also doing a lot of the demo and integration type of things. So, maybe almost half of the people indicated that they knew what AGL was, but I'll do a very quick run-through. It's a collaborative open source project, basically trying to build a base platform that you can build an automotive product on. It's about 10 years old. We have a vast array of members now: a lot of the major OEMs, and tier one and tier two suppliers. It's pretty much a code-first sort of thing, where we are more focused on: let's build the distro and get it out there for people to try and get involved. A lot of work went into that. You might have seen AGL demos for several years doing that type of stuff, but our members were basically saying in 2020 that they weren't interested in maintaining that, because they weren't going to use it in product. They all have their own application frameworks, or they buy an application framework, and they'd like to see AGL focus on the lower level: show us how to use open source, more than writing new stuff. So our tech demos, or integration demos, are more like taking best-of-breed open source projects and showing people in automotive: here's how you use these things. And this really worked out well. We needed something to show here's how you will do vehicle signaling, and VSS and Kuksa.val were basically starting to come out around the same time that we needed a new thing. So I had started playing with Kuksa.val in 2021. Our first release with it was basically our spring release in 2022, and it replaced our old signal composer and our CAN service with basically the original Kuksa.val server. And so since then, basically since spring 2022, we have had recipes in our layers for AGL to build the Kuksa.val server, and now the data broker.
And as well, we actually have some signal customization stuff as an example of here's how you add some custom signals. And we use their CAN feeder to basically wire everything up and show here's how you put all these pieces together. We have our own sort of mocked-up AGL virtual car CAN definitions, and that acts as an example for people to use. So that was spring 2022, like I said, and I won't go into all the nitty-gritty there. But originally, we were using the original WebSocket API, which is a sort of companion standard to VSS. We actually had CAN working in our demos. And through 2022 and into 2023, we were keeping up with the Kuksa.val releases. I started with some nominal updates, switching how we were doing our signal additions and stuff. And then this past summer, in our Pike release, I started the process of switching over to the data broker, which is the Rust-based implementation. And that actually got interesting, because we're based on Yocto Kirkstone, which is the LTS release, which at this point is two years old, and it has an older Rust. So we couldn't actually build the data broker. And so at AGL we contributed upstream: I have a layer that you can get for Yocto Kirkstone, like a mixin, that basically gives you a newer Rust to be able to build the data broker, which other people I know are now using for building other Rust projects. So we're now using the data broker, and in this upcoming release we are on the absolutely latest version of Kuksa. I now fully have us switched over: everything is the data broker using gRPC, and all our demos are converted. And that basically acts as an example. We're trying to seed this with the automotive community, because we see a lot of vendor code, and people assume open source is all custom IPC and stuff like that.
And it's like: well, no, there are open source projects that are heavily used that do gRPC and interact with cloud providers and so on; you don't have to reinvent the wheel. So Kuksa.val has been a very good thing for us to try and get that across to people. So how exactly are we using it in AGL? So there are the VSS applications. As Sven Erik mentioned, there's the concept of actuators. So there are apps that basically just listen to sensors, like dashboard types of things. And then for acting on signals, so basically implementing actuator behavior, we have some example services that do that kind of thing, like HVAC sort of stuff. There's also setting an actuator value; on a user-facing infotainment app that would be things like HVAC controls, or audio or volume, that type of stuff. So in our tree right now, we have two demo services that basically do that actuator side of things. We have an HVAC service that listens to all the signals in the VSS hierarchy around HVAC controls, and in our demo setup, which unfortunately we won't have the full version of here, it actually pushes out to drive some fans and things like that. On the audio side, I'm basically listening to the audio volume signal that's in VSS, and we have some custom things that I'm working to push upstream, but it basically drives that down into WirePlumber and actually adjusts the audio setup. On the user-facing side are the demo applications: the Qt demo, which I think we might be showing tomorrow, is basically using the VSS signals for pretty much everything. So all the applications in that demo, which are in our source tree, you can grab them, are all wired up to do VSS signaling. And the code is now in a nice little library that basically allows you to reuse it.
Our newer Flutter demo, which, actually, I think we'll have one setup showing tomorrow: it has a unified sort of home screen, and it's doing gRPC from Dart. Right now I don't have that library packaged up yet, but that might happen this year. Or we might move it to native code; Toyota, who are big into Flutter, tell us that's what they do for some of their stuff. So this is what our newer Flutter demo looks like. And in this demo, the tire pressure, the vehicle speed and stuff like that, and the AC controls and the temperature, all of that is going through VSS signaling, driving daemons or whatever you want to do. Or CAN data coming in actually gets converted back into a signal update. So there are some extra presentations from Sven Erik and myself. We're going to be in the AW building tomorrow; we'll have a table there with our demos. And this is... do you want to do your pitch? Sure. So if this sounds interesting, or even if it doesn't, there's a huge chance to engage with the community around Kuksa and the larger communities in the automotive sector. We have something called Bosch Connected Experience. It's hosted by Bosch, but it's basically a very large hackathon in Berlin at the end of February. So a bit short notice, but I would be really glad to see some of you there. There you have the chance to work with a lot of things, like maybe actual seats, maybe actual cars, hopefully. And we also plan to have a miniature simulation of a car which is then connected to a data broker. So I think it will also be cool to see what you can do combining this physical and this cyber-physical world, if you will. So I really encourage you to do that. If you want to come, you normally have to apply.
But if you just approach me, I think we'll find a quick way to get you in, because being in this room, I think, qualifies you as a good hacker for that. So maybe we'll see you there, or at another community meeting. So thanks a lot for staying with us, and we are open for questions. Yeah, I think we have a couple of minutes. Yeah, we'll have to share. Thank you. Great talk. I just wanted to understand a little bit about your testing cycle. So if you're developing something with this, you test it in a virtual environment, and then you want to test it on a real car; what do you do in practice when you're developing stuff? Do you have an answer to that? So I wouldn't have a straight answer, because here we talk more about implementing that abstraction layer and mostly testing it against things like this mock service, or with something like a feeder where we have recorded data. But what you're touching on is a really general topic: how do I actually get my automotive software up and running and into the vehicle? That's a bit beyond the scope of what just the Kuksa project is doing. So there's not too much I can comment on here, but I think it's a good topic for the communities, either AGL or Eclipse SDV, because we have some rounds of meetings where we talk about exactly that. Yeah, I would just say that it's still actually pretty early days for VSS. I mean, I know there are a bunch of OEMs and tier ones that are actively working to productize. So I don't think we have visibility yet into how they're actually going about testing. Hopefully in the next year or two we'll see more and maybe get some ideas there. Any more questions? Maybe in two or three words, can you share a little bit about the data broker? Is it something that looks like D-Bus? Is it something that looks like an MQTT broker? Something else? What does it look like exactly? Is it something that we can reuse elsewhere, or is it specific to Kuksa?
I would say the data broker is really specific to VSS data. It's not like you can put any data in there. The way it works is you start the data broker and you also give it the VSS data model that you have. The VSS data model is expressed in a JSON or YAML file. You feed this JSON or YAML file into the data broker, and then you can basically do get, set and subscribe. That's why I put up this slide again: this kind of data is expressed in the data model, and the data broker implicitly knows about it. When you talk about MQTT: there are also, I have to admit, other APIs to interact with VSS. For instance, VISS, done at the W3C, and they also looked a bit into how to do that over MQTT. But again, the data broker is especially tailored to interact with VSS signals; that's why I cannot generalize it too much. Basically, when I go home, I have a project that our vehicle-to-cloud expert group in AGL wants to see: basically pushing from VSS up into the cloud. So I'm going to be building a proxy that will take a list of signals to listen to from the VSS data broker, or Kuksa data broker, and then basically MQTT them up somewhere. So talk to us next year; I'll have a story for you then. Maybe one final thing to add to that. There's one slide, I actually removed it from the slide deck, but there has been a huge discussion in the VSS community whether VSS is actually fit for use in the vehicle, or whether you should use VSS more on the cloud backend, so that you push all the data from the car, in whatever form, up to the cloud and then consume it as VSS there. And the data broker is kind of an answer: yes, it's also possible to do it in the car, in addition to the cloud. So that's kind of the background story as well. Okay, so I think that's all we have time for for the moment. So thank you very much, Sven Erik and Scott, and a round of applause.
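The proxy idea mentioned in the answer, subscribing to a list of VSS signals and republishing each update to the cloud, could be sketched roughly like this. A stand-in publish function replaces a real MQTT client, and the class, topic scheme and wiring are all illustrative, not the actual AGL project.

```python
class ToyCloudProxy:
    """Toy sketch: listen to a list of VSS signal paths on a broker
    and republish each update to a topic derived from the path."""

    def __init__(self, broker_subscribe, publish, paths):
        # broker_subscribe(path, callback) registers for updates;
        # publish(topic, payload) stands in for an MQTT client call.
        self.publish = publish
        for path in paths:
            broker_subscribe(path, self._on_update)

    def _on_update(self, path, value):
        # MQTT topics conventionally use '/' separators.
        topic = "vss/" + path.replace(".", "/")
        self.publish(topic, value)

# Wire it to any subscribe-style signal source:
published = []
subs = {}
def subscribe(path, cb):
    subs.setdefault(path, []).append(cb)
def notify(path, value):
    for cb in subs.get(path, []):
        cb(path, value)

ToyCloudProxy(subscribe, lambda t, p: published.append((t, p)),
              ["Vehicle.Speed"])
notify("Vehicle.Speed", 42)
```

The point of the design is that the proxy never interprets the signals; it only maps the VSS path namespace onto a topic namespace and forwards values.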
An open-source, open-hardware offline finding system
Hello. So this is our talk about Spotnuts. It's a Teckids tinkering project. So first, who we are. I am Pingu. I am 14 years old. I'm a member of the Teckids community. I began hacking like four years ago or something like that. I'm interested in Python, home automation and stuff, and obviously penguins. And I also work on the AlekSIS project. And my name is Nik, or Dominik if you like longer names. I am more or less the founder of the Teckids community, about which Pingu will say a few words right after my introduction. And here I'm working at the intersection between education and free software. That means I'm showing young people what free software is, what the values around free software are, and also helping develop and promote free software for educational institutions. And in my day job, I mostly spend my time as a trainer for Linux administration, PostgreSQL, Rust and Python related topics. Yes, we mentioned Teckids. It's a community based in Germany. Our goal is to create a comprehensible technical world for and with children, and to empower young people to question things and hack and build stuff, like this project or the AlekSIS project. So here you can see where we were. This is an AlekSIS meeting; this was, I think, at FrOSCon, the second largest conference in Germany. Here on the left side is our summer camp, where the children come and learn something; I think here they are soldering things together and then programming them. So now, what is an offline finding system? Basically, you attach something, like a small tag, to something like your backpack; then you lose it, and then you open some app on your smartphone or on your laptop, and then you can find it, or search for it, or don't find it. And now the more technical view of offline finding. So, the tag sends a signal via Bluetooth, because it's offline; there isn't a connection between the tag and the internet.
Then an app, like a helper app, on someone's phone receives this Bluetooth signal and then says, hey, I found this tag there. And then I, as the owner, can go on my smartphone, search for the tag, and my phone searches the database for the tag. So, how we got into offline finding. My scooter, the scooter I use to get around the city, got stolen, and I had a Samsung SmartTag, an offline finding tag, attached to it. And then we drove to the approximate location, and then, with the feature that we can send a signal to the tag and the tag responds "I'm here", we could see where the tag was, basically where the signal was. And then we did trilateration: we approached it from multiple sides, and then there was a signal at one point, and we got the scooter back. And there's also our sketchy boss; he always loses stuff and wants to get it back or find it. So offline finding basically has three components. There are the tracking tokens, the small devices that you attach to the things that you want to find; they aren't connected to the internet, because then it wouldn't be offline. Then there are the smartphones, or some small helper devices; they get the signal from the tag and then send it to the internet. And then there's obviously a server where the messages, like "I'm here, and here is the backpack", are sent, and then I can get them back from there. So there are obviously some challenges. Some are privacy related: a stranger must not be able to abuse the beacon for tracking over the long term, and they should not be able to identify the owners, because then I could know where some people's stuff is. And the backend, the server, shouldn't be able to identify the owners either, because then I, as the operator of the server, could identify the owners. But some challenges are also technical, like encryption without knowing the receiver, and Bluetooth, because of the range.
And then, also because of Bluetooth, the energy efficiency. At one point we tried it out on an ESP, to see how long it would last. I think we did it with SHA-256 hashing, and it lasted for a couple of hours. Because the battery is small, and I think a couple of hours aren't enough for a tracking device. Yeah, design overview. All right. Thank you. So after we somehow got hooked by this topic around offline finding and how it works, of course we wanted to try how far we could get building such a system. Of course, somewhat motivated by our grumpy, sorry, I mean sketchy, sketchy boss, who asked: hey, is there some system like this based on open hardware, open source? I'm not so very excited about Apple controlling where I lose, and find, and rediscover my stuff. So the first thing we did was look at how the Samsung SmartTag system worked, which is the sort of tag that Pingu had attached to the scooter. And we found out that it sends these strange beacons of some sort using Bluetooth Low Energy; I will come back to that in a minute. And in the course of time, while we looked at how this works, it more or less became obvious that this sort of system is actually an end-to-end encrypted mailbox system, because there is an owner device, and this has a public key, and, well, what can you do with a public key: you can receive some sort of messages. And there are helper devices that can see these beacons and more or less just send any sort of message back to the owner device. So if I lose something, as the owner, and let's say Pingu wants to help me find it, then they walk around in the city, and their smartphone receives the beacon signal, and now they somehow need to get the information back to me, telling me where they saw my beacon. And that's where these tags come in, and they are about as dumb as you can imagine: they just send out a public key, which is all the information you need to somehow get the location sent back to me.
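The mailbox idea described here can be sketched with a toy in-memory server. The encryption is deliberately stubbed out; in a real system each report would be encrypted to the broadcast public key, so the server only ever sees opaque blobs. All names and values below are invented for the sketch.

```python
class ToyMailboxServer:
    """Toy mailbox: stores opaque messages addressed to public keys,
    without knowing who owns which key."""

    def __init__(self):
        self._mail = {}  # public key -> list of opaque blobs

    def submit(self, pubkey, blob):
        # Endpoint 1: a helper drops off an (encrypted) report.
        self._mail.setdefault(pubkey, []).append(blob)

    def fetch(self, pubkey):
        # Endpoint 2: anyone may ask for messages to any key; without
        # the matching private key the blobs are useless to them.
        return list(self._mail.get(pubkey, []))

server = ToyMailboxServer()

# A helper sees a beacon broadcasting a public key and reports a
# sighting; encrypt_to() is a stand-in for real public-key crypto.
def encrypt_to(pubkey, plaintext):
    return ("encrypted-for:" + pubkey, plaintext)  # stub only!

beacon_pubkey = "pk-epoch-42"  # illustrative value
server.submit(beacon_pubkey,
              encrypt_to(beacon_pubkey, "lat=50.8,lon=4.4"))

# The owner later polls the keys their tag could have broadcast.
reports = server.fetch(beacon_pubkey)
```

Note the privacy-relevant property: the server answers queries for any key without authentication, because, as the talk explains later, the responses are worthless without the matching private half.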
It's more or less an accident that these messages carry location information; we could just as well put anything in there. If any of you are into this sort of system: Apple had a few vulnerabilities discovered in their implementation. One of the most interesting ones in recent weeks was that people actually used the beacons themselves to transport keylogger information out of otherwise air-gapped environments. I think your favorite search engine, or the search engine you distrust least, will bring up some really interesting information about this. So what we really want to build is a mailbox system and some sort of key management system, because that's the really interesting part as far as I'm concerned: how we solve these privacy issues and some of the technical issues with cryptography. So this is the big picture. If this works, I can zoom around in this a bit, and now it shows that I should have used the headset. Can I do it with one hand? Yes, I can. So here's the big picture, and what you can see here is: all the red circles show secret keys that I use in the system, and the green circles show public keys that I use in the system. Let's get a short overview of how this works. We have the owner device, and we give the owner device a sort of main key. This identifies the owner device, and the easiest thing we could do now is make this a Bluetooth beacon and simply copy the public key of the owner onto that beacon and attach it to some bag or scooter or some plush squirrel or whatever you don't want to lose. So at this point we are more or less done with the mailbox part and with the encryption part, but we would get into all the privacy troubles, because what you now can do is follow the tag around. It always broadcasts the same public key information. You can just walk around the city and always rediscover where one person is moving, and make a nice motion profile of this person.
Also, you could discover several tokens that are linked to the same owner device, and learn that all these tokens belong to the same owner. These are two of the most inherent privacy mistakes that you obviously don't want to make when designing such a system. So the next thing we do is derive, using hash-based key derivation, one key pair for each token, so that we can unlink the tokens from each other. And for the rest of the system: I think many of you will have heard the term "ratchet algorithm", and the rest of the system is more or less very close to what, for example, the Signal messenger does with its cryptography. We transfer this device key pair to the tag, and now we do one key derivation every, let's say, 15 minutes; at least, that's what Apple does. And the interesting part here, because I had never worked with cryptography on this level before, is that now we can derive new key pairs on the tag, and it will send out another elliptic curve public key every 15 minutes. So we fix the privacy issue of following someone around. Now you can follow someone for 15 minutes, and after 15 minutes you see another beacon, and you cannot distinguish whether this is the same tag, which rotated its key pair, or some other tag of another person. Yeah, that's more or less the main secret of the system. And then, if I find a tag, I can send a message to the public key it is currently broadcasting. There are some other things mixed in here, but I don't want to go into too much detail about this part right now. And the second secret is that, when I try to retrieve my location information, all the messages that others sent to me, I just ask the server for all the information sent to all the public keys I know my tag will have generated within the time frame. And this request can also be encrypted, because we also use another set of keys, so that the server cannot find out that all these keys are linked to my device.
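The rotation scheme described, a hash-based ratchet that both the tag and the owner can advance independently, can be sketched with stdlib HMAC. This is a simplified symmetric hash chain for illustration only; the real design derives elliptic-curve key pairs (which this sketch does not do), and the labels and root value are invented.

```python
import hmac
import hashlib

def ratchet_step(chain_key: bytes):
    """One ratchet step, Signal-style: derive an output key for this
    epoch plus a fresh chain key, so past keys cannot be recovered
    from the current state."""
    output_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return output_key, next_chain

def keys_for_window(root: bytes, epochs: int):
    """Recompute the beacon keys for a window of epochs, as the owner
    does when asking the mailbox server for their messages."""
    keys, chain = [], root
    for _ in range(epochs):
        key, chain = ratchet_step(chain)
        keys.append(key)
    return keys

# Tag and owner share the same per-token root, so both derive the
# same sequence; an outside observer who sees one epoch's key cannot
# link it to the next epoch's key.
root = b"per-token secret from the owner device"  # illustrative
window = keys_for_window(root, 4)  # e.g. the last hour at 15 min/epoch
```

The "query the server for the whole window" step from the talk then amounts to fetching messages for each key in `window`.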
They should have zero knowledge about the ownership relation between the tags and the owners. Okay, our experiments are implemented in Rust. We have split it into the Spotnuts crates. Hazel OS is what is supposed to be running on the tags, and the helper device is a Rust-based mobile app. And in case you happen to need, or happen to find the time to review, an implementation of Signal's XEdDSA in Rust, we also factored out that crate, so you can tell us what obvious mistakes we made in the cryptography there, if you like. And the JG crates are a general implementation of this mailbox system, which can be used for the offline finding system, but actually for anything that is supposed to carry public key information to someone and allow them to anonymously send back some sort of information. So what do we have? We have this implementation of the general JG key exchange and mailbox system, with a library usable as an alpha version, and a small server implementation that actually does not care whether it is used for offline finding or any other purpose. And we have an experimental version of Hazel OS for ESP32, with the limitation that Pingu already mentioned: we get the ESP32 development board to run for something like five hours. So how long did it take to get your scooter back? Did you manage to do it in five hours? I don't think so. Okay, you have to be quicker next time. Best case, we can either fix the technical issue or you can start a running career, whichever is easier. Okay, so the next thing we want to do is find a decent microcontroller. I happened to give a Rust training last week, and one attendee told me: this ESP32, this has nothing to do with microcontrollers, this is a toy; get a more hardcore microcontroller. And I think that is what we will try. And for Hazel OS, we need to build an experimental companion app.
Maybe design a nice PCB, so you don't have to attach a breadboard with a development board to your scooter or stuffed squirrel or whatever. And maybe we can find others interested in an open offline finding standard, because Google and Apple and Microsoft and you name it are working on something like this, but of course it's not so very openly developed. Spotnuts is a tinkering project. Thank you for the talk. The question is: how do you allow the helper device to send the message to the owner device and, at exactly the same time, not allow some stranger to track the owner? Somehow I have the feeling that at least one of my slides went missing when refactoring the slide deck. There's a backend infrastructure. One thing I mentioned is JGD, which is just a small mailbox server. It just has two API endpoints. One receives messages. It does not care what these messages contain; they are just JSON-encoded, encrypted messages to the public key we saw. And the owner devices just ask: hey, do you happen to have received any message for this public key I think I might have had? So the thing here is, you can actually, even in the Apple ecosystem, ask the server for all the messages you like. You can just send public keys there, and they will give you all the messages that were sent encrypted to those public keys. So you can download the whole database from Apple's servers as well. The nice thing is, you cannot do anything with it, because obviously you also need the second half of the key pair. If you don't have it, you get a nice bunch of random data. Over here. Hello. It's here. Over here. Would it make sense to make this key rotation time period not fixed at 15 minutes? Because if I was following a tag, I could time the key rotation based on the period and then know that it was rotated at the exact 15 minutes. Yes. A bit of a silly question, but have you considered Linux mobile support for the helper device? Can you repeat the question, please?
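The two-endpoint mailbox server described in the answer above is simple enough to sketch. This is an illustrative in-memory model, not the JGD API: the server stores opaque encrypted blobs under the beacon public key they were addressed to, and answers queries for any list of keys, without ever learning who owns which key.

```python
# Sketch of the two-endpoint mailbox: receive() accepts opaque encrypted
# messages addressed to a public key; query() returns everything sent to
# any of the given keys. Without the matching secret key, the blobs are
# just random-looking data, so open queries leak nothing. Names are
# illustrative, not the real API.
from collections import defaultdict

class Mailbox:
    def __init__(self) -> None:
        self._messages: dict[bytes, list[bytes]] = defaultdict(list)

    def receive(self, public_key: bytes, encrypted_blob: bytes) -> None:
        """Endpoint 1: store a message; its contents are opaque to the server."""
        self._messages[public_key].append(encrypted_blob)

    def query(self, public_keys: list[bytes]) -> list[bytes]:
        """Endpoint 2: return all blobs sent to any of the given keys.
        Anyone may ask for any keys, matching the open-query behavior
        the speaker describes for Apple's servers."""
        return [blob for pk in public_keys for blob in self._messages.get(pk, [])]
```

An owner would call `query()` with the whole list of per-interval public keys derived from its tag seed, batching the request so the server cannot tell which keys belong together.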
Have you considered supporting Linux mobile phones? Supporting mobile phones to carry the... One that's running Linux instead of Android or iOS. It's supposed to be a web application, which will need Web Bluetooth support in more browsers than Google Chrome, but actually there's this Rust library, and it should be easy to use it in any sort of app that you like, on any platform. That's great. Thank you. Thank you again. Thank you.
From an artificial nose weekend hack to a future-proof IoT device
That was helpful. Thank you. Thanks for joining. This is going to be a talk about a fun project that I started, I think, almost four years ago now, so I feel like I'm sort of milking the idea, but it's pretty cool. Back in 2019, I guess, I ended up building an artificial nose using some cool tech, and I'm going to talk a bit about the tech behind it and how I ended up moving the project from a really, really dirty weekend hack into something that's hopefully more future-proof, using cool things like Zephyr. So, a few words about myself. I'm Benjamin. I'm based in France. For the past year, almost to the day, actually, I've been working as a developer advocate for the Zephyr project at the Linux Foundation, and I do many things, including, as a good French person, I guess, baking bread. And I don't know about you guys, but I've been trying to perfect my bread recipe for probably over 30 years. Like, I'm still not really happy about the way it turns out. Like, it's a bit random, right? And so, back, I think, in the really first few weeks of COVID, being stuck at home with lots of time on my hands, I was like, maybe technology can help me improve my bread recipe. What if I could build a device, with maybe some AI in the mix, that I could train to figure out when my sourdough starter would be perfectly fermented? In my head, at least, the idea would be that I would, by eye, figure out when the sourdough kind of looks all right, bake the bread, figure out if the bread is good or not, give it a, like, oh, it's a nine out of 10, like, it's really crispy, really nice, and then do the training that way, right?
And so, the idea would be to smell the sourdough starter to capture some information. In my head, at least (I'm not a chemist, I'm not a food chemist), measuring things like the amount of volatile organic compounds and CO, CO2, whatever, there has to be a correlation with the perfectly ripe sourdough starter; there has to be a way to identify it, right? And so, back in 2019, there was also this new cool kid on the block, which was, and which is, tiny ML, and things like TensorFlow Lite finally available on microcontrollers, things like that, right? And the thing is, I know really little about neural networks myself. Like, for some reason, the math, whenever I would open a book about neural networks, and like, oh yeah, it's easy, you're going to recognize handwritten digits, this is a bitmap, you go through some layers, blah, blah, blah, oh, you recognize the digits, that was going way over my head. The thing is, playing with physical, more tangible things, I was actually on a roll in just a few hours, really, with the help of some tools some of you might have heard about, something called Edge Impulse. It's not strictly speaking open source, although it's based on TensorFlow Lite for Microcontrollers, but it helped me train a model: basically taking some Arduino-compatible device (this is a WIO terminal, a Cortex-M4), taking a gas sensor, capturing the data quite often, feeding this data into some kind of training algorithm, and I would be able to figure out the difference, not necessarily between good bread and bad bread, because, remember, COVID, flour wasn't even available in the supermarkets, but between the booze that I had in my house. So I actually figured out that it was able to make the difference not only between, like, rum and whiskey, but it was actually accurate enough that, with two whiskeys, one really peated and one slightly less peated, it could tell the
difference, right? And I started to talk about the project, because I found it really cool. And it got used, not for the silly bread thingy, but for something slightly more useful, which is figuring out when, in human breath, you can spot the markers for fungal pneumonia. Kaleb, the kid, almost died, basically, when he was really young, and the doctors couldn't diagnose the disease. It turns out that since then there's literature available out there that says that, yeah, there are some markers, and he sort of built a proof of concept for that. So that felt really good, but what didn't feel really good is that the code of that project that I had put together, available on GitHub from day one, is horrible. It's like 2,000 lines of boilerplate, copy-paste, typical Arduino code, right? Like, I mean, I've been gathering bits here and there; of course it works, but it's really, really bad. Now, just really quickly, because I think it's worth mentioning: how does a machine smell, anyway? We're all, I think, familiar with things like temperature sensors and humidity and illuminance; those certainly come to mind, because we actually also use them every day. But there are also sensors that can smell: they measure the concentration of particular chemicals in the air. The way it works is basically just a chemical reaction on a tiny slice of metal oxide semiconductor, and based on how many of the target compounds can be found in the air, you can measure a variation, a change, in resistance, right?
The more VOCs, volatile organic compounds, there are in the air, the higher the resistance, for example. Which means that I could start acquiring data, putting my sensor on top of bottles of alcohol and tea and coffee and whatnot, and capture basically what I would call the olfactory fingerprint of a particular smell, and then, with a bunch of AI and ML, figure out what in this raw data identifies a smell. And so my intuition, not knowing, again, a thing about signal extraction and all that kind of thing, would be: oh, well, if this is whiskey, then if I were to write down what makes whiskey so special, it would probably be something like, oh yeah, when you smell whiskey, nitrogen dioxide goes up, carbon monoxide not so much, VOC goes up as well, maybe in a slightly steadier way. And basically what happens then, the way the model works, is just that, except that it's a machine doing it: looking at the raw data, doing some basic statistics to extract the mean, the min, the max, the standard deviation, all those things that could potentially characterize the smell. And then this pre-processing, this DSP, if you will, goes through a typical neural network. So this is fun; you get to the point where you have this funny-looking thing. You can even go the extra mile and, like, 3D print the enclosure, and, yeah, you have a lot of fun.
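The pre-processing step described above, reducing a window of raw gas-sensor readings to simple statistics before classification, can be sketched directly. This is an illustrative stand-in, not the Edge Impulse DSP code; the channel names are assumptions:

```python
# Sketch: turn a window of raw readings per gas channel (e.g. "voc",
# "co", "no2") into the mean / min / max / standard deviation features
# the talk describes, ready to feed into a small classifier.
import statistics

def extract_features(window: dict[str, list[float]]) -> dict[str, float]:
    """Flatten each channel's sample window into summary statistics."""
    features: dict[str, float] = {}
    for channel, samples in window.items():
        features[f"{channel}_mean"] = statistics.fmean(samples)
        features[f"{channel}_min"] = min(samples)
        features[f"{channel}_max"] = max(samples)
        features[f"{channel}_std"] = statistics.pstdev(samples)
    return features
```

The resulting fixed-length feature vector is what makes a tiny neural network feasible on a Cortex-M-class device: the network never sees the raw time series, only a handful of statistics per channel.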
I ended up packing, again, into those 2,000 lines of code, plus all the libraries, of course, that I'm pulling in: a GUI; Wi-Fi integration, actually, that's something that I added eventually, so that whenever I smell something, I can push it using MQTT to a server; and, of course, tons of hardware interactions. And all that needs to work at the same time, except that if you do it the Arduino way, and the lazy way, I guess, then you end up just doing this: if you're lazy and just eager to get your POC working, you end up putting a lot of code in, essentially, a superloop. And so, as often as possible, I need to do all of this. Which is: acquiring sensor data, which, by the way, you don't need to do that often to get good accuracy. The way the device works is that I just sample the gas sensor readings 10 times a second; it's not all that much, so every 100 milliseconds I would read sensor data. Then I need a bit of time to actually run the data through the AI model, which, again, doesn't take a lot; the model, at the end of the day, is really simple, so you really only need a couple of milliseconds there, fair enough. And then there's the whole GUI aspect, which, again, if you're lazy, isn't even interrupt-driven: you have to figure out whether a button is being pressed right in the loop. Not ideal, but you do that. And then, if you want, you post results to an IoT server, and you don't even know how long that is going to take, right? Like, if this is synchronous, it might be a problem. Enter an RTOS, right?
Basically, for the first few years of the project, it was sitting there on GitHub, this really crappy thing, where people would open issues. I mean, yes, I would put up the ready-to-flash firmware for people to use, but anyone who wanted to actually tweak the code was just scared. And so I ended up using Zephyr to try and rewrite it, and also, frankly, to learn some of the best practices myself. I ended up trying to leverage some of the features of Zephyr which, beyond it being an RTOS, would hopefully help me move away from the superloop, and also get a better solution for targeting multiple architectures. Like, originally I was targeting the WIO terminal, which is a SAMD51 Cortex-M4, but I actually don't mind ESP32, and having the same code, the same portable code, and a portable build infrastructure and test infrastructure, I don't mind getting that, plus all the libraries that also come pre-packaged. And yeah, that's basically what I did. So, from this point, the presentation is more about telling you how I replaced some of the concepts or some of the things that I had in my Arduino code, and pointing you to some interesting areas in Zephyr, features and subsystems that are available that maybe you didn't know existed, and, frankly, I didn't know existed either. Sensor acquisition, that might be the easy part, but I really like the fact that now, in my V2 version of the nose, if you will, I have essentially, and literally, a dedicated thread that acquires the data exactly at the sampling rate that I require for my model to perform accurately, right? That could be an issue: if I do the superloop thing, and for some reason the UI takes longer to refresh, or communicating with the cloud takes longer, then it will basically shift the sampling rate for the gas sensor data, which basically means that I will start feeding crap into my AI model.
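The drift problem described above, where slow GUI or network work shifts the sampling rate, is exactly what a dedicated acquisition thread fixes. A minimal sketch (illustrative Python, not the Zephyr thread) shows the key idea: schedule each sample on a fixed time grid instead of sleeping a fixed amount after variable-length work.

```python
# Sketch: drift-free periodic sampling. Samples land at start + n*period
# regardless of how long each iteration's work takes, so hiccups elsewhere
# never shift the 10 Hz gas-sensor sampling rate the model was trained on.
import time

PERIOD_S = 0.1  # 10 Hz, as in the talk

def next_deadline(start: float, now: float, period: float = PERIOD_S) -> float:
    """First sample time at or after `now` on the fixed grid start + n*period."""
    elapsed = max(0.0, now - start)
    deadline = start + int(elapsed / period) * period
    if deadline < now:
        deadline += period
    return deadline

def sampling_loop(read_sensor, period: float = PERIOD_S) -> None:
    """Call read_sensor() at a steady rate, sleeping until each deadline."""
    start = time.monotonic()
    while True:
        read_sensor()
        deadline = next_deadline(start, time.monotonic(), period)
        time.sleep(max(0.0, deadline - time.monotonic()))
```

In the Zephyr version, the RTOS scheduler gives the acquisition thread this property directly; the superloop, by contrast, only resamples after all the other work has finished.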
Also, you may want to sometimes put the sensor to sleep and make sure that it doesn't draw energy unnecessarily; that's actually also integrated in the Zephyr APIs. Then comes the TensorFlow Lite aspect. So, I'm basically pulling TensorFlow Lite as a library into my application and leveraging something called zbus, which makes it easy, especially for someone like me who's not necessarily a hardcore embedded developer. I basically have this high-level framework where, okay, I have my sensor acquisition thread that does its stuff and basically puts the sensor readings in a ring buffer, and whenever there is data available for the rest of the world, and the rest of my app, to do something with, there's effectively an eventing system: my inference thread subscribes to sensor readings, gets the data, does its stuff, and figures out what it is smelling. And it also uses zbus to put the result out, using the same topic mechanism, if you will, so that, guess what, the GUI, for example, can in turn subscribe to this piece of information and do something useful with it. No need for FIFOs and queues and semaphores; it's actually really nice, and the overhead is minimal.
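The publish/subscribe wiring described above can be sketched in a few lines. This is plain illustrative Python, not the zbus API: the sensor thread publishes readings on one channel, an inference callback consumes them and publishes a result on another channel, and the GUI subscribes to that.

```python
# Sketch of the zbus-style eventing pattern: sensor readings -> inference
# -> GUI, decoupled through named channels instead of FIFOs and semaphores.
from collections import defaultdict
from typing import Any, Callable

class Bus:
    def __init__(self) -> None:
        self._subs: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, channel: str, callback: Callable[[Any], None]) -> None:
        self._subs[channel].append(callback)

    def publish(self, channel: str, message: Any) -> None:
        for callback in self._subs[channel]:
            callback(message)

bus = Bus()

# Inference subscribes to raw readings and publishes a typed-looking result
# (label + confidence, as in the talk); the classifier here is a stub.
bus.subscribe("sensor_readings",
              lambda r: bus.publish("inference",
                                    {"label": "coffee", "confidence": 0.9}))

# The GUI subscribes to inference results without knowing who produced them.
gui_updates: list[dict] = []
bus.subscribe("inference", gui_updates.append)

bus.publish("sensor_readings", [400, 412, 407])
```

In Zephyr, zbus additionally handles thread-safe delivery and typed channel messages; the decoupling shown here, where publisher and subscriber never reference each other, is the property the talk highlights.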
So, there's that, and then for the GUI, one thing that's really nice with Zephyr is that you have LVGL; it just works. There are obviously tons of drivers already available in Zephyr for a wide variety of display controllers, but then on top of that you even have the high-level framework that is LVGL for creating a GUI with, like, charts, and this gauge (and I never know how to pronounce it, like, this gauge). The gauge and the charts are effectively widgets that subscribe to the data that comes in over zbus and just display it, and the code is really, really straightforward. It also integrates with things like the Zephyr input system: if you have buttons, keypads, or touchscreens that send events, you can have the LVGL app automatically react to them, right? So that's nice. And as you may notice, this is not a photo of LVGL running on the actual device; it is a screenshot of LVGL running in a desktop environment, because you can actually run the full artificial nose code in a fully emulated environment, if you will, on a POSIX OS, including the GUI aspect. So that's pretty nice, and like I said, it really feels like you're writing really high-level applications. I'm defining a listener that wants to be notified whenever an inference result is made available, probably by the TensorFlow Lite for Microcontrollers task and thread, and when that happens, it's pretty straightforward: you get the data, and you actually get it as a typed message, something you can really make good sense out of. In my case, the inference result contains both a label, telling me it's smelling coffee, whiskey, whatever, and a confidence level, based on how confident the model is that it is effectively whiskey or coffee, and so I can actually display that on my UI. And the code has really, literally, moved from, yeah,
2,000 lines of code to, I didn't count, but a couple hundred max. So there's that. And then, this is sort of nice to have: if you were to do more than just a prototype toy project, you could think about putting the device, probably with something less stupid as the enclosure, in the ceiling of the restrooms here in the building, so that whenever it smells pretty bad, you know that it's time to send someone to clean the place. But you don't want to send someone to clean the place twice a day if nothing happened, like, if it's the weekend, or it's a day where there are strikes or whatever, or there's COVID and everyone is at home. So the device would need to communicate somehow, remotely, and adding that to my project was also pretty straightforward, because there's a full-blown networking stack in Zephyr: TCP/IP, and, like, CoAP and MQTT, and all the variants, all the flavors, and all the kinds of connectivity options you may want to use, they're all there. And so, effectively, I can maybe quickly switch to a really quick demo. So, well, this is the version with the enclosure; this is the version which is actually the WIO terminal; this one is an M5Stack Core2, so this is effectively an ESP32; this is the sensor. It's already configured and already connected to Wi-Fi. So if I were to connect to my MQTT... I think I need to stop sharing, maybe. If I were to connect, yeah, connected to an MQTT broker, and in real time, so this is really reaching the internet, and then my laptop connects to the very same broker that this guy is connected to, and, yeah, apparently it's smelling ambient air; I guess it's more, like, nerdy or geeky air. And if I put... so this is, yeah, well, that was fast, actually, this is lemon. And for the anecdote, not that you care, but I actually forgot to bring the lemon from home, so I bought this one just this morning, so
it's a different lemon, I guess, than the one I used for training the model, but apparently it works just the same. So, there's that. And what else? Yeah, many, many other things are pretty cool in Zephyr. The fact that it leverages Kconfig and devicetree, just like Linux does, makes for pretty neat code when it comes to, oh, I want my GUI to be slightly different if my screen is large, I want to cram more into the UI; well, that's information that you can get really easily from devicetree, right? If my screen is wider than 300 pixels, blah. Testing framework, CI integration: every time I commit and push a modification to the artificial nose, it gets built immediately. By the way, I was working at Microsoft back then, and they had absolutely no problem with me putting everything on GitHub, so kudos to them for that. So now the new URL, if you wanted to check out the Zephyr version, would be the same, with Zephyr in the name. You can find all the parts online. I don't get any royalties or whatever for that, but Seeed has actually put together sort of a nice, ready-to-use bundle where you can order all the parts. And that's it. Questions! Hello, thank you very much. So there is some abstraction where you can use different sensors, but surely the sensors don't give the same values for...
Great question. I had a slide, I removed the slide, removed the notes, I forgot. One thing that I would love to see happen, to kind of answer your question, is some kind of open data set, an open ontology, to actually describe smells in a consistent way. Because you're right: you would have sensors that give you readings in terms of, like, unitless concentration, going between zero and 100% of VOC concentration; some would be talking ppm; some would be whatever; some would have, like, weird calibration things. So you're right, you would probably need to retrain the model. At least with this code, it's not like you can easily say, okay, I'm going to switch from Bosch to AliExpress, and it's going to work just the same. I hope this answers the question. One more, yeah. We would like to know: how did it work with the sourdough and your baguettes? That's super; everyone asks that question. I never did the whole thing, because back during COVID there was no flour, and it would have been painful to bake dozens and dozens of baguettes and eat them anyway, and it's more fun to play with just random things like spices or booze. The sourdough thing probably works; frankly, it probably could be done in a simpler way too. Maybe you just need an alcohol sensor and just measure the peak, and maybe that's it, I don't know. Thanks, everyone. Okay, thank you.
Google Home, But Better: Building our own Smart Home Display with Flutter
So, welcome for the second time. Thanks for staying this long with me; it's the last talk today, named Google Home, But Better. Starting really well. Just a second. So, even though it's a short talk, just really quickly, a little agenda for today, what you can expect. A short section, really brief, about me: why am I talking about this, why should you listen to me talking about Flutter. The hardware used in this project, of course; that's one of the interesting parts, but no really big surprises there, it's just what you would all expect. Then we get to the software: part one, the embedded Flutter part, and part two, the implementation. And I think, for most of you, this will be the most interesting part of this talk. So, first about me. Hi there, I'm Moritz. Yeah, a few years ago, when I was 15, 16, I started out with embedded development. Back then it was all hobbies. I started out with an 8051 derivative; I think it was an Infineon XC878. I started developing in C. Back then I wanted to mainly build everything around music: hi-fi, loudspeakers, equalizers, digital sound processors, and so on. Following college, I worked in this area, which is why we created Snapp Embedded; that's what we're doing there. I'm also co-organizing the Flutter Munich meetup, so if you ever want to come over or speak in Munich about Flutter, just feel free to hit me up. So, I left embedded, and now I'm back at embedded. Why? This is maybe a really short clip showcasing why I'm back at embedded user interfaces, because this is still stuff we get today in new projects. Sometimes you get a new coffee machine, state of the art, with a touchscreen, and you use the touchscreen and you're like, oh no, God, why did you build this? So, yeah, I don't want to build any more of those things. I want to build a UI like the one today's talk is about. I hope this looks a little better than the things you saw before.
That's the user interface of the Google Home replica we built, or I built, that I originally wanted to present here live, but sadly it would have been hard to set it up in five minutes and get it here on the table, so we'll rather stick with the presentation. Also, it would have been unfair to all the people online. But nevertheless, I have pictures of everything; we're going through that now. So, the hardware. Yeah, as I said, not much more than you would imagine: a Raspberry Pi 4. It's still the model 4B, 4GB of RAM, that's enough. 2GB of RAM with a desktop environment and Flutter, yeah, I wouldn't recommend that on a Raspberry Pi. Of course, the Raspberry Pi 5 would work; it would just be more expensive, and it would run just as well. A little thing we have in here, which deals with the "but better" part: with Google Home, or smart home devices in general, we can't add whatever hardware we want, and as we will not be adding a voice command service on this device, I thought about what would be cooler. Voice commands are already out there, and what do we need to see? What is the most interesting thing? And that's, for a lot of people, I guess, interacting with custom hardware. Therefore, we integrated an air sensor: the Pimoroni SCD41 measures CO2, temperature, and humidity, connects to the Raspberry Pi over I2C, and, which is also very handy, comes with a ready-made Python library that's known to work with the Raspberry Pi. The touchscreen is just some Waveshare 11-inch IPS panel: capacitive touch, USB, HDMI, really nothing too special. Those touchscreens just got really good in the last years. At least with Raspberry Pi OS it just works out of the box; it's fine, nothing to worry about anymore.
Then, for the last part: with smart home, what most people think about is turning light bulbs and plugs on and off, and for smart home projects, or whenever you want to do projects on your own, devices that come in really, really handy are those Shelly bulbs and Shelly plugs, because they come with a built-in web server and you just have a REST API. Connect them through your Wi-Fi, they come with an app, super easy, and you have a REST API where you can just interact: turn them on, off, change the colors. It couldn't get much easier. So, all together, without the whole bunch of cables, it then looks like this. So, now that we have the hardware part together, comes the next interesting part, the embedded Flutter part. And as the talk earlier already pointed out, there's not just one Flutter to run on embedded devices. If you Google it, if you want to start out with it, you will find a few repositories, all dealing with Flutter and embedded devices. We just saw one, in fact, in the last talk; it was using flutter-pi. So what's with that? Why are there different options? Is this not Flutter? Well, it is Flutter, but to understand this we may have to, yeah, next slide, we may have to look at the Linux embedder that Flutter uses. The custom embedders connect the Flutter engine with the targeted platform, and the main difference with those custom embedders, which I have, let's see if this works, here on the right side, fancy, I wasn't prepared for that. So, the main thing you can see here is that something's missing. Flutter for Linux just heavily depends on GTK, in fact GTK 2, which is becoming a pain right now for Flutter itself. So, what all of those libraries have in common: we don't really need those GTK parts that Flutter uses anyway in embedded hardware.
We don't have tabs, we normally don't have windows, we don't need all of that stuff, so they just get rid of it, which sadly isn't that easy in the, let's call it vanilla, embedded Flutter. But they get rid of it, so you can use Flutter on custom hardware without GTK, and that means you can use Flutter, for example, with Wayland, with a custom embedder, as the talk before already pointed out, which is not possible right now with plain Flutter-on-embedded, especially if you want to go in a really industrial direction. But we're getting there. Also, a big part that's missing right now is tutorials; the tools are still few. Just Google it; there's not much out there. But I'm sure we will get through this within this year, or at least maybe the next, and then Flutter will also definitely become accessible to startups and smaller, medium-sized companies; there will be tools and software-as-a-service offerings around that, and Flutter will get more mature. I think we don't know it yet, but I guess that Flutter will get more mature in the embedded world in the next one to two years. But, if we want to do a project right now, where we just want to try out how Flutter on embedded devices works, at least for this project, when we use a Raspberry Pi, we have Raspberry Pi OS, we can just use Flutter as it is, we can build for Linux there, and it will work just fine. The newest Raspberry Pi OS changed to, I think, using Wayland. I haven't tried it yet, but apparently it works all right. Flutter needs to do something about GTK2 anyway, so maybe it will be possible with just normal Flutter to build something suitable for Wayland and direct rendering as well in the future. For right now, if you're doing a hobby project, if you just want to try something out with a Raspberry Pi, just go with Flutter as it is; it's fine.
If you want to use direct rendering, if you want to go with Wayland, if you want to get something to production grade, then you have to look at flutter-pi, Toyota's ivi-homescreen, or the one from Sony; the Toyota one really is amazing and is moving forward at a really fast pace. So, enough of this generic talk about Flutter. What about the implementation of this project? I want to go through it in a few steps, and the first part we need for this project to work is connecting the Raspberry Pi to the touchscreen. What do we do for that? We use the Raspberry Pi Imager, install Raspberry Pi OS, and it just works out of the box, thanks to a lot of guys that are also here. That's really, really easy. Then we need to get Flutter running. For that, we wrote a tool; as I just said, with Snapp Embedded we're doing open source projects around that. We basically built a tool, there's a repo with the link, called Snapp CLI, which allows you to, from your host machine, set up a Raspberry Pi that's connected to the same network as you are. It'll connect over SSH, it will install Flutter and all the stuff you need, and it will set the Pi up as a custom debug device, so that you can just run and debug the code out of VS Code on Linux, Mac, or Windows, and the code will compile and everything will run in real time, with hot reload working with the Dart tools, on your Raspberry Pi. If you want to just develop on a Raspberry Pi, that's already really easy and straightforward. Even the Dart DevTools work; all of that is already there. Just, yeah, no cross-compilation; we don't want to go in that direction yet. The next part is rather uninteresting. Here you can see a little bit of Dart; that code won't run, I cut out everything that looked ugly. So that's just basically a GET request.
You connect the bulb and the plugs with your Flutter or Dart application, and run these functions to get the bulb status, set the bulb status, or set the bulb color. The more interesting part, I guess, and what I wanted to point out, which will also explain how you would integrate a voice assistant with a Flutter application on the Raspberry Pi, is: how do we connect this sensor, which sits on the I2C bus, with our Flutter application? There are different approaches we could use here. We could do a Dart implementation of everything, talking directly to the I2C bus: go through the data sheets of the sensor and implement all the commands ourselves. We could spin up an MQTT broker on the Raspberry Pi, connect the sensor to the broker, and subscribe the Flutter application to the MQTT broker, because MQTT is one of those plugins that works with most of the custom embedders, so that really works out of the box. That would be a possible route. We could, as I did here with Python, use a Python backend: just make another REST API on the device and talk to it locally; I think a lot of embedded projects do it that way. Or we use D-Bus. We have D-Bus running on Raspberry Pi OS, we have D-Bus running on most Linux systems, and we can just hop onto the session bus for this purpose. The plugins are also already there. And for this example, this is what we did, because for connecting a Flutter application with whatever other process is running on the machine, you can just use D-Bus. We could just use the Python example library that shipped with the sensor, of course; we don't want to do the work twice. So we can connect whatever we want right now with packages and plugins that are already available. Resources, thank you very much. Two minutes.
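One of the alternatives mentioned above, a small local Python REST backend that the Flutter app polls, can be sketched with nothing but the standard library. This is an illustrative sketch, not the project's actual code: `read_sensor()` is a stand-in for the vendor's real I2C library, and the endpoint name is made up.

```python
# Hedged sketch: expose an I2C sensor reading over a tiny local REST API.
# read_sensor() is a placeholder for the sensor vendor's Python library.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def read_sensor():
    return {"temperature_c": 21.5}      # stand-in for the real I2C read

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(read_sensor()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):       # keep the console quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# This is what the Flutter app would do over HTTP, shown here with urllib.
url = f"http://127.0.0.1:{server.server_port}/sensor"
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read())
server.shutdown()
print(data)
```

The Flutter side would then be an ordinary GET request, exactly like the bulb example in the talk.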
How do you write an emulator anyway?
Thank you. Well, welcome to this session and congratulations on waking up so early after yesterday evening. It's always hard on Sunday morning. And thank you to those who are watching online. So, who am I? My name is Anisse, as Mahmoud said. You can follow me on social media and find my blog here. I'm writing this Game Gear emulator called Gears. It is not the subject of this talk, but maybe I'll tell you a bit more about the Game Gear hardware so you can see how that helps with writing an emulator. I'm not an emulation expert; I know there are a few here who are very well versed, but I'm hoping this gives another perspective. I also gave a presentation on the Z80, pre-recorded, in the emulator dev room two years ago; you can watch that talk. And yesterday, in the Rust dev room, I spoke about WebAssembly and porting this emulator to the web browser; you can also watch that recording when it's online. So this is a small demo, what you can see here: this is the emulator running in a native window. Yeah, nothing very specific. First of all, I'll tell you why I'm giving this presentation. But before that, has anyone here ever written an emulator before? Okay. Oh, that's quite interesting. Who here knows how to program, how to write code? Oh, nice. That's good, because that's not the goal of this talk; it's not to teach you how to code, right? You know how to program. I'm hoping, given those skills, to give you a few pointers on how to start: where to find documentation, things like that. The goal of this talk is not to be exhaustive, otherwise it would be a full university course over a semester or something. And I also want to tell you why you should write an emulator, though that's something that should come from you. The focus of this talk will be on simpler platforms, because it's always easier to start with something a bit simpler. Yeah. So what is an emulator, first? A few definitions.
It's something I struggled with a bit, because emulators come in many shapes, but in general it's a software program that is used to run software from another computer or platform. To give a few examples, here I'm showing screenshots of existing emulators. You have a Game Boy emulator named SameBoy, and another one, BGB; some support weird devices like the printer. I'm showing an emulator for the BBC Micro running on the Android platform. There's also the Android emulator itself, where you emulate a computer that runs the Android OS. And I also put something in here which might be debatable, which is the Analogue Pocket, an emulator using FPGA: you write software-defined hardware and run real cartridges with software from other platforms. An emulator can cover a huge spectrum of, let's say, accuracy and emulation scope. What does it emulate? Accuracy is how faithful you will be to the original. When you're emulating something, will it run just one piece of software? If that's your goal, that's all right: you just emulate enough of the platform to run one game, without worrying about all the available software. Or maybe you want to do even more and be able to run any software for the target platform identically to how it would run on real hardware. We call that clock accurate, and there are many levels in between on this spectrum. Before we continue, I wanted to show you a crazy example of an emulator I found a few weeks, a few months ago. It's a RISC-V emulator running Linux, written in Scratch, the Scratch programming language. We can't really see much on the screen here, so I'll describe it: you have a Linux terminal, it has already booted, I typed some commands, and here I'm scrolling, and you can see the Scratch code of the RISC-V core. So yeah, emulators come in all shapes and colors. So, you want to write an emulator. Let's go with the first level. What will it be? Starting.
How do you start? The first thing you have to do is to pick a target, and by target I mean the platform you want to emulate. You have to pick this target. You also have to pick a host platform, to start somewhere; even if your goal is to write something portable that runs on everything, you have to, again, start somewhere. And make sure you have a bit of time, if you want something that's complete. Emulators are something where it's hard to decide when they're complete: you can always add more features, more things. You don't have to have a lot of time in a short period; spreading the work over a longer period works as well. For example, I started my emulator two years ago and I've been working on it on and off, so it's not something that takes a lot of time every day. That's what I mean. Where to start? Okay, start simple, with the CPU. You pick one CPU instruction. You write some code that is able to disassemble it, which means you take the binary form of this one instruction, a few bytes, maybe one byte, it depends on your platform, and ask: can your code recognize this one instruction? It might seem trivial; it's just a few bytes. But that's how it starts, basically. So you start with this, and then you add stuff on top of it. You have your disassembler, which is very useful for debugging. Then you add something else: execution. So you have the CPU. How do you model its state? What's inside the CPU? Go look for more information on what a CPU is. Build this state, change it, which is basically what executing an instruction does, and verify that the state changed as you expected. If the instruction adds something to a variable, you do an add operation in your language. That's a good starter, and as you go, you keep learning new CPU concepts and how a CPU works. So yeah, this is helpful for starting.
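The "recognize one instruction" starting point can be sketched in a few lines. This is an illustrative sketch, not code from the talk; it happens to use two real Z80 opcodes (`0x3E` = `LD A, n`, `0x00` = `NOP`), but the function shape is made up:

```python
# Minimal disassembler sketch: recognize a single instruction at a time.
def disassemble(memory, pc):
    """Return (text, length_in_bytes) for the instruction at address pc."""
    opcode = memory[pc]
    if opcode == 0x3E:                          # LD A, n : 2 bytes
        return f"LD A, 0x{memory[pc + 1]:02X}", 2
    if opcode == 0x00:                          # NOP : 1 byte
        return "NOP", 1
    return f"DB 0x{opcode:02X}", 1              # unknown byte, just dump it

rom = bytes([0x3E, 0x42, 0x00])                 # LD A, 0x42 ; NOP
pc = 0
while pc < len(rom):
    text, length = disassemble(rom, pc)
    print(f"{pc:04X}: {text}")
    pc += length
```

From here the speaker's advice is exactly this loop: add one more opcode whenever you meet it, and the disassembler grows with your emulator.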
So, a CPU is a processor. It's usually considered the heart of these consoles; nowadays it might be the GPU, but as I said, this talk is focused on 8-bit platforms. As I told you, it has state. This state is basically what we call registers; it has other kinds of state too, but let's start with registers. I told you about instructions. An instruction is the minimal operation a CPU can do. It has an assembly representation, a text form: you've probably heard of the assembly programming language; that's how instructions are visualized for humans. And it has a binary version, an encoding, and these are the bytes you have to recognize. There are other interesting concepts, like interrupts: these CPUs execute instructions sequentially, but they can also be interrupted, so that when an event arrives from the outside world, it changes the way the code executes. Also interesting is how you access memory. I told you about state: usually, as a programmer, when you write code, you think about variables and things like that, and this hides the fact that on hardware, state lives in registers or in memory. The way a CPU accesses memory is also quite interesting. But the goal here is not to teach you those concepts, it's to give you pointers on how to learn them. So, we've covered how to start. Let's talk about how to structure an emulator. You've been writing a bit of CPU code; how do you structure the whole emulator? Because the CPU alone does not make a complete thing. I'm giving you here an example of an emulator structure, a schematic by Rodrigo Copetti, who has been writing very nice introductory documentation on various hardware platforms. Here I took the Master System one. You can see, as I told you, that the CPU is the central part: it's the square in the middle, labeled Zilog Z80. Then you have other devices that are interesting. I told you about the memory.
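The "model the state, execute, verify" loop can be sketched like this. Again a hedged illustration, not the talk's code: only two real Z80 opcodes are implemented (`LD A, n` and `ADD A, n`), and flags are deliberately ignored to keep it short:

```python
# Sketch of modelling CPU state and executing instructions, Z80-flavoured
# but heavily simplified (no flags, only two opcodes, 8-bit wrap on adds).
class CPU:
    def __init__(self):
        self.a = 0          # accumulator register
        self.pc = 0         # program counter

    def step(self, memory):
        opcode = memory[self.pc]
        if opcode == 0x3E:                                  # LD A, n
            self.a = memory[self.pc + 1]
            self.pc += 2
        elif opcode == 0xC6:                                # ADD A, n
            self.a = (self.a + memory[self.pc + 1]) & 0xFF  # wrap at 8 bits
            self.pc += 2
        else:
            raise NotImplementedError(f"opcode {opcode:#04x}")

cpu = CPU()
program = bytes([0x3E, 0x40, 0xC6, 0x02])   # LD A, 0x40 ; ADD A, 0x02
cpu.step(program)
cpu.step(program)
print(hex(cpu.a))                           # prints 0x42
```

Verifying that `a` ends up as `0x42` and `pc` as `4` is exactly the "verify the state changed as expected" step the speaker describes.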
Here on this platform, you have two kinds of memory: there's ROM and there's RAM. You have IO control, which is how you plug in a joystick, or, at the time, controllers; on this platform they're plugged into an IO controller, which is connected to the CPU. You have the game cartridges: those are a specific type of memory, with things like paging in order to access more memory than the CPU can address. It also has a way to generate sound: a small device, a chip called the PSG. This device is from Texas Instruments and generates very simple sound. It has a video display processor, which would be the ancestor of today's GPUs. The video display processor is a bit specific here: it has access to its own video RAM, which is a concept you have to think through if you want to emulate this platform. And the video encoder is used for TV output. So it depends, again, on the platform, but this is nothing very special; many platforms of the time had very similar architectures. This is interesting because, as you structure your emulator code, you will probably want to follow this structure: take those devices as a code boundary and organize your code into modules, or whatever your language has, functions, objects, classes, namespaces. It's a useful code boundary for saying: okay, this device could be emulated like this, and there's another device like that. Another trick I'd like to share: when you're writing an emulator, you don't have to think about optimization too much, but you're allowed to optimize a bit. For example, you're writing a CPU. It's a very simple thing, and you probably want it to be fast; it's something that will have to be very fast. You might want to, for example, avoid allocations on the emulation path.
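The "devices as code boundaries" idea often shows up as a bus object that dispatches reads and writes by address. A hedged sketch, with a made-up two-byte ROM; the RAM base address `0xC000` matches the Master System's documented memory map, but the sizes here are illustrative:

```python
# Sketch: a "bus" routing CPU memory accesses to ROM or RAM by address.
class Bus:
    def __init__(self, rom):
        self.rom = rom                  # read-only, mapped from 0x0000
        self.ram = bytearray(0x2000)    # 8 KiB RAM, mapped at 0xC000

    def read(self, addr):
        if addr < len(self.rom):
            return self.rom[addr]
        if 0xC000 <= addr < 0xC000 + len(self.ram):
            return self.ram[addr - 0xC000]
        return 0xFF                     # unmapped reads often float high

    def write(self, addr, value):
        if 0xC000 <= addr < 0xC000 + len(self.ram):
            self.ram[addr - 0xC000] = value & 0xFF
        # writes to ROM are silently ignored

bus = Bus(rom=bytes([0x3E, 0x42]))
bus.write(0xC000, 0x99)
print(bus.read(0x0000), bus.read(0xC000))
```

Each device (VDP, PSG, cartridge mapper) can then live behind its own address range, mirroring the schematic's boxes in code.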
If you know what memory allocation is: it's something that can be quite costly. It's very useful, but when you're emulating, it's not something you want to do on every instruction. You might also want to use jump tables; this one is debatable, and depending on your language, it might be automatic. Quite common advice when people start writing an emulator is that you should write a vertical slice. What does that mean? You have all those things: the CPU, the video display processor, the audio. If you're going to write an emulator, you probably want to see results quickly. That means you write support for a few CPU instructions and then a bit of display code, so that very quickly you get feedback and see the screen showing something, like the Nintendo logo on the Game Boy, or the Sega logo, or whatever. You can do that. It's not what I did; do what works best for you. For example, I gave a talk two years ago on the Z80: it was a pre-recorded talk, and at the time I had nothing but a CPU. It depends on what you want to do. Do not hesitate if you have any questions... no, maybe we'll take them at the end, because the talks are recorded. Sorry about that. Another trick: I told you a bit about the disassembler before. You should disassemble and write the text, assembly versions of instructions; it will be very useful to have a debugger. You might want to build debugging tooling early, to debug what's happening inside your emulator, because you will have bugs. You will have emulation bugs. Build this tooling early. Or you can use already existing tooling. Here you have Emulicious: it's a great one. It's not open source, unfortunately, but you should definitely check it out. It's a multi-platform emulator; I think we have the developer here in the room. Definitely use Emulicious.
I can't tell you how many platforms it emulates, because I don't remember, but it includes the Game Gear and the Master System. It has a great debugging toolkit: you can see the assembly, you can see the video devices, you can see many things. Always make sure you have debugging in whatever form works for you. Tracing or logging is nice, but be able to inspect the state of the emulated target machine. I told you about all of this, but here's something quite interesting: how does one find information on where to start? That's a very common question I've had. Where do you find documentation on emulation, on how the hardware works? Well, basically, you look online. There are many different communities. If you want to emulate a Game Boy, you probably want to go to gbdev. It has information on how to write Game Boy software, but also on how the hardware works. In fact, I like reading documentation aimed at developers for the platform rather than at emulator developers, because it tells you how you're supposed to develop for this platform, which also means you'll understand how to emulate it. There's also Gekkio's Complete Technical Reference, which is considered a definitive guide on the Game Boy. If you want to emulate a Sega platform, you probably want to go to SMS Power. It's a community around the Sega Master System and other devices like the Game Gear, the Sega Mark III, the SG-1000, and, most recently documented, the Sega AI, which was a computer from the 80s. I'm sorry I don't have a screenshot here, but it was very interesting: an AI computer that Sega released in 1986. I invite you to look it up online. So, SMS Power has documentation on how to develop software for the Sega Master System and the Game Gear, it has documentation on how the Z80 works, and it has many links to other documentation on video, audio, etc. Among the guides I used when writing my emulator, there are three main ones. The first is the hardware reference manual for the Sega Game Gear console.
Some people in the community took this developer manual that Sega wrote for game developers, scanned it, OCRed it, and made a great PDF version. I don't know who did it, but it was invaluable as a preservation effort, and I also used it for developing my emulator. A small caveat here: when you're describing things for developers, you might not go into the details of how the hardware works, and sometimes there are edge cases that won't be explained to developers but that you need to emulate properly if you want the emulation to be correct. But in general it was very useful. The CPU of the Master System and the Game Gear is fully documented by Zilog. It has very complete manuals, and the company still exists, as opposed, for example, to the Game Boy, where all the documentation is unofficial. The Z80 CPU is well documented, and even then, there are tricks and things that are not documented in the official manual; that's partly what the talk I gave two years ago was about. You probably want to go read "The Undocumented Z80 Documented", and then afterwards watch my talk for the things that are not in that document. On finding documentation, a very simple trick: when you do research, use technical terms. It might seem trivial, but even I fell into this trap many times. You're looking for something; how do you get the most accurate results? One example: look up online which exact chipsets your target platform is using. For the audio, it's a chip from Texas Instruments, so instead of searching for "how to do audio for X console", use more precise keywords; it gives better results. I'm showing you here: on the left, I googled for "Game Gear sound" and you get YouTube videos and things like that, nothing very specific; on the right, you find almost only SMS Power, and that's it. So basically the link I gave you.
So let's get a bit more into what that means practically: how do devices work? What I'm showing you here is an extract of the Z80 manual on how to do device IO. It looks very complex and you don't need to understand it; it's basically electronics. But it hides the fact that back then, using devices was quite simple: it would be almost as simple as writing to a memory address, and that's how you interacted with a device. On the Z80 CPU there were dedicated instructions to do that, but it was quite simple, as opposed to a modern platform where you have GPUs, memory mapping, DMA, whatever is in a modern platform. It used to be much simpler, and that's something you can use when writing an emulator. In practice, you want to write an emulator for a host platform; make sure you understand your host platform first. You want to write an emulator for Windows? Make sure you understand how to display a pixel buffer on Windows. Do you know how to open a window, how to allocate a memory area where you can write pixels, what the pixel format is? Can you display something, a small image? Can you change it multiple times per second? So yeah, make sure you understand your host platform, and it's the same for audio. Say you want to start emulating sound: make sure you know how to play audio on your platform. You have a buffer; can you generate a sine wave or a square wave to make a beep? Nothing about this is emulator-specific, but it's really something you have to do for interactive development, or game development more specifically. So, let's start with graphics emulation. This is something where you will need hardware understanding. You will need to understand how the VDP works, for example, on the Game Gear, or how the PPU works on the Game Boy, and so you will need to read the documents I pointed to earlier.
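The "can you make a beep" hello world mentioned above amounts to filling a sample buffer. A hedged sketch: it only generates the signed 16-bit samples; handing them to an audio API (SDL, PortAudio, whatever your host offers) is left out, and the frequency and amplitude are arbitrary choices:

```python
# Host-audio hello world: one second of a 440 Hz square wave as
# signed 16-bit samples, ready to hand to your host's audio API.
SAMPLE_RATE = 48000
FREQ = 440
AMPLITUDE = 12000

period = SAMPLE_RATE // FREQ            # samples per full wave (~109)
samples = []
for i in range(SAMPLE_RATE):            # one second of audio
    # high for the first half of each period, low for the second half
    value = AMPLITUDE if (i % period) < period // 2 else -AMPLITUDE
    samples.append(value)

print(len(samples), samples[0], samples[period // 2])
```

Once this buffer audibly beeps on your host, you know the plumbing works and can swap the loop for real PSG emulation.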
I'm giving you an example here for the VDP, a few concepts that are interesting. When you want to display pixels, you need to understand how developers interacted with the device: for example, how they accessed the video RAM, how they used the registers of the VDP. Conceptually, I told you, it's very simple: you use specific instructions that basically send bytes one by one from the CPU to the VDP. That's how you send commands to it. So you write its registers, you write to VRAM; that's the IO side. Internally, it has a display area. Here, this is an extract of the Game Gear documentation where the LCD display area, the small part of the screen, is part of a bigger buffer; it's like a viewport, and it can scroll over it. The scrolling wraps around: the top and the bottom are connected, the left and the right are connected, so it's like a torus, mathematically, the donut shape. Other interesting VDP concepts: you have the sprites. I told you about the background; on top of the background, you display sprites. Sprites are often used for game characters, and they're very interesting because the VDP was basically a sprite accelerator: at the time, if you wanted to display moving things very fast, it was not simple, and the VDP helped with that. Sprites also helped with collision detection and things like that. But you will need to understand how color encoding works and how sprite pixels are encoded, because it's not simply a square buffer; everything has a specific encoding. It's well documented. Here, what I'm showing you is a tile map: a dump of the video RAM of the Sonic 1 Game Gear game. This tile map has sprites at the bottom and background at the top. It's not exactly the same as the displayed screen, but it shows how things are represented in memory before they are mapped to the LCD display.
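"Everything has a specific encoding" can be made concrete with tile decoding. The sketch below assumes the four-bitplane tile layout documented at SMS Power for the Master System and Game Gear (each 8-pixel tile row stored as 4 consecutive bytes, one per bitplane, leftmost pixel in the high bit); treat the exact layout as an assumption to verify against the docs:

```python
# Sketch: decode one row of a Master System / Game Gear style tile.
# Assumed layout: 4 bytes per row (bitplanes 0..3); pixel N's 4-bit colour
# index is assembled from bit (7 - N) of each plane.
def decode_tile_row(planes):
    """planes: 4 bytes (bitplanes 0..3) -> list of 8 colour indices (0..15)."""
    row = []
    for x in range(8):
        bit = 7 - x                     # leftmost pixel is the high bit
        index = 0
        for plane in range(4):
            index |= ((planes[plane] >> bit) & 1) << plane
        row.append(index)
    return row

# plane 0 = 0b10000000, others 0 -> first pixel is colour 1, rest colour 0
print(decode_tile_row(bytes([0x80, 0x00, 0x00, 0x00])))
```

A full tile is then eight such rows, and the colour indices are looked up in the palette (CRAM) to get actual RGB values.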
I won't go into details about this, but you will probably want a synchronization strategy between your CPU and your devices. If you want to synchronize the VDP, for example, it's easiest to do it line by line: you emulate a given number of instructions, and then you emulate one line of the VDP. This allows keeping the emulator single-threaded, because it's easier to think that way, and it's a viable strategy, one that can give accurate enough emulation. Sound emulation is quite interesting. Again, it requires hardware understanding, so you will need to read the documentation. I'll give you an example with the PSG. You write its registers; it has fewer registers and is much simpler, a device that's conceptually quite simple. It has four channels: three are tone generators, which basically generate beeps at a given frequency, and one is a noise generator, which generates noise, basically. There are multiple interesting things here. The tones are shown at the top right: they are square waves, at least in theory, because when you interact with hardware, life is analog, and it's not perfectly square, so it might look a bit more like the wave just below it. And what I'm showing here is the noise generator. It's a very simple hardware device called a Linear Feedback Shift Register, or LFSR, and it's used to generate noise by shifting a set of bits, right or left; well, it's the same idea, but here it's right. You start with one bit set, you shift, and you output the bit that falls out on the right. If you did that without feedback, it would just shift the one out and then output zeros forever, and that's it. Except this one has an XOR function: it takes two bits, XORs them, and puts the result back as the input. With this feedback, you're able to generate random-looking noise.
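The LFSR just described fits in a few lines. A hedged sketch: the register width and tap positions below are illustrative, not the real PSG's configuration (real chips use specific taps documented at SMS Power):

```python
# Generic linear-feedback shift register sketch: shift right, output the bit
# that falls out, XOR two tap bits back in as the new top bit.
# (Width and tap positions are illustrative, not the actual PSG's.)
def lfsr_stream(state, n, width=15, taps=(0, 1)):
    out = []
    for _ in range(n):
        out.append(state & 1)                       # bit shifted out
        feedback = ((state >> taps[0]) ^ (state >> taps[1])) & 1
        state = (state >> 1) | (feedback << (width - 1))
    return out, state

bits, _ = lfsr_stream(state=1, n=16)
print(bits)
```

Feeding each output bit to the speaker as high/low amplitude is what turns this bit stream into audible noise.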
It's not perfectly random, it's not cryptographically random, but it's good enough, and that's how it worked. For sound emulation, again, start simple. You want to generate a square wave, as I told you; it's a very good hello world for sound on your platform. But then you will need to add more things. On the PSG (it varies on other platforms; this one is in the Master System and the Mega Drive, as well as the Game Gear), you need to think of the tone channels as having counters, not frequencies: think in terms of period, not frequency. It's almost the same, except that when you're emulating, you will hit edge cases that won't work well if you think in terms of frequency. A quick note about synchronization: there are multiple ways to tie audio emulation to the CPU. My advice would be to use CPU cycles. When you're emulating instructions, you need to count cycles; depending on the platform, one instruction can take a varying number of cycles, from four up to around 20 on the Z80. You will need to count them accurately enough that when you play audio, it isn't distorted. And in general, it's useful to count cycles properly, even for the display. I wanted to give you an example about playing samples, very quickly. Samples also use a square wave, in a way, but with amplitude variations: they play a wave that's always high, so if you just played it, it would be silent, but they make the volume vary, and they do it very, very fast, like 7,000 times per second. That generates an audio signal, and that's how you get samples. Samples are when you hear, for example, the "Sega!" sound, something like playing an audio file today. This platform did not support playing an arbitrary audio file, so developers had to get creative. Testing: how does one test an emulator? There are various strategies for that.
For example, for the Z80 CPU, there are unit tests you can reuse from other emulators; the Fuse test suite, for example, has very good unit tests that are not dependent on the Fuse emulator. You can also use integration tests: ZEXALL and ZEXDOC, the Z80 instruction exercisers, are programs that were written for the ZX Spectrum, a computer from the 80s. They generate lots of instructions, execute them, dump the CPU state, and compute a very small checksum. They were run on the real hardware, recording the checksum for each instruction group. These Z80 tests are a bit long: they can take from a few seconds up to minutes (on real hardware it was much longer, of course), and they are very useful. Even if you're targeting another platform, you can reuse these CPU tests very simply, with a few bytes of modification, and they will work on your platform. How to test audio? Well, this one I'm not so sure about, because I don't know how to test audio emulation well. Listen to the music: does it sound like the original? You need a good ear for that. You can also use fast Fourier transforms, mathematical operations used to analyze an audio signal: for example, you generate a square wave through the emulation path; does it have the correct frequencies? And then, can you hear the samples? I told you about playing samples; these are, I'd say, the hardest part of audio emulation, because they depend on many accuracy details. So yeah, can you hear them? Other examples: for the Game Gear, there is the DG test SMS test suite; these are programs developed by emulator developers for the platform that test various features, here for the Game Gear and the Sega Master System. For the Game Boy, you probably want to look at the dmg-acid2 test, for example. The Game Boy is a platform that's very well emulated; it's a good choice to start with, and it has many tests.
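The ZEXALL-style "execute, dump state, checksum" idea can be sketched generically. This is an illustrative sketch, not how the real exercisers work internally: the `TinyCPU` class and the CRC32 choice are assumptions made for brevity:

```python
# Sketch of checksum-based CPU testing: run instructions, fold each resulting
# CPU state dump into a running checksum, then compare the final value against
# one recorded from real hardware or a known-good emulator.
import zlib

class TinyCPU:                      # hypothetical minimal CPU state
    def __init__(self):
        self.a, self.f, self.pc = 0, 0, 0

    def state_bytes(self):
        return bytes([self.a, self.f, self.pc & 0xFF, self.pc >> 8])

def run_and_checksum(cpu, steps):
    crc = 0
    for step in steps:
        step(cpu)                                   # mutate CPU state
        crc = zlib.crc32(cpu.state_bytes(), crc)    # fold state into checksum
    return crc

def ld_a_42(cpu):                   # stand-in for one emulated instruction
    cpu.a, cpu.pc = 0x42, cpu.pc + 2

checksum = run_and_checksum(TinyCPU(), [ld_a_42])
print(hex(checksum))
```

The point of the checksum is compactness: thousands of instruction variants reduce to one number you can compare against the hardware-recorded reference.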
Blargg's test suite, the Mooneye test suite: many accuracy tests. You should look into them. Another testing strategy is frame generation. You're emulating stuff; you're generating a display, pixel buffers. You can very easily dump your buffers into an image and compare this image with one from a good emulator, and you can also compare it with real hardware, for example using flashcarts, if you don't have all the original games; that can be useful. In general, I would say: test a lot of different software and watch how it behaves. For example, here on the left side, you can see my test directory for a few games, which I'm basically using as a regression test suite: does it still work? Some of the images have a story, like bugs I had to fix; when it finally worked, I recorded the frame to make sure it kept working. On the right, what you see is SameBoy's automatic frame generation. I captured a very small part of a web page where they test all the Game Boy and Game Boy Color games and make screenshots. It's very interesting. Other communities that are interesting for testing are the speedrun communities; I'll let you look into that. They also do frame testing, but they record the frames on real hardware. So, a summary of everything we said here, it was a bit fast, I'm sorry: pick platforms, a host platform and a target platform, something you want to emulate. Always do something simple first and then make it grow. Read a lot of documents; that's a big part of emulator development. Test, because depending on how accurate you want to be, you'll want to test your software properly. And don't forget, if you ever go and write an emulator, to write blog posts about it, so people know about it; and come to FOSDEM, to the emulator dev room, and give a talk. Please. Thank you. Any questions? Testing, testing. Shall I do the question round? We have a bunch of questions.
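The frame-regression idea described above can be sketched by hashing framebuffer dumps instead of storing full images. A hedged illustration: the buffer dimensions (Game Boy's 160x144, RGB) and the all-black "golden" frame are made up for the example:

```python
# Sketch of frame-based regression testing: hash the emulator's pixel buffer
# and compare it against a hash recorded when the frame last looked correct.
import hashlib

def frame_hash(pixels):
    """pixels: bytes-like framebuffer dump (e.g. RGB, 160x144x3)."""
    return hashlib.sha256(bytes(pixels)).hexdigest()

# Pretend frames: the recorded "golden" dump and a fresh one from the emulator.
golden = bytes([0, 0, 0] * 160 * 144)
fresh  = bytes([0, 0, 0] * 160 * 144)

if frame_hash(fresh) == frame_hash(golden):
    print("frame matches the recorded golden image")
else:
    print("frame regression!")
```

Storing only hashes keeps the regression suite tiny, at the cost of not showing *what* changed; keeping the golden PNGs alongside, as the speaker does, makes failures inspectable.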
So I'm just going to run around and... Thanks for the talk, it was really good. Two small questions. Approximately how long did you spend on your first emulator? Was it a few weeks, a few months before you got something running? And do you have any recommendations from your experience? I know you did some stuff in Rust. Did you use other languages? How was your experience with Rust: is it good for emulators, does it make things harder? Rust is very good. To be honest, it's a hobby project, so I didn't measure how much time it took before I had real feedback. You asked how long it took before I had real feedback: part of my strategy was different from what I usually recommend. I developed the CPU first, and I gave a talk about it. The feedback there was: do the test suites pass? I used different tests, and do they pass? That's how you get early results without having a complete emulator. About the programming language: I intentionally did not go into details in this talk, because I want people to be able to write in whatever language they feel comfortable with. Rust is great, Go is great; use whatever language you want. Especially if you're emulating an 8-bit platform, you don't really need to care too much about performance; you should be able to get very good results with whatever language you use. Unless maybe it's... I don't even have a good counter-example. So yeah. Next question. Thanks for the talk. Regarding the audio emulation, would it be possible to just record the waveforms and compare them? It can, but... I didn't want to go too much into details on that. An example: on the Game Gear and Master System, the sound chip generates audio at about 115 kHz. On modern platforms, you will run at 44 kHz or 48 kHz; you can get more on most laptops, but it's not what's usually used. So what you will need is a down-sampling strategy.
So you will need to take the samples and down-sample them to your host platform's sample rate, and this will generate artifacts. Yeah, okay. Good. Thanks. Next question, also regarding the audio: do you know if there's any ongoing work to emulate the original sound of the Game Boy or Game Gear? They have built-in speakers which compress the sound, so they have a specific sound which you can't hear these days. Do you know if there's anything like that, like recording the compression and so on? I think there is. I found a website, I don't remember the name, on audio emulation specifically. They were developing automatic filters to match the platforms as closely as possible, using machine learning, with the goal of putting them back into DSPs or FPGAs. I can't remember the name, I'm sorry, but that's something I'm very interested in, especially because audio emulation is not that simple. You need filtering: if you want to get closer to the Game Gear, for example, which has speakers, you will probably need filtering, because it has a frequency response that will not be the same as your modern speakers' frequency response. So yeah, I'm sorry. More questions? Yeah, I almost forgot about you, I'm sorry. You mentioned tooling and debugging of the emulator, and you said there were two options: you can write your own tooling and debugging so that you can inspect the state of your emulator, but you also said you can use external existing debuggers. How do they help for your emulator specifically? They will help you understand if you have a bug with a given game. They will help with multiple things: they will help you understand how the game works, so you have a better view of how the software is working, and they'll help you understand what you're doing wrong.
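The down-sampling mentioned in the audio question can be sketched as naive block averaging. This is an illustrative sketch only: real emulators use proper low-pass filtering and resampling to avoid aliasing artifacts, and the 4:1 ratio here is arbitrary:

```python
# Naive down-sampling sketch: average fixed-size blocks of high-rate PSG
# output down to the host sample rate. Real emulators use proper resampling;
# this only shows the idea (and why artifacts appear).
def downsample(samples, ratio):
    out = []
    for i in range(0, len(samples) - ratio + 1, ratio):
        block = samples[i:i + ratio]
        out.append(sum(block) / len(block))
    return out

high_rate = [1.0, 1.0, -1.0, -1.0] * 6      # toy high-rate square signal
print(downsample(high_rate, ratio=4))        # averages each block to 0.0
```

Note what the output shows: a square wave whose period exactly matches the averaging block collapses to silence, which is precisely the kind of artifact the speaker warns about.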
So you're emulating the game, you have your own logging, your own debug tooling, and they'll help you understand what you're doing wrong in your emulation. So it's more of a comparison type of thing. Alright, we have time for a short question. And also, can you put your contact details on your first slide, I think, so people can find you after the talk? Oh yeah, I think that's a good idea. It was a short question, right? Or turn it into a short question. Are there any worthwhile platforms left to emulate that are also approachable? I would say yeah. I gave the example of the Game Boy; it was very specific because it has so much documentation. So yeah, it's a good platform to start with. Is there anything left? It's okay if it has a lot of emulators. If you want a platform that no one has written an emulator for, it will be harder, because you have to discover all this information by yourself. So if you want an easy thing to start with, it's not the same as exploring new stuff and reverse engineering and things like that. Well, thank you very much. Thank you, Anis.
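As a footnote to the audio question above: the down-sampling strategy the speaker describes can be sketched as a simple averaging decimator. This is a minimal illustration, not code from any real emulator; the function name and the integer decimation ratio are assumptions, and a real emulator would apply a proper low-pass filter before decimating to reduce the aliasing artifacts mentioned in the answer.

```c
#include <stddef.h>

/* Naive down-sampler: average groups of `ratio` chip-rate samples into one
 * host-rate sample. A real emulator would low-pass filter first to avoid
 * aliasing, and handle non-integer rate ratios (e.g. 115 kHz -> 48 kHz). */
size_t downsample_avg(const short *in, size_t in_len, int ratio, short *out)
{
    size_t out_len = in_len / ratio;
    for (size_t i = 0; i < out_len; i++) {
        long acc = 0;
        for (int j = 0; j < ratio; j++)
            acc += in[i * ratio + j];       /* sum one group of samples */
        out[i] = (short)(acc / ratio);      /* emit their average */
    }
    return out_len;
}
```

Averaging each group is what introduces the artifacts the speaker mentions: it is only a crude low-pass, so high-frequency content from the sound chip folds back into the audible range.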
Breathing Life into Legacy: An Open-Source Emulator of Legacy Apple Devices
So, we're going to start. So Martijn here is going to tell us some stuff about Apple. And I have to confess, I'm very anti-Apple, so I was actually tempted to refuse this talk. So, Martijn, take it. Thank you very much. So good morning, everyone. Thank you for providing me with the opportunity to speak here. My name is Martijn de Vos, and today I will present to you my hobby project, which involves an open-source emulator of legacy Apple devices. In this talk, I will explain how I managed to emulate these kinds of devices, what it takes, what the challenges are, and what the next steps are. So let me first briefly introduce myself. I'm a postdoctoral researcher at EPFL in Switzerland. My main research topic is not actually emulation or reverse engineering, but distributed machine learning systems, which many people are working on nowadays, like LLMs and such. But I'm also a very big enthusiast of reverse engineering. I actually started doing this during my master's thesis already, and during my PhD I worked on reverse engineering some mobile banking applications in the Netherlands and other countries as well. That resulted in the first paper of my PhD. And two years ago, I decided to pick up this project. I was inspired by reading a blog post by someone who managed to emulate an iPhone in software, and that's how I was motivated to work on this project. This was actually Jonathan Afek. I think he was one of the first who managed to boot iOS, the operating system of iPhones and other Apple devices, with QEMU, which is a very popular open-source emulator. He managed to boot it to an interactive Bash shell, so he managed to boot this emulator to userland, which is quite an achievement. And I thought, well, I want to learn how that works. It involves some reverse engineering, which is a thing I really like.
I like seeing how software works, trying to decipher some of the secrets in the software. And it would also eventually contribute to long-term hardware preservation, and when people run it, it has some feeling of nostalgia. And, well, my first Apple device was an iPod Touch, so I decided to work on that. So after reading the blog post, I was a bit puzzled, and I was like, OK, where do I start? How can I set up my own project to work on this kind of stuff? You know, Apple has released many different devices over time, and the first question I had to answer is: which device am I going to emulate? If you think about contemporary devices, they are incredibly hard to emulate; at least, emulating all the aspects of these devices is a very, very challenging and difficult task. They contain neural engines. They have Face ID and Touch ID, which also have interactions with secure enclaves, but also software-based security measures like trust caches, which is a mechanism by Apple that only allows particular applications to have privileges. So I was thinking, if I go back in time and take one of the first devices by Apple, at least in the iPod Touch family, that should be somewhat easy to emulate. It is a device that was released in 2007, and it doesn't contain the complicated hardware peripherals that I just mentioned. And, yeah, hopefully that would be simple enough to emulate, which were some famous last words, because even these devices are very, very complicated, as I will outline a bit later in this talk as well. So I'm definitely not the first one to work on this kind of emulation. There are some related projects. I think the earliest attempt at emulating the SoC of an iPhone was by cmw.me, who is actually the founder of Corellium, which you might know as a company that provides virtualization services for both iPhone and Android applications.
Yeah, then we had the blog post that I just mentioned, which demonstrated the emulation of an iPhone 6S Plus. That work was picked up by someone else and eventually evolved into an iPhone 11 emulator. And there's also openiBoot, which is an open-source bootloader for early-generation Apple devices. All of these projects have been extremely helpful in understanding and connecting all the different pieces together, because without them I wouldn't have been able to get this far. So then I had to pick a framework for emulation. QEMU is one of the most popular open-source frameworks for this kind of emulation. It provides support for hardware emulation: you can define your peripherals, your hardware components, and implement their expected behavior. And it already comes shipped with support for many different protocols, like the USB protocol, network interfaces, SPI, I2C, SDIO, etc. So that was all very nice, but unfortunately it has a very, very steep learning curve. It's quite difficult to wrap your head around particular parts of the project, so most of the time I had to rely on existing emulations provided by QEMU to see how things work. And when doing emulation, you also would like to have a way of debugging your software, because you want to see which code path is being executed, what the register values are, and what's generally happening in the system. The nice thing about QEMU is that it automatically provides a GDB stub, a GDB server, that I can directly connect to; then I can step through the code, jump to functions, and inspect all the register values. And for the reverse engineering part, I've been using Ghidra, if I pronounce that correctly. It is a very popular open-source tool for reverse engineering, decompilation, and disassembly of binaries, and this has also been tremendously helpful.
So here on the right you can see, for example, some code related to the start procedure of the SPI controller, which controls the SPI interface. If you look at it, it's actually pretty readable. You can do a lot with this stuff, and the way Apple has engineered their software is very predictable. They're using the IOKit framework, which is very similar in structure; I mean, most of the peripherals look like this. You initialize some memory, you set some variables, and that's mostly it. So now let's talk a bit more about the emulation itself. My philosophy when it comes to emulation is that I wanted to stay very close to the actual hardware, to what's actually happening on the hardware, no matter how difficult that might be. What I noticed is that many existing emulators cut corners, which is not surprising, right? Because, for example, if you run into some kind of signature check, it might take a lot of time to get everything working, to get the right functionality, and to make sure that check passes. So one way is, for example, to just patch out that particular procedure or function call. Why did I want to do it differently? Because I had a feeling that any hack, any workaround I would introduce in the very early stages of working on this emulator would bite me back later. So I'd rather do it right very early in the boot process, where things might not be as involved as when dealing with higher levels like userland or applications. So I tried to get it right on the first try. Well, as expected, I still ended up with a bunch of hacks, patches, workarounds, and patched-out binaries, because some things I really, really couldn't wrap my head around, at least not within a reasonable amount of time. Another philosophy that I had: I started by following the boot chain. So I started with the lowest-level component here, which is the SecureROM, the boot ROM. This is the very first piece of code that runs on an Apple device.
It is actually fused into the chip of any device. If you find a vulnerability in there, it's very nice, because it cannot be patched out. That's actually something that happened a few years ago. The SecureROM loads another bootloader, called the low-level bootloader, LLB. That in turn loads the main bootloader, iBoot. Then iBoot, that component, loads the XNU kernel. When the kernel has launched, it starts the launchd process, which is the very first process that runs on the system. launchd launches SpringBoard, which is responsible for drawing the iconic user interface with the app icons and the home screen. SpringBoard in turn starts all the different apps, like the Alarm app, Safari, and other applications that you are familiar with. So I started working on the boot ROM first. As a very first step, I had to get the boot ROM, which is fortunately provided online; it was dumped. So that's very nice. The main responsibility of the boot ROM is not only to load the next bootloader, the low-level bootloader, but also to initialize some key peripherals, like the clock, the timer, and the USB stack. Because even if everything else on the device fails, the boot ROM allows you to restore the device using some USB protocol. So if something goes wrong, you can use DFU mode to restore, to refresh your device. Now, I had some instructions running there, but I very quickly found out, when emulating this binary, this boot ROM, that it jumps to some unknown memory locations. That was a bit problematic, because I didn't really know where it jumped to. I looked around a bit on the internet and asked around, and it looks like this first-generation device is using some proprietary logic by Samsung. Very early generations of Apple devices were made in collaboration with Samsung, so the boot ROM was also made by Samsung.
And I didn't really have any idea of what happens there, because the boot ROM is very obfuscated and very small, and there are almost no strings and no context to work with. I also didn't have any physical iPod Touch device at that time, so I couldn't really figure out or dump that part of memory. The same actually goes for the low-level bootloader; I was running into the same problem there. It jumped to some unknown memory locations, so I decided to skip these two parts and go straight to iBoot. Yes, and this is how I load iBoot in code. iBoot is the main bootloader. It is responsible for loading the kernel from, basically, the hard disk. I was very fortunate that the source code of iBoot got leaked in 2018. That was actually a newer version of iBoot, but at least it gave me some idea of how this all works. So I tried really hard to map all the different components in the leaked source code to what I see in Ghidra in the binaries. And I managed to boot iBoot and get all the peripherals up and running that iBoot expects. One thing about that: there is this device tree, which you might also be familiar with if you work with low-level Linux. It is basically a big dictionary of all the peripherals and their properties. It is included in the IPSW file, which is like the firmware file that you can download from Apple and that is being installed. It is populated by iBoot. iBoot, for example, gets the MAC address for the Wi-Fi interface and then injects this number into the device tree. So here on the right, you can see a part of the device tree containing some information about the crypto AES engine; it contains some identifiers and some other things. That was also dumped, so I used it as a reference to get an idea about which peripherals there are to emulate. And I can tell you that these devices are extremely complicated. This is a diagram that I made of all the components that I managed to get up and running.
Not all of them are fully functional, but most of them at least have some functionality. And this is for the iPod Touch 2G, which is slightly more complicated than the first-generation iPod Touch. You can talk to most of these peripherals through something called memory-mapped I/O. So in the memory map, there is a small part that is allocated to a particular peripheral. Here on the right, you can see the addresses of all these peripherals, which I also mostly got from the device tree. You can write to these memory locations to talk to your hardware devices. And then the main challenge becomes, of course, to talk with these hardware devices in such a way that you get the expected responses, and that the kernel and the other parts of the boot stage are happy with what these peripherals are saying. So this is an example of how you can initialize the hardware components in QEMU. You define some initialization methods, and then you include them in some main file. I won't spend too much time on this now. This is how you implement the functionality of each hardware component: you create a read method and a write method. The read method is called when a hardware address associated with the peripheral is read, and the write method is called when you write to a register. You can see, for example, in the read method that you have a switch: you look at which address you are reading from, and then you return the right response. And sometimes that can be very arbitrary. I mean, I haven't deciphered the meanings of all the registers and what they expect, but you can at least make a best-effort attempt at returning the values that make the kernel happy. And this can become complicated very quickly. So here you can see a part of the SPI controller, which was a particularly difficult component, because Apple does some, well, weird things sometimes.
They make some modifications to their hardware which don't always follow well-established hardware protocols, so to say. And finally, you attach the peripheral to the overall machine in QEMU, and optionally you can connect the IRQ, the interrupt request line. So interrupts are also functional there. Again, I won't spend too much time on this now. So after iBoot was running, I had to load the kernel. The kernel uses IOKit, and it starts all the device drivers that are declared in the device tree. So whereas the low-level bootloader and iBoot would only load the most important peripherals, this starts all the peripherals. Here on the right, you can see some of the peripherals that I reverse engineered with Ghidra. You can see the LCD display, the power management unit, and some other functionality that I didn't even know was part of the iPod Touch itself. And this mostly follows a very similar protocol: when you start a peripheral, you usually execute some reset procedure, or you wait for an interrupt or something to indicate that the device is ready. After all these devices are loaded, then you start launchd. And this is the part I spent most of my time on, because I had to get past all these peripherals; I had to understand how they work. And the further you get into the boot chain, the more complicated things become, because then you are really building on the correct functionality of, say, the clock and the timer and interrupt requests, et cetera. So, roughly 20 peripherals later, I got most of the things functional. The clock, timer, and interrupt controllers are all fully functional. I'm pretty sure there are a few bugs left, but nothing too major. And there is only partial support for some of the more involved peripherals, just enough to make it past initialization.
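The read/write register pattern and the IRQ wiring described above can be sketched in plain C. This is a simplified, self-contained imitation of how a QEMU peripheral behaves, not code from the emulator itself: the register offsets, names, and "loopback" behavior are invented for illustration, and real QEMU devices register these callbacks through a `MemoryRegionOps` structure and use `qemu_irq` lines rather than a bare function pointer.

```c
#include <stdint.h>

/* Invented register offsets, for illustration only. */
#define REG_CTRL     0x00
#define REG_STATUS   0x04
#define REG_DATA     0x08
#define STATUS_READY 0x1

/* Stand-in for a QEMU IRQ line: a callback the machine wires up. */
typedef void (*IrqLine)(void *opaque, int level);

typedef struct {
    uint32_t ctrl;
    uint32_t data;
    IrqLine  irq;
    void    *irq_opaque;
} FakeSpiState;

/* Read callback: dispatch on the register offset and return a value that
 * keeps the guest driver happy (e.g. always report "ready"). */
static uint64_t fake_spi_read(FakeSpiState *s, uint64_t offset)
{
    switch (offset) {
    case REG_CTRL:   return s->ctrl;
    case REG_STATUS: return STATUS_READY;
    case REG_DATA:   return s->data;
    default:         return 0;  /* unknown register: best-effort zero */
    }
}

/* Write callback: store the value; a data write "completes" a transfer
 * and raises the interrupt line, as a real controller might. */
static void fake_spi_write(FakeSpiState *s, uint64_t offset, uint64_t val)
{
    switch (offset) {
    case REG_CTRL:
        s->ctrl = (uint32_t)val;
        break;
    case REG_DATA:
        s->data = (uint32_t)val;
        if (s->irq)
            s->irq(s->irq_opaque, 1);  /* assert the interrupt */
        break;
    default:
        break;  /* ignore writes to unknown registers */
    }
}
```

In real QEMU, the two callbacks would be mapped at the peripheral's base address from the device tree, and the interrupt would be connected to the interrupt controller when the peripheral is attached to the machine.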
And then we're talking about peripherals like TV out, which is used when you connect your iPod Touch to a TV, the GPU, and also the accelerometer and the light sensor; they're not really important at this point. I was very fortunate that I could avoid hardware GPU rendering with a flag. So the GPU rendering in this emulator happens fully in software, which is slower, but still reasonable enough to use the iPod Touch itself. So there's a lot of work to do, but at least at this point I managed to boot to userland. To give you one more interesting challenge: the persistence layer. The iPod Touch contains two types of memory. There is the NOR memory, which contains small binaries; I think it's at most a few megabytes. And you also have the NAND memory, which is like eight gigabytes, and which stores all your applications and the operating system. There are some key differences between the layouts of NOR and NAND, so I had to spend a lot of time, when I emulated the iPod Touch 2G, making sure that also works. The main problem here is that when the kernel requests some kind of block, let's say block five, it uses logical block addressing, and that doesn't match how the NAND layout underneath works. So I had to really figure out how something is mapped from a logical block level to the physical block level, and that took a lot of time. I ended up with some scripts in a separate repository that take a dmg file and convert it to a raw file system, a file system as it really is in the hardware. This is the diagram for that, to give you some more context. This is for the NAND. We have the file system, which is implemented in the kernel, and if it wants to get something from the operating system, it uses a logical block address that goes through two different layers, the flash translation layer and the virtual flash layer, again with their own numbering, addressing, and mappings.
And that eventually results in some physical page number and a CE, which is basically like a bank, a number between one and eight. In the interest of time, I'm going to skip this, but I just want to say that multi-touch, even though it looks very simple (how hard can it be to convert a touch on a screen to an X and Y coordinate?), was very, very complicated to get right, and for this I actually needed a real device. Most of the things I could do without having an actual device, but for this I needed a real one, because I had to play with touches and see how the encoding of a touch works. So here on the right you can see, well, me playing around: you press a button, and then I recorded what the multi-touch driver gives back to me. So all in all, when doing all of this, I managed to boot the iPod Touch 1G to the home screen. You can see it's a pretty basic home screen, not many applications. I think I got this running about one and a half years ago, and a few months ago I managed to get the iPod Touch 2G working as well, running iOS 2.1.1; the iPod Touch 1G is running iPhone OS 1.0. And that mostly concludes my presentation. I open-sourced all the code; I created a GitHub project out of it, which is a fork of the QEMU project. I'm not sure if I want to upstream it, because it has a lot of ugly code and a lot of, well, workarounds. But contributions are very welcome. It currently has support for the iPod Touch 1G and 2G, and I'm currently focusing on getting the iPod Touch 2G stable, so I can get the App Store and third-party applications up and running. So that's all, thank you. And if you want to know more, I have some blog posts with more technical details on my personal website. APPLAUSE Right, hello. Yeah, so we have some time for questions. I hope the people asking questions are here in the front, because I don't want to run to the back.
But I'm going to start with a question, because you mentioned Corellium, which is awesome by the way. They are very expensive, but they are awesome, and Apple sued them into oblivion... which, sorry, has nothing to do with this. But the question is: has Apple made any friendly inquiries? No, no, no. I think this project is still too insignificant for Apple to care about. I also know about Rockbox, for example, which targets the older iPod generations. I'm not sure; I don't think they've been sued. But I'm not that worried about it right now. OK, excellent. Questions? Sorry, come to the side. Hi, thank you very much for your talk. Only one simple question: why did you choose the iPod Touch and not the iPhone platform? Is it only a simpler problem, or are there patents or other problems in the way? Thank you very much. Yes, thank you. So the question is, why did I choose the iPod Touch and not the iPhone? Well, when I started this project, I was not familiar with the architecture of either. But I was thinking, well, the iPod Touch contains at least one less peripheral, namely the baseband, the modem baseband, and I was not sure how critical that would be for the entire booting procedure. So that was, I think, my main motivation. But most of this stuff can also be applied to the iPhone. I think with some changes, you can get the iPhone 2G working, because the iPhone 2G is architecturally similar to the iPod Touch 1G. Yeah. Hi, great talk. What are your future plans for this project? Do you want to support newer devices, or move to a more modern iOS version? Yeah, thank you for your question. So what are my future plans? I am currently working on getting USB up and running. There is an independent researcher who also managed to get syscalls between the guest and the host running, so that's pretty cool; we can do some syscalls. So I'm currently working on USB.
Whether I want to work on newer generations, I'm not so sure. I think it would be possible to emulate them, but I think having one stable and, well, actively used emulator is better than having ten fragmented, half-supported emulators, because there are so many Apple devices out there. So yeah. OK. Hi, thank you for this great talk. I was wondering, you were talking about getting the App Store up and running. Have you considered getting in touch with Jay Freeman, the author of Cydia? Cydia... no, I haven't considered getting in touch with him. I know some people are asking me: can we jailbreak it and then install Cydia? I think we probably can, but there's almost no tooling around this emulator at the moment, so getting these jailbreaks up and running is kind of difficult right now. But I think it's a good suggestion; I think at some point I should. Yes. Thank you. Yes. Anybody at the front, hopefully? Thank you. Hi, and thank you for your talk. I don't remember: in 2007, did this type of device require activation back then or not? I think they did indeed require activation. Oh, actually, that's a good point. I used activation tokens from an actual device, because I also had to match the serial number, et cetera. So I matched the serial number, I used activation tokens from an actual device, and then it worked. But I could as well have patched out lockdownd; lockdownd is the daemon responsible for checking whether everything is activated, et cetera. I could as well have patched that out. OK. Thank you. Great talk. Have you had the opportunity to play with JTAG debugging, to cross-check whether your emulator works like a real device? What are you referring to? Like, how would you do this check? I would say you try to execute some peripheral access, both on the real device and in your emulator, and you cross-check the read results. That is a good point. I think you could do it with openiBoot.
So I managed to install openiBoot on the actual device, and there you can play around with the peripherals. So I think you can have some kind of trace where you just fire requests at the hardware, and you get some responses, and you can cross-check that with what I get. No, I haven't done that yet, but I think that's an excellent idea to make sure that the emulator is mostly compatible with, or the same as, a real device. So I had a small question, actually, because at the beginning you mentioned you're a postdoc. How much time do you spend on this? It's very difficult to say, because sometimes I have a week where I spend every evening on it, and sometimes I don't spend any time on it for three weeks. It also depends on my main work schedule; it depends on paper deadlines, as a postdoc, obviously. Yeah, I think when you get closer to getting something up and running, you tend to be more motivated, and then I spend more time on it than when I'm completely stuck. And yeah. OK, does anybody have a question? I can keep going on. So another small question: in one of the previous talks, they mentioned motivation. How do you get the motivation to start something like this? And where do you start? Can you tell us something about that? Yeah, I think for this, well, first of all, you need some curiosity. You want to know how things work, and you have to be able to dig deep into some components. And you know, there are many components, so you will inevitably run into something that you don't know anything about. So I learned a lot about all the different components that are in there. But another very important thing, I think, is persistence. Because many times, for example when working on the multi-touch or the NAND, I was like, yeah, I really don't know how this works. And then you solve a small part, and then it turns out there's yet another layer of indirection going on.
And you have to figure that out again. And then it turns out that something you did earlier was based on a wrong assumption, which breaks all the other components further down the pipeline.
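To make the NAND addressing discussion from the talk a bit more concrete, here is a toy sketch of the idea: the kernel asks for a logical block, and a translation step yields a physical page and a CE, the bank number. The real stack goes through separate FTL and VFL layers, each with its own mappings recovered from on-flash metadata; the bank count and the simple interleaving scheme below are purely illustrative assumptions.

```c
#include <stdint.h>

#define NUM_CE 8  /* illustrative number of banks (CEs) */

typedef struct {
    int ce;    /* chip enable / bank index */
    int page;  /* physical page within that bank */
} PhysLoc;

/* Toy translation: interleave logical blocks across banks. A real FTL/VFL
 * would look this up in mapping tables read from the NAND itself, which is
 * exactly the part the speaker had to reverse engineer. */
static PhysLoc translate(uint32_t lba)
{
    PhysLoc loc;
    loc.ce   = (int)(lba % NUM_CE);   /* which bank */
    loc.page = (int)(lba / NUM_CE);   /* which page in that bank */
    return loc;
}
```

Even this trivial scheme shows why a raw dump of the chips cannot be mounted directly: consecutive logical blocks land on different banks, so a dmg image has to be re-laid-out before the emulated kernel can read it.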
Opening Energy: Reimagining this Ecosystem through Open Source devroom
So welcome to the Energy Devroom. We're all happy that you're here. For those who are speaking today, thank you for generating the content for the talks today. It was really exciting to see the proposals that came in. Just a couple of, I guess, housekeeping rules. If there's an empty seat in the middle, to make space for people who might be coming in, we're going to ask you to squeeze to the middle. It would be great if you could start right now, because I'm seeing some empty seats in the middle. We see empty seats, yeah. Squeeze together, because there are people coming in, so they can have somewhere to sit. Thank you. Anything else we need? Should we introduce anyone? Yeah. Who's going to introduce the volunteers? Yeah, so, the organizers for the room. I'm Rachel Tipton. I work for Open Climate Fix. We have Boris Dali from RTE. Anna, yeah, do you want to introduce yourself there? Hi, I'm Anna, yeah, one of the organizers. We have Dan from Linux Foundation Energy. Kai from PIONIX, working on EVerest. Nico, who's been managing all the things the past days. Thank you. Yeah, thank you all. And also, if there are speakers who have questions about the setup, you can come to us between the talks if you're a little worried about how you're going to get your computer set up. Feel free to approach us in the blue shirts, and any of the other managers will be trying to help you. Okay. And by the way, there's been a small organizational change: the first and the second talk have been switched around, so don't be confused. Yeah. And have fun. Thank you for coming. Thank you.
EVerest: One stack to charge them all?
Yeah, I'm just going to give a really quick presentation about EVerest. First, a few words about myself. My name is Kai. I have a background in computer science and robotics, and I've been working at PIONIX on this EVerest project since early 2021. So what's EVerest? It's a complete software stack for electric vehicle chargers which runs on embedded Linux. It's released under the Apache 2.0 license, and the aim is to support many different hardware platforms; you can also build your own. It comes with a lot of different modules already: board support drivers for AC chargers and for DC chargers, for example. It comes prepared for high-level communication: we have SLAC implemented, DIN SPEC 70121, and ISO 15118-2 and -20. There's OCPP 2.0.1 and 1.6 support, with drivers for power meters, for DC power supplies, and so on. The project is primarily written in C++17. There's also language support for JavaScript and Python, and relatively recently we introduced support for writing your own modules in Rust. Yeah, this is a, hopefully you can read the slide, but it doesn't really matter that much; I'm just going to talk a little bit about the timeline, how this project came to be. The first ideas on how to improve the EV charging ecosystem began at the end of 2020. The company PIONIX, which started this project, was then founded in early 2021. About a year later, EVerest was announced as the latest Linux Foundation Energy project, with the source code being published in January 2022. Then chargebyte joined the technical steering committee, and they also started integrating it into their charge controllers. In the beginning of 2023, different manufacturers of charging controllers and suppliers of chips launched several dev kits that are EVerest-enabled. And in October, we held our first little conference with around 100 people, the aptly named EVerest Summit.
There's always a bit of a mountaineering pun going on with some of the names. At pretty much the same time, we had the US Joint Office of Energy and Transportation as well as charger manufacturer Qwello join our technical steering committee. And yeah, that leaves us pretty much here at FOSDEM 2024, with lots of exciting things planned for 2024 as well. This is a slide showing a lot of the ecosystem around it already. We have involvement from academia, from enthusiasts just wanting to work on this, but also charging station manufacturers, component suppliers, and standardization bodies as well. Yeah, then looking at 2023, that was basically the year when the project kind of took off, I would say. I held a short talk at FOSDEM last year in February, and you can see the stream of contributions increasing over the whole year, which was pretty cool. Lots of pull requests to review, lots of things to merge, and a lot of community engagement, which brings its own challenges with it. So it was a fast-growing community. In 2023, we basically only had a mailing list, and at some point it was basically unmanageable because of all the traffic. So we thought about how we want to tackle this, how to make it sustainable for the future. We thought about moving to a more chat-based solution, a Zulip chat, and you can see the number of messages on the mailing list going down at the same time as the number of active users on the chat system went up. So I think this is on a good track, and we'll just have to see how this works out over the next couple of months. And with this introduction of the chat system, we also created a new organizational structure to better engage with the community and manage this growth. So we introduced different working groups.
One of them is focused on car communication, so ISO 15118, CHAdeMO and things like that. Another working group that I'm very active in is cloud communication, which is mainly focusing on OCPP at the moment. Then there's one covering everything related to the core of the EVerest project itself, like build tools and the foundation of it, which has a bit of overlap with the CI and testing working group. And for everything there is not really a place for, there's the general and Q&A working group. What I find really interesting is that it's a multimodal approach: we have chat streams where people can ask questions and engage in a text-based way, but also regular meetings, video calls, where people can ask questions as well. This seems to work pretty well. Let's talk quickly about some milestones in 2023. We had set a goal of monthly source code releases, and I think we more or less hit that goal: we had 10 monthly source code releases in the year, and we also just published the January 2024 release. Based on those source code releases, we also provide a Yocto layer for Kirkstone. We're also thinking about a new release strategy going forward, maybe doing releases every two or three months and focusing more on the stability of these releases, but this is still up for debate at the moment. Among the technical milestones of 2023, we worked pretty hard on OCPP 2.0.1: the core and advanced security profiles are pretty much done, and some parties are already going into certification based on that code. In general, there was very active development on OCPP in the last year; OCPP 1.6 we also continuously improved. On the car communication side, we now have a pretty well-tested DIN SPEC 70121 as well as ISO 15118-2 implementation, including Plug & Charge.
And we had the first successful charging sessions with ISO 15118-20 DC. To make all of this work well, we tried to attend lots of different testing events. We attended the OCA OCPP Plugfest in Arnhem, as well as three different CharIN Testivals, which are focused on testing interoperability with ISO 15118. Some of you might remember that last year I talked about the open hardware we launched at the end of 2022, early 2023: the Yak and Yeti boards, released under the CERN Open Hardware License Version 2. I'm not going to go into any detail here; if you're interested, there are two talks I gave last year about this hardware, and you can find everything you need on the GitHub page as well. Just another cool thing we built with this hardware: a DIY DC charger. We basically plugged it together, with a wiring diagram very similar to one you can also find on GitHub, and used our AC controller hardware to drive a functioning DC charger. Another cool thing we've been working on in the last year is what we call the micro megawatt charger. This is a handheld DC charger powered by EVerest. It started out as an early prototype in early 2023, still in a box with cables and everything, and ended up as something that fits inside a small box. What's cool about it is that it's a functioning, battery-powered handheld DC charger: you actually have voltage on the DC pins, so you can plug it into a car and go through the whole charging sequence with the car. Not just protocol testing, but you can actually get to the power delivery, and then most cars basically say: okay, I can't do much with one watt, I'll just stop. So why do we do this?
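The charging sequence such a tester walks through can be sketched roughly like this: a simplified, illustrative Python model of the DIN SPEC 70121 / ISO 15118-2 DC message flow, not EVerest's actual state machine.

```python
from enum import Enum, auto

# Simplified DC charging session stages, roughly following the
# DIN SPEC 70121 / ISO 15118-2 message flow (illustrative only,
# not EVerest's actual implementation).
class Stage(Enum):
    SESSION_SETUP = auto()
    SERVICE_DISCOVERY = auto()
    CHARGE_PARAMETER_DISCOVERY = auto()
    CABLE_CHECK = auto()
    PRE_CHARGE = auto()      # charger matches the battery voltage
    POWER_DELIVERY = auto()  # EV starts drawing current
    SESSION_STOP = auto()

def run_session(max_power_w: float) -> list[str]:
    """Walk through the stages; a 1 W tester reaches power delivery,
    after which the car typically aborts, as described in the talk."""
    log = []
    for stage in Stage:
        log.append(stage.name)
        if stage is Stage.POWER_DELIVERY and max_power_w < 100:
            log.append("EV: not enough power, stopping")
    return log

print(run_session(max_power_w=1.0))
```

The point of the handheld charger is exactly that it exercises this full sequence against real cars, not just the protocol layer.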
It's pretty cool to just walk around at these test events, but also on a normal parking lot (with the consent of the owners, of course), plug this into cars, and generate log files and packet dumps and things like that. We also try to publish these on GitHub afterwards. Then we worked a little bit on EV simulation: we got a small children's electric quad, outfitted it with a CCS port, and it runs a hacked-up EVerest EV simulation. I think it's one of the only children's electric quads that can charge on a commercial DC fast charger. We have some more plans with that in 2024: we want to have this EV simulation natively in C++, include an EV manager in there, and extend it with ISO 15118-20 support. There's also a little bit of work going on on CHAdeMO at the moment. And that brings me to the roadmap for 2024, in no particular order. As I just mentioned, the native EV simulation; we want to complete our OCPP 2.0.1 implementation and start integrating OCPP 2.1 once the spec has been released. There's going to be a lot of work on ISO 15118-20: we have a C++-based EXI parser and a parser generator in the works, we want to include Plug & Charge there, and work on AC unidirectional as well as bidirectional power transfer. There's also a first CHAdeMO prototype for the charger side in the works. If this sounded interesting to you, here's how you can get involved. You can find documentation and how to engage with the project, like the mailing list and the group chats, on everest.github.io. If you just want to look at the code, it's on GitHub, and you can also find the open hardware under those two links. I'm looking forward to your engagement, maybe contributions, and thank you very much. We have about three minutes if anyone has any questions.
Yes, I have two questions. The first is about recuperation of energy when you are going downhill, and also motor deceleration, with trucks; is that in the system? And the other question is about this hardware or software for bicycles with electric assistance. Okay, I think the first two are mostly on the EV side of things, the proper EV side of things, and we are mostly focused on EV chargers. But for bicycles, I think there's some work going on in some standardization bodies at the moment to specify charging for small electrically assisted bikes, as well as the little electric motorcycles, the scooters and things like that. Doesn't look like it. So how much has the open hardware helped the project, in terms of contributions or, say, vendor adoption? I think it's really hard to quantify, because people can just look at the designs and build stuff with them. As a company, we had some orders for finished kits of these things, and I think we sold quite a few of those. So I think it helped. But we see it more as a dev kit that people can just play around with. And it's really not that complicated. Especially since it's an AC charger: you need some relays and a way to drive these relays. There's a power meter on there, but usually if you want to build something for yourself, you don't need that. And the high-level communication board needs a modem, a power line communication modem, to talk with the car, but only if you want to do ISO 15118. If you don't want to do that, you can just leave it out as well and build something really, really simple. But for starting to hack around with EVerest and all of these more advanced things, I think it helped, and there's definitely some interest there. Thank you very much.
Using FlexMeasures to build a climate tech startup, in 15 minutes
Welcome. Thanks for having me. My talk was actually scheduled for one o'clock this afternoon, but I'll jump in now. Am I too loud? It's fine. Okay. Well, I am Nicolas from Germany, living in Amsterdam. I'm co-founder of Seita Energy Flexibility, and we co-founded the FlexMeasures project. I will briefly talk about the FlexMeasures project. Last time at FOSDEM, we also talked about some specifics; I like to introduce the project with specific applications. So last year we talked about our vehicle-to-grid implementation, where we use FlexMeasures and Home Assistant. Today I'll take more of the developer perspective: how, as a developer, you would actually work with FlexMeasures. I only have 15 minutes, so I will fly over it a bit. Don't worry, we're not going to read every line of code; it's just to give you an impression of what it would be like. As an introduction to FlexMeasures: we have been focusing on behind-the-meter optimization, so all these things you find behind the meter. There's enough complexity there to run an optimization and find the best running times for the things that are flexible, which are usually EV charging and batteries; today, we talk about hot water storage. Some of these things are not exactly behind the meter, but they matter as well. In the Netherlands, we have congestion on the grid, which influences the optimization you're doing; it's a constraint, and there are dynamic energy prices. So it becomes quite an interesting problem. Right. So, very briefly: FlexMeasures is a platform that takes in a lot of data, like meter data or prices, and gives you the best timing for your flexible assets; that's a very simplified picture of what it is. We have used it in a couple of areas, like I mentioned: bidirectional charging, in industry, in water sanitation, and now we're working on smart heating as well. Here's a little look at our dynamic visualization of what FlexMeasures knows at any given time.
So this is from the web UI of FlexMeasures. You can replay what happened: what data FlexMeasures knew, and what forecasts it knew. But I want to spend 10 minutes on a very brief tour. What if you were an energy startup? Let's say you work on smart heating, and you want smart scheduling for your e-boiler, as an example. These are the things you would like to do, and I will go through each of them. I'll touch upon a couple of ways to interact with FlexMeasures: writing your own FlexMeasures plugin, a Python client, a command line interface and, of course, an API. While I go through this list, everything will be touched upon for illustration, to show what you can do. The brief picture is that there's a house with the e-boiler, your energy asset, with temperature readings, and a FlexMeasures server over here in the cloud. This is a little bit of an architecture diagram of what we'll try to touch here. The FlexMeasures client will send temperature readings, and it will ask the server to compute a schedule for the boiler. There's a data platform where we can get prices. We'll have a crontab, because we will have to do some things regularly; let's keep that in mind. So this is the very first step. You don't have to read everything; I'm just showing that we provide a cookiecutter template so you can quickly get up to speed with your own code structure. You choose a name and a description, and you say: yes, please give me the API blueprint. Blueprint is a word from the Flask ecosystem, because FlexMeasures is a Flask application. And you get some boilerplate like this, for a boiler. This is the one endpoint we're doing here. What if we want to create a new customer for this project? This is a lot of code; this is basically the endpoint we wrote as an example. I'm not going to read everything.
Basically, this is how you plug it in: it's going to be plugged into FlexMeasures and available as an endpoint. We're creating a user and an account. And maybe this is the most interesting part, your business objects; I will go a little deeper here. This is roughly the same code. We're creating the boiler as an asset, and we're creating a couple of sensors. Here are two examples, a bit bigger, where we really define, we tell FlexMeasures how to handle them: what kind of units we are handling and the event resolution, so that FlexMeasures knows what to do with them when data arrives and schedules have to be made. And if that happened, if somebody called this endpoint and your account was made, you would end up in the FlexMeasures UI, and you can see them here. Next step: let's say we measure the temperature locally. You have your own sensor, and you want the temperature data to end up in FlexMeasures as well. Here's a small example of how to use the FlexMeasures client. Basically, it provides you with some nice code to work with more easily, but it actually uses the FlexMeasures API in the background. For fun, we actually had the temperature reading in Fahrenheit; we say, when we send it to FlexMeasures, that the data is actually to be stored in Celsius, and it will automatically get that right. This is where a lot of the work goes, as you can imagine. Otherwise, this is just sending the reading; there's not much more. You'll do this regularly from your local script that runs on your Raspberry Pi, or whatever you're running locally. One more step: there's some external information we need. Temperature is a local reading from your local asset; prices are a good example of information from a third party that also has to be collected in FlexMeasures. Another example is weather forecasts. In this example, I'm showing that we actually wrote a plugin for that, so we're cloning this plugin we wrote.
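Under the hood, a client call like the one described boils down to posting a JSON body to the server's sensor data endpoint. Here is a minimal sketch of building such a payload in plain Python; the endpoint path, entity address format and host id ("fm1") are assumptions based on FlexMeasures v3 API conventions, so check your server's API documentation.

```python
import json

# Sketch of the JSON body for posting an hour of temperature readings
# to a FlexMeasures server (endpoint path and entity address format are
# assumptions; see e.g. POST /api/v3_0/sensors/data in the API docs).
def make_sensor_data_payload(sensor_id: int, start: str, values, unit: str) -> dict:
    return {
        # Entity address format used by FlexMeasures sensors (host id assumed)
        "sensor": f"ea1.2021-01.io.flexmeasures:fm1.{sensor_id}",
        "start": start,          # ISO 8601 timestamp, with timezone
        "duration": "PT1H",      # ISO 8601 duration covering the values
        "unit": unit,            # FlexMeasures converts units on ingest
        "values": list(values),  # one value per event-resolution step
    }

payload = make_sensor_data_payload(
    sensor_id=42,
    start="2024-02-03T10:00:00+01:00",
    values=[140.0, 141.5, 142.0, 143.1],  # readings in °F, stored as °C
    unit="°F",
)
print(json.dumps(payload, ensure_ascii=False))
# A real script would now POST this, e.g.:
# requests.post(f"{base_url}/api/v3_0/sensors/data", json=payload, headers=auth)
```

The Fahrenheit-to-Celsius conversion he mentions happens server-side, driven by the `unit` field versus the sensor's configured unit.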
ENTSO-E is the organization of European transmission system operators, and they provide a data platform where you can get various things, like prices, but also day-ahead allocations for all the transmission zones. So we say: we want the Dutch transmission zone, please give me the prices for that, and we configure everything. And then this is the command: through the FlexMeasures CLI, this plugin has registered a group of commands, for instance to import day-ahead prices. All of this, how we wrote the plugin, is public. So if you call this regularly, say once per day, you'll always have the next day-ahead prices in your system. Here's a small visualization of one day of prices in the FlexMeasures CLI. Excuse me. Okay, now I'm not sure how much time I have. Eight minutes. All right, that's not too bad. The main part now is that you want to actually tell FlexMeasures to give you an optimized schedule for your boiler. I could do that via the FlexMeasures client as well, but I'll just show how to use the API directly. This part is not so interesting; of course, you have to have an authentication token. But I have to spend a bit more time here: a lot of the time we spent when we made FlexMeasures went into how you configure the problem. How do you tell FlexMeasures the constraints of the problem? In the back, FlexMeasures will take your information about your setup and your problem (basically, you could call that business rules) and translate it dynamically into a linear program. FlexMeasures contains, I think, three different algorithms. We have one focused on storage-based problems; that's what we also use for heat, for heat batteries, as we call them. We have one for when you just want to allocate processes. And it's a very important part of developing a new application that you can tell the FlexMeasures server: this is how I want you to treat this problem.
Here's a constraint you don't know about, or here's a local thing you don't know about. That's where we're working on two things: the flex model and the flex context. The flex context would be, for example: these are the prices that are relevant. We also have a project where we don't use prices but a CO2 signal, the anticipated CO2 content of the grid. The flex model is a bit more detailed. This is not everything you can do, but basically, we're saying: well, the state of charge of this heat battery is this many kilowatt hours. That's local knowledge you have. Here are some constraints: I can't go under this, we don't want to go under this. And also, here's a target for you: in the morning, I need to have this much energy content in my battery. I think this could also be a percentage; we're pretty flexible there. Some other constraints too. You can see how these translate into the constraints of a problem. Then you call our API to say: for this fill rate, I want a schedule, please start. That will actually trigger a scheduling job, and FlexMeasures will usually pass this on to a worker. In our implementations, we have a web worker and computation workers that will handle those. Then you can call this GET endpoint to check if your computation is ready. It will usually not be ready after three seconds, but soon after. And then you get your values here, and you can implement these settings locally. Let's say you asked for a schedule for 12 hours; then your local gateway has the plan for 12 hours. If anything changes on the ground, you just ask for a new one; you'll update as you go. That's the general behavior. I'm almost done with the tour here. One thing we may want to do is have, in FlexMeasures, a nice dashboard that has the most crucial data stacked on top of each other for some inspection.
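To make the "business rules into a linear program" step concrete, here is a toy version of such a storage scheduling problem, written with SciPy purely for illustration. This is not FlexMeasures' actual formulation (the talk notes it builds its problems with Pyomo); it just shows how a fill-rate bound, a capacity limit and a morning energy target become LP constraints.

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP in the spirit of a flex model (illustration only, not
# FlexMeasures' code): choose a fill rate x[t] (kW) per hour so the
# storage hits its morning energy target at minimal cost.
prices = np.array([50.0, 10.0, 10.0, 50.0])  # price per kWh, per hour
capacity_kwh = 4.0                           # storage capacity
target_kwh = 4.0                             # "soc-target": full by morning
max_rate_kw = 2.0                            # fill-rate bound per hour

T = len(prices)
res = linprog(
    c=prices,                                # minimize sum(price[t] * x[t])
    A_ub=np.tril(np.ones((T, T))),           # cumulative state of charge...
    b_ub=[capacity_kwh] * T,                 # ...never exceeds capacity
    A_eq=[[1.0] * T], b_eq=[target_kwh],     # total energy hits the target
    bounds=[(0.0, max_rate_kw)] * T,
)
print(res.x, res.fun)  # charging lands in the two cheap hours; cost 40.0
```

The solver places the full 4 kWh in the two 10-price hours, which is exactly the behavior the dashboard screenshot later in the talk shows for the boiler.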
And then you can actually put that on the boiler asset, and in FlexMeasures you have these nicely stacked, right? You want to see what you've been using for the optimization on top, even though it comes from a different asset. This is something for everybody; all the assets can use it. As you remember, we had some four sensors that are relevant, but we decided these are the two we want to see. So we can easily see that, in a period of low prices, FlexMeasures has tried to fill the boiler at those times; you can see the signal here. I'll skip over this a bit because, yeah, I originally had a 25-minute idea for this talk. Very quickly: we also noticed it's very important to do some reporting. FlexMeasures provides some logic for that, so that you combine sensor data to get outputs about what happened, for instance costs. That's very important. That can become a regular job, where you say: okay, the day has happened, we optimized as well as we could, let's calculate how much the energy cost us. So you combine just the prices and the fill rate that actually happened. But we also saw that there are many more interesting computations that people want. This is a very simple multiplication, but we've made a pretty sophisticated architecture, so you can bring a couple of sensors together for a new result, which can even be used further in your next optimization. It's a very flexible system we've built here. And this is the project website. From there, you'll find the GitHub repository, the ReadTheDocs documentation, and more information; for instance, I was interviewed for a Python podcast where I go into more detail. The mailing list contact, everything's there. You can also just write to me directly, of course, if you're interested in doing something yourself; and for joining our TSC, the technical steering committee, everybody's welcome. And that's it.
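That cost report (prices times the realized fill rate, summed over the day) is conceptually a per-interval multiplication. Here is a minimal pandas sketch of the idea with made-up numbers; it is not FlexMeasures' actual reporter code.

```python
import pandas as pd

# Minimal sketch of the cost report described above (not FlexMeasures'
# actual reporter): combine prices and the realized fill rate per interval.
idx = pd.date_range("2024-02-03", periods=4, freq="h", tz="Europe/Amsterdam")
price = pd.Series([50.0, 10.0, 10.0, 50.0], index=idx)   # price per kWh
fill_rate = pd.Series([0.0, 2.0, 2.0, 0.0], index=idx)   # kW, 1 h intervals

cost = price * fill_rate  # cost per interval (kW over 1 h = kWh)
report = pd.DataFrame({"price": price, "fill_rate": fill_rate, "cost": cost})
print(report)
print("total cost:", cost.sum())
```

The "bring several sensors together for a new result" architecture generalizes exactly this pattern: any aligned time series can be combined, and the result stored as a new sensor for further use.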
Yeah, there are lots of things to do, of course. I've touched upon a couple of applications, like vehicle-to-grid, smart heating and industry, but the roadmap is still full, of course. There are so many things in the energy world, behind the meter and a bit above it, to optimize. Thanks. We have time for one question. You said that you create a linear program; what solver do you use to solve this program? Yeah, we work with two solvers now. You could, of course, also use CPLEX, but we've used two open source ones. Right now their names don't come to me, sorry. HiGHS, yes, we switched to that one, and we had a different one before; both are possible. Those are even shipped with a Docker image, so you can just configure which one you want to use. We also use Pyomo as a representation for the problem, so everything that works with Pyomo, you can use that as well. Thank you so much.
OwnTech Project: An open-source generic reprogrammable technology suite for reimagining the energy ecosystem
So, okay, we got the hard task of being the first ones to speak, and so we failed. My name is Jean Alinei. I'm the CEO and co-founder of OwnTech, and today, with Luiz, we will discuss what we've done so far and what we are trying to achieve. We wanted to start with a general introduction of how we see energy, and how it could become more and more open source over the years. The idea is that we see it as a pyramid, with the base being the power hardware, and then levels of sensors, real-time algorithms and industrial informatics; higher levels for communication, how we dispatch information from these devices in the field, what protocols we use, how we dispatch energy among different power hardware; and then the highest level, which is simulation, optimization, modeling, forecasting and so on. Today is really exciting, because if we look at this session, we have plenty of amazing projects filling this pyramid, and eventually we can reach the point where we have the whole chain, from the power hardware to the modeling, the forecasting and the optimization, through all the complexity of communication and protocols. An interesting thing to note is that the time constraints in power hardware are not necessarily the same as those for modeling and simulating a grid, for instance. The complexity associated with these things makes the informatics different; these are different fields, from the embedded world to HPC and the modeling and optimization world. So there is an inherent complexity in the energy domain that is really interesting as a technical asset and a thing to explore. And this is why I'm really excited today: in this session we are combining simulation, communication and hardware, so it seems we already have all the bricks, and maybe tomorrow we'll build the pyramid.
So we, the energy people, have the power to change the world, and I'm really excited about that. I'll give the floor to Luiz now. Thank you, Jean. This pyramid is built with different bricks, and these bricks are hardware and software, like Jean just said. Hardware is usually hard, until it isn't anymore, until somebody comes along, bundles the hardware and makes it ergonomic, makes it easy to use. That's what Arduino has done, that's what Raspberry Pi has done, micro:bit has done it as well, and they have inspired us to do the same for power hardware. And that's what we have achieved. There's a box there with one of our circuits, and I'll pass it around a little later. We propose a community-based, compact, versatile, open source and low-cost technology for learning and prototyping power electronics. That's the goal, that's what we want to achieve. The idea is to create a technological sandbox, just like Raspberry Pi, just like Arduino: something that is standardized and simple to use, that can be used by academia for teaching, by industry for fast prototyping or for use in other applications, and by makers and fab labs to make fun stuff and burn it. And this is the place where we hope to foster new ideas and bring up new talents: people who are willing to build electric bicycles, people who want to build a microgrid, who want to understand how it works, put the bricks together and build the hardware upon which they can test their forecasting algorithms or test their models. Now, starting to get a little bit under the hood: how does power hardware work? If we look at it from the perspective of a functional analysis, the power is really the red arrow in the corner. And to get that arrow to work as we want, we have all these different arrows in the middle.
If we take a top-down approach: we did a simulation, which allowed us to do a forecast, which allowed us to calculate an energy management strategy, which we then send via dispatch through a protocol all the way to the target. And when it gets to the target, it comes in through the communication back door or front door. That goes into the industrial informatics and the control systems, which operate in real time, locked into a micro- or nanosecond-level loop. It also receives measurements from its own embedded sensors, but these are not normal sensors that we come and interrogate via LoRa once a week. These are sensors sending information at a one-megahertz bandwidth, which you are sampling every 50 microseconds, or sampling at a very, very precise moment. These, combined with the control algorithms in here, create the low-level electric signals which then go and trigger the power electronics so they work the way you want them to. And then the loop is closed and the thing works. There's a little dirty secret in the middle, never forget it: the energy has to come from somewhere, so if that little part fails, the whole thing stops. Everything kind of stands on the choice of the little component you made when you put it there. And what we did is take all this stuff and put it onto a board, where you have all the different blocks bundled together. But you don't have to understand it at that level of complexity unless you want to. You see the communication coming in and the power going out; that's it. And that's the idea. We have two products: a power product, the Twist board, which uses the second product, the Spin board, that we'll get to in a moment.
The Twist board is a module which we can rack up together: we take several Twists, put them together, and that allows us to handle more power, since they synchronize and communicate with each other. It's a linear progression: the more Twists we put together, the more power we can handle. We created a low-level communication bus which can talk CAN and RS-485, so we can talk at the millisecond, at the microsecond, and at the nanosecond with analog. So we have different bandwidths which we can dispatch through different communication methods and protocols. And we have the Spin board, which I'll let Jean present to you. So, eventually, in order to control power hardware this fast, you need a special embedded microcontroller, and this microcontroller has real-time constraints. It's not a regular Arduino or Raspberry Pi that will do the job; if you want good performance, you need really precise timers and really special communication peripherals. So eventually we came up with designing our own board, which is the Spin board. The Spin board is a piece of hardware that looks a bit like an Arduino Nano or a Raspberry Pi Pico. It has tremendous resolution for its PWM signals, the driving signals that will eventually drive the power stage, but also really flexible signal acquisition; it connects with the analog signals on the board. Eventually, microcontrollers are great only if they come with great ergonomics, and coding a microcontroller can become either a nightmare or a piece of cake, depending on the software and the IDEs you use to do so. So we wanted to comply with the maker movement mindset, where you basically take a microcontroller, plug it into your computer with USB, and start coding in seconds and minutes.
You don't have to install the whole toolchain and so on; everything is done by the IDE itself, without the complexity of setup. To do so, we use PlatformIO together with Visual Studio Code, so it's a really seamless experience for the developer. We also have a higher level of development that is possible with MATLAB, for simulation people who want to deploy control loops directly on the target; they can do so through a higher level of graphical coding, let's say. And underneath, there is something from the Linux Foundation, the Zephyr RTOS, that provides a framework on top of which we've built APIs. These APIs are calls that make things seamless for the user, so that you don't have to go through the hassle of the 2,000 pages of the microcontroller documentation in order to program the power hardware. You have high-level functions that relate to the power world: okay, what is the duty cycle, what signals do I want on that MOSFET; or directly related to the application: I want to increase the voltage, I want to decrease the voltage. So I can stay at my level of complexity, in the language I talk daily, and I don't have to go through documentation and things like that. We have different APIs. One is the microcontroller API: if you want to develop your own power hardware and control it through the Spin board, you can do so. Or you can directly call another API that is built for the power hardware that we also provide with the Spin module; this way you can call functions, not signals. Then there is a communication API, for synchronizing things with the surrounding world, and task APIs to say: okay, I want to dedicate this amount of time to this calculation, and that amount of time to communication or higher-level housekeeping. And then there is the user code, which is basically your main, as in an Arduino experience, let's say. So this is the pinout.
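The API layering described above can be illustrated roughly like this (in Python for readability, with hypothetical names; the real OwnTech APIs are C++ on Zephyr): the application speaks in power-domain terms, and a lower layer translates that into a duty-cycle setting for the PWM stage.

```python
# Conceptual sketch of the layered API idea (hypothetical names, NOT
# OwnTech's actual API): application-level intent ("raise the voltage")
# is translated into a duty-cycle change for the power stage.
class PowerApi:
    def __init__(self):
        self.duty_cycle = 0.5  # ratio 0..1 driving the switching stage

    def increase_voltage(self, step: float = 0.01):
        """Application-level call; maps to a duty-cycle increase."""
        self.duty_cycle = min(1.0, self.duty_cycle + step)
        self._apply()

    def decrease_voltage(self, step: float = 0.01):
        self.duty_cycle = max(0.0, self.duty_cycle - step)
        self._apply()

    def _apply(self):
        # On a real board this would configure PWM timer peripherals;
        # here it is a placeholder for the lower-level layer.
        pass

api = PowerApi()
api.increase_voltage()
print(api.duty_cycle)
```

The point is the separation: the user never touches timer registers, only the power-domain vocabulary.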
Of course, everything is open source: the hardware itself is under a CERN OHL-based license. The idea is to push people to share back their modifications, so that the hardware gets better and better over time. All the documentation is Creative Commons, and the interfaces and graphical tools are GPL. We also have a data viewer, something you can plug in to see the data live, as if you had a kind of low-bandwidth oscilloscope, just by plugging in your USB cable and gathering the data directly from the device. In order to make this happen, we've created a foundation under the aegis of the CNRS Foundation; the CNRS is the National Centre for Scientific Research in France, and it has put a ton of effort into making this thing a reality. We got a lot of support from a public lab in France, and this is where it comes from. The foundation holds the IP. So if you want to contribute to this project, everything will be under a dedicated foundation that has strict rules to enforce the open source nature of the project forever and ever. And then there is a startup that provides the hardware, because if you want to develop things, you need someone able to provide the hardware quickly. On the foundation side, we create tutorials, content and MOOCs, and we make them available online; we've created an online space for that. We also coordinate a small embryo of a community at the moment, but we hope it will become more vivid, with international collaboration around these fields of power for energy. And we are starting to organize training sessions and events to answer local needs; the idea is to spread out and make things decentralized, in a way that everyone can tackle their energy needs with this kind of Arduino-for-energy thing.
So to give an example of a first use case, at the moment we are working on a fully open source e-bike, and in this e-bike you have inverters, battery chargers, a BMS system to monitor all the cells of the battery, and a converter as well for the PV panel on the roof. So we are collaborating with other great open source hardware projects such as Libre Solar and Vhélio, and we are aiming at replacing all the closed source converters inside this e-bike and making it fully open source from A to Z, from the smallest piece of electronics to the frame of the bike itself. So yes, that's it for me, and hopefully Luis will be able to make a demo in five minutes. Yeah, maybe we can combine that with a question. And how much? Sorry — can we buy the boards, and how much? We have started producing; we have our own pick-and-place machine, so everything is made and assembled in France at the moment. We have shipped our first eight boards to a university in France for students. They haven't destroyed the boards yet, so it's a good sign. And we have pre-orders at the moment, and to give an insight into the price: at the moment the power module is 300 euros and the microcontroller is 45 to 49 euros. Can it be used in a fault-tolerant architecture? So yeah, to answer that really fast, maybe I will come back to that slide. One of the strengths of the modular approach is that we've put a lot of effort into making different modules able to share power loads and share communication. And it's a good thing for fault tolerance, because if you go modular and a module fails, you can think of clever ways of replacing the faulty module with another module. Yeah, just one: an application is a completely energy-autonomous home, with wind power, small solar photovoltaic panels, also a bicycle with electric assistance and so on.
So also with low-voltage DC for computers and other things, and high-voltage AC, and also taking into account the time of day, battery charging with lead-acid, battery charging with lithium, something like that. So definitely, off-grid applications are key, and also energy independence and so on. At the moment the module that we've developed is DC based, so it's DC to DC. It has a really wide range of operation, from 90 volts down to 10 volts. So it complies with all battery technologies: 12 volt batteries, 24 volt batteries, 48 and, like, 86 volt batteries. So it covers a range of battery applications, let's say. In the future our goal, for these kinds of grid applications and home energy independence, is to go for a microinverter basically, and this will be made by combining different modules. So this one is a DC module, and then we'll add an AC connection on top in order to cover these off-grid applications and energy independence. There is one in the back. Could you also create some BMS as open source? So we haven't developed a BMS, but that is already covered by the hardware from Libre Solar, I think. Hello, it's a bit of an implementation question. So you are using CAN bus for now. Maybe it's because the automotive world is using it. I was wondering if you were thinking about moving to something like 10BASE-T1S — I'm not sure you're familiar with that. It's kind of Ethernet but with the CAN topology, so multi-drop. So it's really nice and kind of microcontroller friendly, and IP based thanks to Ethernet. So I was wondering if you were thinking about it. So yeah, we thought about it, because Ethernet has great features, but it tends to be costly. The idea is to bring down the cost of the overall communication architecture a bit. Yet we are making things modular to the biggest extent, in a way that if you want to plug in a different way of communicating, you can do so.
You can access all the pins of the microcontroller that we have. Maybe in the future, I think, we will support different microcontrollers as well that will have more features and more peripherals, but at the moment it's not planned. We have two different things: CAN is for housekeeping and sending average data, and RS485 is for super fast communication. So we go at 20 megabits with RS485. It's a bit uncommon, but it makes it possible to have one control cycle of communication between different modules. So they can share one reference and a set point, but also measurements, among multiple modules, still at a 10 kHz control frequency for instance. No? Sorry, no demo. We are here the whole day, but the thing just crashed, of course. Of course it did. Demo effect. Demo effect, but I would like to just share something with you though. I can hear you online. But I would like to share something with you. Can we get into a... Yes. Yes. So we do have a GitHub, and what I wanted to show you is that on our OwnTech Foundation GitHub there is sample code, examples of the things that we have. And in the examples repository we have multiple different examples of how to use the Twist board in different applications: DC to DC, microgrid, AC. What I wanted to show you — the demo that I failed miserably to achieve — was the microgrid. So what is that supposed to look like if we get the peer-to-peer AC microgrid? We have the documentation: how to connect the boards together, and the communication that goes here. And these two boards then work together to share power. In this case it's a peer-to-peer exchange, so one board is drawing power while the other is supplying. And this is actually data from the board itself. That means that we can ask the power converter to sample data very quickly, keep it in its memory, and then we can retrieve it later. So we can do this kind of test where we get...
Every point is about five microseconds apart, so we can get a lot of resolution and see what's going on. It's offline because we do it after the fact, but it still works like that. And for the DC-DC side, same thing, we had the DC... Comes up. Okay. We have the different structures and different examples. They are there. So we invite you to go there and take a look at our GitHub. Take a look at the Spin board — it's there, it's in KiCad. The Twist board as well. And if you want to talk with us during the day, I have everything that I would need normally for a demo, and we can just sit down and do it together.
Enhancing OCPP with E2E-Security and Binary Data Streams for a more Secure Energy Ecosystem
Okay, welcome to my talk. We already heard a lot about the fascinating domain of e-mobility from the guys from EVerest. While EVerest is on the charging station side, this project is more about what happens behind a single charging station: all the back-end stuff up to the energy providers, distribution network operators and so on. And why is this important? Because in the energy domain we have a lot of safety and security regulations coming from the government. We somehow have to comply with them, because in e-mobility it's not only important that you have IT security, you also need to provide energy safety, because everything is connected via the grid. And so when too many people behave badly, we will have the next blackout. So nothing changed since last year, so I skip this because I only have 15 minutes. In the past, e-mobility was quite simple. We had charging stations on one side and a back-end on the other side, and they more or less communicated. In the last couple of years it has been HTTP WebSocket communication: this is the client, this is the server, everything is fine. But now the situation has changed a little bit. We no longer have a single charging station somewhere on the street; we normally have multiple charging stations at one location. So it's quite useful to have some middle box which combines the communication, so that you save money when you want to communicate with the back-end. This is nothing new. There are a lot of vendors implementing it; there are even specialized vendors for this. It's already in the OCPP standard, but not in great detail — it's just mentioned that you could do it. We want to dig deeper into this problem and see what we need to realize it. Next thing: when you have this middle box, it's very natural to add additional stuff to it. So you not only want to combine the communication channel, you also have specialized energy meters which are now located at the grid connection point.
So, monitoring the grid connection point: the idea behind this is that you can do local load management, because you have only a limited capacity on your grid connection but want to share it between the charging stations, and somebody must be in charge of how to share this energy. There are other projects who do the calculation for this, but this is the communication part. And here, for the first time, if you're German, you know this fascinating world of smart meter gateways, which is more or less specialized hardware from the Federal Office for Information Security in Germany, which regulates this area, because energy, as I mentioned, is a safety-critical infrastructure. So they try somehow to improve on the situation that most vendors don't care that much about security and safety. The first problem we have — because, as I said, we come from a very simplified view of this problem — is with the connection from the charging station to a back end. Because of limitations of the OCPP protocol, at the moment we duplicate every connection between this charging or communication aggregation box and the back end. This is not only a design flaw which nobody cares about, it's also starting to become more and more of a security problem, because the only security we have is HTTPS, so transport layer security — and in this box you have another transport layer security, so you have a split communication channel. So your IT security is no longer given, because this could be a man in the middle. It's getting even worse, because now we have specialized companies who sit in the middle between your charging stations or your aggregation box and your back ends, or even multiple back ends, and want to do analytics for you, because normally the charging station management operators or vendors just manage charging stations. They are not that much into analytics.
So very often these sit in the middle, and then you realize, okay, now the problem is getting more and more complicated, because people who are only interested in Excel sheets sit in the middle of your critical infrastructure, and maybe they not only analyze what you're sending — possibly Mr. Putin could be sitting here and sending commands back, because you have no chance to stop him. So the first thing we want to have, and this is also nothing really new, is to share these WebSocket connections. For this we need to adapt the OCPP protocol a little bit. There's already an internal draft of how you could do this — internal means internal to the Open Charging Alliance, which is the organization managing the OCPP protocol — but when you look closer at this draft, you see perhaps a couple of drawbacks. The first thing we obviously need is to add some additional routing information, so that we know we are sending from this box to that box, and my idea, or my proposal, is that we can do a lot of interesting things if we copy the good old concept of the Record Route option, which is also an optional IP version 4 feature, and so we can implement this in a much more user-friendly way. Next thing: in the OCPP internal draft we have more or less source routing, so the sender includes the path through the network in the request. This is well known, it's a valid way to do it, but it also has a lot of limitations, because when the network is changing you very often have scalability problems. So it's much more logical to use a normal routing table in every box. You can use the typical address learning that you know from Ethernet switches, which also learn which communication partners are on which port, and implement it more easily. Now, it's getting worse, because we are in a modern world. A charging station management system today is no longer a monolithic thing on a notebook somewhere in the Netherlands.
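The switch-style address learning proposed here — the aggregation box remembers which downstream connection each station spoke on, instead of the sender encoding the whole path — can be sketched in a few lines. This is an illustrative sketch only, not part of any OCPP draft; class and method names are hypothetical.

```python
# Minimal sketch of address learning for an OCPP aggregation box,
# analogous to an Ethernet switch's MAC learning. Illustrative only.

class OcppRouter:
    def __init__(self):
        # learned table: station identity -> the connection it arrived on
        self.table: dict[str, int] = {}

    def on_message_from_station(self, station_id: str, port: int) -> None:
        # Learn: remember which downstream connection this station uses.
        self.table[station_id] = port

    def route_to_station(self, station_id: str):
        # Forward a back-end request without source routing:
        # just look the destination up in the learned table.
        return self.table.get(station_id)

router = OcppRouter()
router.on_message_from_station("CS001", port=7)
router.on_message_from_station("CS002", port=9)
print(router.route_to_station("CS001"))  # 7
```

Because each box learns locally, a topology change only invalidates local table entries, which is the scalability advantage over source routing that the talk points out.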
It's a highly complex system of microservices, and these microservices are even from different operators. So we very often have complex systems where the asset management — which charging station is located where, coming from which vendor — is within an SAP database, then you have another database for all the real-time energy measurements, and so on and so on. So you realize, okay, now we have a bit of a problem, because we have a critical infrastructure, but in the back end we have a multitude of loosely coupled systems without much security. So the traditional OCPP security model is also no longer sufficient here. For this, very simply, it would be nice to have digital signatures. Again, there's an internal draft in the Open Charging Alliance, but this has signatures on the transport part of OCPP. So it's limited to OCPP, but it would be much more interesting to have them on the OCPP messages themselves, because then we can send end-to-end messages — and end-to-end means in this case from the EV to the energy distribution grid operator, or to the EMP, or to the smartphone of the driver, and so on. We will later see a lot of use cases for how to make use of it. When you want to have signatures, the next problem is, as usual, that you reduce the complex problem to a key management problem. So you need something like signature policies to define which signature is valid, which signature should I use, which signature should I verify. When you have these signatures implemented, you can extend them to user roles, because at the moment everything in OCPP is more or less one user. You have no differentiation like: this communication partner is only allowed to set energy commands, the other one can also change communication parameters, or whatever. This can be implemented using the signatures. And last but not least, at the moment OCPP is only using the text frames of HTTP WebSockets.
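The core idea — sign the message payload itself, so intermediaries can route it but not alter it — can be shown in miniature. A real design would use asymmetric signatures plus the signature policies the talk describes; here a stdlib HMAC over canonical JSON stands in just to keep the sketch small and self-contained, and the field names are invented.

```python
import hmac, hashlib, json

# Sketch: end-to-end integrity for a message payload. HMAC is a stand-in
# for the asymmetric signatures a real OCPP extension would need.

def sign(payload: dict, key: bytes) -> str:
    # Canonical serialization so both ends hash identical bytes.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode(), hashlib.sha256).hexdigest()

def verify(payload: dict, signature: str, key: bytes) -> bool:
    return hmac.compare_digest(sign(payload, key), signature)

key = b"shared-secret"
msg = {"action": "SetChargingProfile", "limitW": 11000}  # hypothetical fields
sig = sign(msg, key)
assert verify(msg, sig, key)

# A middle box silently raising the limit is now detectable:
tampered = dict(msg, limitW=22000)
assert not verify(tampered, sig, key)
```

Note how this only defers the hard part to key management — exactly the reduction the talk warns about, hence the need for signature policies and roles.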
But there are a lot of useful use cases for binary streams, especially when you look at firmware updates or log file downloads, because at the moment these are external HTTP requests. And this makes your network security more complicated. If you integrated this into the OCPP protocol, you could close down your network, only allow OCPP communication, and improve security. So nice, all these little details, but what are the real use cases for this? In Germany, since the first of January of this year, there's a nice new law that your energy provider can send you messages — we are a highly regulated infrastructure — to reduce the amount of energy you're using, because we have a lot of renewable energy and so on. But it's external, additional hardware. Why not use the existing infrastructure for this? The reason: because it's not secure and safe enough at the moment. Would it be secure and safe enough if we could perhaps talk to these guys and say, okay, look, we have now improved our infrastructure, why don't we remove this additional hardware? In the same law, we have the possibility that an energy provider can get your measurements. This is again a regulated use case. We could do this with our normal OCPP infrastructure. The same with charging tariffs, coming from e-mobility providers or someone else: they should also be signed, secure data which is immutable and then used in OCPP. The good part is that in the upcoming version of OCPP there will be some support for tariffs — not yet end-to-end signed tariffs, but at least halfway there. Then there is this interesting use case where you want to pay for your charging in an anonymous way, so you don't have an account somewhere, but you pay with your smartphone. In the regulation, they are talking about QR codes.
Wouldn't it be interesting to use this QR code to get something like a direct communication channel to this charging station, over all this complicated infrastructure, but secure, so that you have something like a remote control? Because nobody stands, not even for 20 minutes, in front of the charging station just to look at what's happening. They want to have it on their phones, but for this you need a secure channel. The same idea, but for another user group: the charging station operators and the energy people also often don't know what's really going on at the charging station, because the content sent over the wire is very limited at the moment. They use a lot of AI to invent what might be happening at the charging station, but in reality it would be much nicer if we had something like this digital twin idea: just send everything that is important to somewhere it can be analyzed. But again, we have no secure infrastructure in the middle, because every shitty marketing company could manipulate our data. Then the German calibration law — that is my favorite topic, but we had this already last year. We have national contact points who want to collect all this data and statistics about your charging station infrastructure, how good or not good it is. No security, no privacy at the moment. The same problem as usual. The really biggest problem — yes, more or less last slide — is that this is on the street, so there's no physical access security here. So even when we have encryption and signatures, we cannot be sure that somebody isn't sending us a lot of crap. Okay, it's a bit harder to manipulate a lot of charging stations on the street, but if you're Putin, you would probably try it anyway. So how, or what, can we do to analyze here whether this is a valid request or valid information or not? And I try my best to get this into the OCPP standard, but at the Open Charging Alliance we have the normal problem.
There are many leeching companies and not so many really contributing companies. So if you find this use case interesting, if you think this is interesting for you, for your company, for whatever, feel free to contribute to this project, feel free to become a member of the Open Charging Alliance, and help us get it out on the street. Thank you so much for your presentation.
CitrineOS
Hello? All right. I'm going to start my talk. My name is Christian. I'm a software developer at a company called S44. We make software for charge point operators and mobility service providers — so basically the cloud side of the EV stuff. All right. And today I'm going to talk about an OCPP implementation. And the clicker doesn't work... it worked a second ago. All right. So if you take a look around at chargers and charging networks, what you'll often find is a broken charger, a charger with a black screen, and especially payment terminals saying oops. I found a study from 2022 which said that in the US less than 75% of the chargers were working; when users came up, they couldn't get a charge started. So now governments have gotten involved, right? There are uptime guarantees in the UK, and the US NEVI funding also relies on uptime guarantees. If I remember correctly, I think the AFIR also has an uptime guarantee, but I'm not 100% sure. And the most recent thing I found for the US — and the company I work at is mainly US based, so that's why there's a little focus there — is that in 2023 broken chargers were like the major concern for users using public infrastructure for charging. And then, maybe most importantly, Reddit users are super unhappy. I think some subreddits even banned talking about broken chargers because they were really annoying. And I'm going to click. All right. So one thing that we found, or our thoughts on why this happens, is a lot of proprietary implementations. So you can see DALL-E's interpretation of proprietary OCPP stuff. So if you're not Tesla, which owns the entire vertical, right — they know what's happening at the charging station, in the car, and in the cloud — then what do you do? Well, what happens right now is there are a bunch of different vendors. Wherever you sneeze in the EV charging cloud stuff, there's a different vendor.
And most of them don't really share what's happening under the hood, which results in a bunch of uncalled-for behavior — unknown what's about to happen, especially later in the field when it's a user interacting with something and you don't have known input. Then of course we have OCPP 1.6, which leaves a lot of stuff up to the imagination as to when which message should be sent. And then maybe the CSMS thinks, well, I'm expecting an ID token now, but gets some other message. But one thing that I think is one of the biggest problems with OCPP 1.6 is around monitoring. Right now, each hardware vendor builds in their own obscure monitoring messages. And if you want to integrate with, like, five different hardware vendors, well, then you have to work out how to understand all five different messages that basically mean the same thing. That leads to broken parts in the field with no one knowing about them, which then leads to Reddit users being angry because the charging station has been broken for like a week and no one really noticed. Thanks. All right. So what can we do to improve the state of things? Well, OCPP 2.0.1, I think, is already a huge step in the right direction. You can see DALL-E thinks so as well: OCPP 2.0.1 winning strongly. One thing that I really like about OCPP 2.0.1 is that it has a lot of use cases, it's super structured, and you can build your test cases on them. And then of course there's much more monitoring around the device model, which helps in identifying that something is about to go wrong with the charger, instead of just: it's broken. But that still doesn't help with transparency. So if everyone just reinvents the wheel once again, just like with 1.6, well, you're still going to run into different interpretations. So we think there should be something that's open source, that's transparent, where you know what's happening under the hood.
And we hope that with something like that, there is better cross-compatibility between different vendors, and CSMSs can easily integrate with a bunch of different hardware vendors. And next one. All right. So we looked around and didn't find something that we were super happy with. So we came up with the project CitrineOS. It's open source and written in TypeScript — I know in this room that might not be the most popular choice, but on the internet it is, so that's why we went with it. It runs on Node. We have an API-based modular architecture. So, similar to what Achim was saying, there are some microservices, and you can set it up so that, for instance, transactions is super scalable, but maybe provisioning is not as needed. It's released under the Apache 2 license. And most recently it's been adopted by the Linux Foundation Energy, and it's in their hands now. Yeah. So in general, we think OCPP shouldn't be something that everyone works on again and again, but a stable cornerstone that you can adopt, that you can drop in where you need it. Because the messages are there, the protocol is really well specified, and redoing the same thing — well, I can spend my time better. So, taking a quick look at what we envision for the system architecture and how it works right now, going from the left to the right: charging stations connect via WebSockets to the central system. That helps us with scalability; you can have a bunch of different instances of the central system that manage the individual chargers. Then we publish on a message broker. What was important to us is to keep our underlying technology kind of agnostic. So you can set up Kafka, you can set up Pub/Sub, whatever you want. Same with the in-memory cache: you can use Redis as the in-memory cache — at least that's what we've implemented for now. And then you can adapt whatever interface you want.
And for relational databases, right now we have it hooked up to PostgreSQL, but you can set up whatever relational database you want. Then down here comes the maybe more interesting part: our modules. Like I mentioned, transactions is a big one — most of the bandwidth goes there — so we set up the modules based on how much we think they're used. One second, one back. One thing I forgot to mention is that we use Fastify as the web framework to interact with our setup. All right. So, looking one step further under the hood, we have JSON schema generation: we take Part 3 of the OCPP spec and use it to validate all incoming and outgoing messages, and we generate our TypeScript interfaces out of that. Then for the implementation of the modules, we work a lot with decorators and metadata on which decorator is used for which message, and that's how we route the messages within the modules. And then one thing that I think is quite nice is that we have some OpenAPI documentation that's generated, and you can easily try out some OCPP messages from the REST API. So you can either use the generated API docs and click try, or use Postman and just straight-up send OCPP messages that then get forwarded to the charger, and our system does the interaction with the charger for you. All right. Then looking at a UI: right now we've hooked it up to Directus, which is an open source project that gives you a nice UI on top of a relational database, which helps with keeping it simple. But you can go crazy on it — you can build your own flows in Directus and do whatever complex things you want. For now, we have it set up so that we have a little testing setup with an app that we whipped up to try charging. Yeah. All right. So where are we at right now? A few days ago, we released the 1.0 version, which passes the OCPP protocol's test cases for Core and Advanced Security. We're quite happy that that's working.
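The per-message payloads are what the generated JSON schemas validate; underneath them sits the fixed OCPP-J wire framing over the WebSocket, which is simple enough to parse by hand. The sketch below (in Python rather than the project's TypeScript, purely for illustration) parses that framing; the example payload values are illustrative.

```python
import json

# OCPP-J wire framing: CALL = [2, id, action, payload],
# CALLRESULT = [3, id, payload], CALLERROR = [4, id, code, descr, details].

CALL, CALLRESULT, CALLERROR = 2, 3, 4

def parse_frame(raw: str) -> dict:
    frame = json.loads(raw)
    if not isinstance(frame, list) or not frame:
        raise ValueError("not an OCPP-J frame")
    if frame[0] == CALL:
        _, msg_id, action, payload = frame
        return {"type": "CALL", "id": msg_id, "action": action, "payload": payload}
    if frame[0] == CALLRESULT:
        _, msg_id, payload = frame
        return {"type": "CALLRESULT", "id": msg_id, "payload": payload}
    if frame[0] == CALLERROR:
        _, msg_id, code, description, details = frame
        return {"type": "CALLERROR", "id": msg_id, "code": code}
    raise ValueError("unknown message type id")

call = parse_frame('[2, "19223201", "BootNotification", {"reason": "PowerUp"}]')
print(call["action"])  # BootNotification
```

Once the frame is parsed, the `action` field is what drives dispatch — which is where the decorator-and-metadata routing described above takes over, and where the schema validation of `payload` happens.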
It's been working for a while, but we only got to release it recently. Right now under development are advanced device management and an advanced UI. We're also talking to a few other people about integrating payments, and in general we've generated quite some buzz with people that would like to add some modules or add functionality. So, moving forward from there, we're looking at ISO 15118 support. And hopefully in July — that's what we anticipate — we'll have the full OCPP 2.0.1 implemented. And then for the future, of course, similar to what Achim was saying, you can build your BI tools or whatnot on top. We hope that this is a nice interface to innovate on top of, and not that you have to hook yourself in as a man in the middle or something similar. And I'm really happy that so many people were interested in this topic. So maybe you also want to contribute. We're fairly fresh. You can find us on GitHub — the top right is a QR code to our CitrineOS core GitHub page. The first technical steering committee will happen on March 14th. So get involved, join, bring ideas. And we have a Discord server, so drop by and ask questions. Sometimes we're fast, sometimes we're slow in responding, depending on our workload. All right. Does anyone have questions? One simple question: we all know every vendor does its own shit. On the other hand, you generate everything from the JSON schema. So how do you implement extensibility? When an unknown message comes in, do you drop it, or can you handle it in a smarter way, knowing, okay, it's coming from this vendor and therefore I should interpret it somehow? So right now I believe we drop it. Our major test has been with EVerest, and they send normal messages. Am I in the wrong spot? All right. And for the detail on how it will be handled in the future, I'll get back to you on Discord for that. I've got to check with a few people on what's going to happen there.
So you said you can make an API call and send, for example, a start charging message to the charger. So you get the API call, you use Kafka or something, and then from Kafka it goes to the charging station? Okay, that's very cool. I'm also doing that. Yeah, exactly. I've seen implementations where they just write a flag into a database that gets polled all the time, and I think that's very ugly. I think message brokers are a very elegant solution. Yep, we agree. Okay. With message brokers and 15118 you have very strict timing. How do you ensure that your message broker is not too slow? I've got to punt on that one — I'm too nervous for that right now. I'm sorry.
Power Grid Model: Open source high performance power systems analysis
Hello everyone. My name is Nitesh. I work as a scientific software engineer at Alliander. I'm also a developer on the Power Grid Model project, on which I'm going to give a talk now. So it is a high performance distribution grid power system analysis library. Yeah. And the next slide. Oh. Oh. Yes. So in this presentation I'm going to cover: why do we need this project, and how did we come to build it? What does the library do? How does it perform compared to other solutions that are already available in this space? And how do we use it within Alliander, which is a Dutch DSO, for its own products and applications? There's also some talk about open source, since we are open source and we would like new contributors as well. In the traditional way, up until a few years ago at least, power system analysis within DSOs used to happen in this way: the electrical engineers would usually have some data files, and they would run the calculation in a GUI-focused software with built-in presets; we get only certain results, and then we make decisions on whether to add a new transformer, a new cable and such components to the grid or not. Whether the grid could handle more solar panels or more EVs was decided this way. But now, with the new smart meters and EVs and renewable energy, we have to do a lot more, and for that we have to have all of the data of the smart meters — which is a really huge volume — in a database, where our topology and electrical parameters also live, and then we cannot just use a preset calculation method. So we have to have some customization available there, and then we have to do the calculations in the cloud, because these calculations now number in the millions — we are trying to simulate an entire year of time series, for example — and the volume increases a lot.
So why did we decide to make this, and what makes a good power system analysis library? Around 2018, Alliander faced a problem: we were not able to do this using any of the open source software or the commercial software. We faced these pain points, and then we decided to make a library focused around them. So we needed a well-defined software API. That's because we want this calculation library to be part of a much bigger application which does a lot of things apart from just calculations, and we also wanted this library to be cross-platform and scalable so that we can use it in the cloud. And of course, since the volume is in the millions, high performance and parallelization were needed; otherwise you might have to wait a month or so to get results, which is not adequate, and if it's in the cloud it costs you money as well. That was in 2018, by the way; after that, Power Grid Model was inner source within Alliander. We had some applications in 2021, then we made it open source around 2022, and we have a lot of applications now, which I'll cover soon enough. What does the library do? It does calculations — specifically power flow calculations, state estimation and short circuit calculations — for both single phase and three phase grids. We have many algorithms with which we can do this, and these sum up the calculation functionalities in a really short way. We have a huge focus on the software side of the library because of the pain points that I mentioned before.
We have native shared-memory multi-threading, which lets us parallelize batches across as many cores as possible when we deploy in the cloud. The implementation is in C++, and the API for users is in Python if they wish to use it. It's well documented and quite stable, the binaries are available on PyPI and on conda-forge, and we support Windows, Linux and macOS, all three of them. Since just building the library is not enough, we also have to show that the calculations are actually correct. For that we validated the library against theoretical hand calculations at the start, then against Vision and Gaia, which are commercial packages, against PowerFactory, and against pandapower, which is another open source library. We then use these as references for each new revision of Power Grid Model: it's part of our CI pipeline, and a new feature that does not comply with the reference results will not pass. How does it perform compared to other libraries? Because yes, there are a lot of libraries in this domain, with more presentations about them today, and each one has its own strengths; the strength of Power Grid Model is its performance. The link to the performance benchmark is in the presentation if you wish to run it yourself. We compared it with pandapower and OpenDSS to get an idea of how it performs, and we found that it is almost 20 times faster than pandapower, which is a huge boost and really helps to get these calculations done much faster.
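The batch fan-out described above can be sketched in a few lines. This is only an illustration of the dispatch pattern: the real library parallelizes inside C++ with native threads, whereas here a Python thread pool and a trivial stand-in calculation just show the shape of it:

```python
from concurrent.futures import ThreadPoolExecutor


def run_scenario(load_profile):
    # Stand-in for one power-flow solve over a daily load profile:
    # here we just report the peak absolute load of the scenario.
    return max(abs(p) for p in load_profile)


def run_batch(profiles, workers=4):
    """Dispatch many independent scenarios across a worker pool.
    Executor.map preserves the input order in the results."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(run_scenario, profiles))
```

A year of quarter-hourly time-series scenarios is simply a list of such independent profiles, which is why this kind of batch parallelizes so well.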
Those were the symmetrical calculations; asymmetrical calculations are where Power Grid Model shines as well, because it started as a distribution grid analysis library within Alliander, where this was really needed. For asymmetrical Newton-Raphson we are around 100 times faster than pandapower, and compared with OpenDSS, using the iterative current algorithm, we are about four times faster. We have data conversions as well, because our native data model is not the best format for storage, so we have conversions to CIM and to other power system analysis software. CIM matters because it lets us integrate with other applications across this ecosystem. We currently use it in more than ten applications within Alliander, so it is a mature project at production grade. There are many applications: grid planning, automatic network design, monitoring, asset allocation and congestion management. Since I have some time: in automatic network design, for example, we try to forecast the effect on the grid of EV and solar panel growth over the coming 30 to 40 years. We simulate this, identify the bottleneck, add a cable, run the simulation again, and in this automatic way we design the whole network. That's what that application does. There are actually multiple congestion management applications as well. One is the active one, with which we do real-time congestion management: we take in the measurements from the previous 48 hours and predict whether there is going to be congestion in the coming 48 hours, taking into account any planned maintenance. And there is another type of congestion management, which I won't present here in detail.
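The real-time congestion check just described boils down to comparing a 48-hour load forecast against an asset's capacity. A minimal sketch (my own simplification, not the actual Alliander application):

```python
def congested_intervals(forecast_mw, limit_mw):
    """Return (start, end) index ranges where the forecast exceeds the limit.

    forecast_mw: predicted loading per time step (e.g. 15-minute steps over
    the coming 48 hours); limit_mw: the asset's capacity.
    """
    spans, start = [], None
    for t, p in enumerate(forecast_mw):
        over = abs(p) > limit_mw
        if over and start is None:
            start = t
        if not over and start is not None:
            spans.append((start, t))
            start = None
    if start is not None:
        spans.append((start, len(forecast_mw)))
    return spans
```

Each returned span is a window in which the operator would have to intervene, for example by asking a producer to curtail.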
That other type of congestion management assesses the measurements of the entire past year and estimates what the congestion will be in the coming year; based on that we might offer new contracts to our customers, because the grid in the Netherlands is highly congested right now. We have a lot of people waiting for new connections that we can't add yet, so Power Grid Model really helps in making all of these calculations. On the open source side: you can simply use the library and provide feedback; that's a great contribution in itself. Reporting bugs is really helpful too. You can also contribute to validation, by running the existing test cases I mentioned or by providing new ones to validate the library against. If you have an idea for improving the API, you can suggest that too, or you can add new algorithms and make the C++ code more efficient; that's also possible. We have a list of good first issues in the repository if you wish to have a look. We have a few partners: DSOs, a TSO, research institutions, universities and other open source projects. The DSOs do use these products; Alliander uses them for asset and study purposes and is also working to bring them into its operations. That's all from me. Do we have any questions?

Hello. Thank you so much, this looks really, really cool. I have one question. Hello, Chris Adams, Green Web Foundation. If I have a new project to build a big solar farm, or to put a 100-megawatt data centre somewhere, can I use this to model how I might integrate with your grid, to say "this is why you should let me build here", or "this is what the implication is going to be if we keep growing at this pace"? Yes, definitely. We do such calculations on our side; Alliander does it on its side to check whether it can integrate the customer.
On the producer's side, the producer does it too, to work out whether the investment is profitable, what the ROI will be in the coming years given what the grid looks like. That's definitely something producers do, and they use the model for that.

Hi, Peter Dudfield from Open Climate Fix, thank you for the talk. You said some other TSOs have used this; have you had any feedback from them on how they found it? Well, I said they are active partners, so they have not actually used it yet. They are TenneT, and RTE as well, who are looking at whether they could use this model. Some of the core features TSOs need still have to be added; that's one of the requirements from the TSO side. Once that happens, TSOs could use it too, but the focus is primarily on the distribution system analysis side.

In Germany the TSO can tell you "please reduce your consumption". Can I use your project for this calculation? Is it fine-grained enough, or is its scope just the complete DSO grid or a larger part of the grid? Can I use it for a single grid connection point, or just for larger parts? Let me check whether I got the question right: if you have a single connection point and you wish to use the library, the motivation would be to see whether it would be profitable for you, right? No: the DSO uses your library to calculate that tomorrow there's not enough energy, so it wants to tell some customers "please reduce your consumption tomorrow". Is the library able to calculate this for single grid connection points, so that I can really say, you and you and you have to reduce tomorrow? Or is it just for a large part of the grid, rather than a very narrow part? Now I understand; nice point. The library does not do that by itself: it just calculates the power flow results, the voltages and the powers.
One of the applications I mentioned, the active congestion management one, is where we tell customers to reduce their generation; we have certain contracts within Alliander to do that, but it's not part of Power Grid Model itself. Yes.
GridSuite and PowSyBl: an Open Source approach to develop advanced tools for grid analysis and simulation of power systems.
Okay. So, hello everyone. I'm Jean-Baptiste, and Geoffroy is here with me. We are from the software development department of RTE. Let me give some elements of context. RTE is the French TSO, the transmission system operator. We handle the grid from 20 kV up to 400 kV, so the high voltage, and we must provide electricity 24/7 for all customers and all inhabitants in France, and of course in Europe, because we have to cooperate. A particularity is that we are the asset owner of the grid, which means we are responsible for investing and making sure the equipment is fit to fulfil our mission as a TSO. We are also responsible for adapting the structure of the grid to ease the energy transition: we need interconnections, and we are adapting the grid to connect, for example, offshore wind generators. So we have many, many challenges in a fast-changing world. We have a new energy mix, with the big goal of carbon neutrality by 2050, which is a big challenge, and we also have network codes and regulations that bring drastic change we must adapt to. All together, it's a package where we have a lot of work to do in Europe. And now I will read this sentence, because it's very important: today's need is not to build a tool that answers present needs, but to build a tool that is capable of integrating tomorrow's needs quickly and efficiently. If you follow the classical tender approach to creating new tools, you write specifications, run a tender process, ask a vendor to develop, and this cycle can take something like four years. The problem is that we don't know what we will need in five years, because everything is changing very fast. So RTE's strategy to answer these issues is to use open source.
Geoffroy will present two tools that are based on open source: PowSyBl, which is one of the first projects started under the Linux Foundation Energy initiative, and then what we can build on top of PowSyBl. I'll give the floor to Geoffroy to present the tools in detail.

So, hello everyone. The first project is PowSyBl, which stands for Power System Blocks. Blocks, because it is a set of software components that serve as the foundation of many other applications; at RTE alone we have something like 15 projects based on at least a few components developed in PowSyBl. So what is inside PowSyBl? Many things, but first of all a way to model the power grid: we have a data model that allows us to build a grid model and use it, for example, to apply some change to the grid and study what the impact would be. We also have components for visualization of the grid, to be integrated into higher-level applications. It is also very important to be able to feed this data model with data, so for that we have converters from and to standard data formats. The most used and most famous one is CIM, the CIM data model, for which we have first-class converter support. Interoperability with commercial tools also matters a lot: two very widely used commercial tools are PSS/E from Siemens and PowerFactory, and we are able to import data from these tools into our data model. We also have converters from academic data formats, for example MATPOWER, which is widely used in research and science. And with this data model we are able to run analysis functions, for example power flow calculations and security analysis.
Security analysis, for example, is a nice function that lets you test the impact of a contingency: say we have a line loss, an outage on the grid, and we want to see what the impact of this outage is on the flows and on the voltages, to see if we get into trouble. We also have sensitivity analysis, short-circuit calculation, which is also very important, and dynamic simulation, that is, time-domain simulation; for this we are integrated with another Linux Foundation Energy project, Dynawo. PowSyBl is mostly written in Java, and it has been designed to be as light as possible: there is no dependency on a complex framework, nothing that dictates how you are going to use it in a higher-level application.

So, GridSuite. GridSuite is an example of a tool built on top of the PowSyBl components that allows people to carry out grid studies, and very different kinds of studies: from near-real-time studies, for example security analysis, to long-term development studies. For example, with this tool we can study the impact of connecting a new renewable generation plant to the grid, and check that everything is fine if we connect this generation at a specific place. This tool moved to production very recently, at the end of last year, so a few weeks ago, and we still have some very early users; the plan is to reach 400 users in the coming two years. This tool will replace an existing tool that has been at RTE for 15 years. We have a team of more than 20 developers, and it is a growing team. On the technical side, the stack we use is, for sure, 100% open source. It is a microservice-based, very scalable application, based on Java and Spring Boot, with everything exposed through REST APIs plus asynchronous messaging with RabbitMQ.
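The security analysis described above, testing each single-line outage (N-1) and checking for violations, can be sketched generically. This is my own simplification, not PowSyBl's implementation; the toy solver assumes all lines run in parallel between the same two buses so the load splits equally among the survivors:

```python
def security_analysis(line_ids, limits, flow_solver):
    """N-1 screening: for each single-line outage, re-run the flow solver on
    the remaining lines and report any line loaded above its limit."""
    violations = {}
    for out in line_ids:
        remaining = [l for l in line_ids if l != out]
        flows = flow_solver(remaining)
        over = [l for l, f in zip(remaining, flows) if abs(f) > limits[l]]
        if over:
            violations[out] = over
    return violations


def equal_split_solver(remaining, total_mw=1.0):
    # Toy solver: parallel lines between the same two buses,
    # so the total load splits equally among the surviving lines.
    return [total_mw / len(remaining)] * len(remaining)
```

With two parallel lines each limited to 0.7 p.u. and a total load of 1.0 p.u., losing either line overloads the other, which is exactly the kind of trouble a security analysis is meant to surface.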
On the storage side, most of the microservices are based on PostgreSQL and Elasticsearch. As it's quite difficult to manage such a distributed application with a lot of microservices, everything is deployed on a Kubernetes cluster. On the front-end, it is a web application using React.js, and we also use a bit of WebGL for high-performance rendering of the grid. An important issue we had is that our components are Java components, which is very convenient to integrate into a classical enterprise application, where the backend is often based on the Java ecosystem with Spring, Quarkus or some other framework. That is fine, but we also needed these components for high-performance computing and for the research and data science community, and most people in data science are in the Python ecosystem. So the question for us was how to use the same piece of code in these two ecosystems, how to share the code between Python and Java. What we have done is use another fantastic open source tool, GraalVM. GraalVM is made by Oracle, and it is several things, but the component we use is native-image, which allows us to compile Java code into native code. Thanks to this, we are able to build a C library out of everything we have in PowSyBl, and with this library we can build a classical Python extension module on top of the C library.

Some useful links: for sure there are GitHub repositories for both projects, PowSyBl and GridSuite. Let me highlight the two Slack channels; that is where we answer questions and discuss with the community. There is also an online demo of the GridSuite application, so if you want to test it, you can: we have an instance of GridSuite deployed in the cloud.
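The native-library bridging pattern described above, calling into a C shared library from Python, can be illustrated with `ctypes`. Here libm is only a stand-in for the C library that GraalVM native-image produces from the Java code; the real pypowsybl binding is a compiled Python extension module, not a ctypes wrapper:

```python
import ctypes
import ctypes.util

# Load a native shared library and declare the C signature we intend to call.
# libm stands in for the GraalVM-generated library in this sketch.
_libm = ctypes.CDLL(ctypes.util.find_library("m"))
_libm.cbrt.restype = ctypes.c_double
_libm.cbrt.argtypes = [ctypes.c_double]


def native_cbrt(x: float) -> float:
    """Call a native C function from Python through the declared signature."""
    return _libm.cbrt(x)
```

Declaring `restype`/`argtypes` up front is what makes the foreign call type-safe; the same discipline applies whatever native library sits behind the handle.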
You can connect to it just using, for example, your GitHub account. There is also a YouTube video if you want to see a live demo. This is a screenshot of the application; let me explain what you see. On the left side we have the data manager: starting from a case, from an initial grid model, we can create a tree of variants, of modifications, which allows us to test different changes to the network; for all these variants we can run calculations and analyses and then compare which one is best for us. On the right side you can see how we display the grid: this is a full representation of the French high-voltage grid. We have substation diagrams, like here, and what we call the network area diagram, a part of the grid shown in a nodal, bus view. Then we can run calculations, we have tables to see the data in tabular form, we have specific user interfaces to show the results, and so on. That's it for the presentation; if you have any questions, or if you want a demo of this tool, we can do it after the presentation, if you are interested in a more detailed view.

I was just wondering what format your network data is in, and whether you could, for example, take in the OpenStreetMap network data and run analysis on it. It's not complete, but can we do that? So, the base map here is not OpenStreetMap, it is Mapbox, but we can change the tile provider to use whatever you want. Here we have used a very light tile style just to have a better view of the grid, but we could use OpenStreetMap.

How do you make the link between the grid and the end-consumption devices or people on the grid? Do you go through a machine-to-machine communication system so that people stop consuming?
Or do you run advance polls in order to know the consumption within one hour, or within one month? So, this tool works on a snapshot of the grid. That part is done by other tools that sit before this one: we have the SCADA system, for example, which does the acquisition of the measurements and has a database of the grid model, and from that we take snapshots which can then go into this tool. Okay? I don't know if I answered the question. How do you handle the stress when, for example, the grid is about to collapse? Do you cover any cases with humans at the end of the grid? I don't know. Anyway. Okay, so we will answer that question later. Thank you.
LFEnergy SEAPATH - Easier Operations in Electrical Substations through Digital Twin Empowerment
So, hello everyone. I am Paul Le Guin de Carnaison, an embedded software engineer at Savoir-faire Linux, and today I'm going to speak a little bit about the SEAPATH project and what we bring to it. My company, Savoir-faire Linux, is based both in Montreal and in France; we are experts in embedded software and free and open source engineering, and we have been working on the SEAPATH project for the last couple of years. So what is the SEAPATH project? SEAPATH stands for Software Enabled Automation Platform and Artifacts THerein. What is it about? We are in a context of energy transition, as you all know, and these new energy sources come with a lot of constraints. The main one is the multiplication of distributed controls: we have more and more substations, and so an increasing need for data management inside them. The idea, then, is: how can we bring free and open source software into these substations? That is where SEAPATH comes in. As a quick reminder of its aims: the goal of SEAPATH is to develop a reference design for an industrial-grade, open source, real-time virtualization platform. Inside this virtualized platform we can run automation applications for our substations, share it between multiple application providers, and combine performance and safety. In a ten-minute presentation I cannot present SEAPATH in depth, but my colleague Erwan already did that last year at FOSDEM, so if you're interested, you can watch his presentation. The main idea of this talk is how we brought functional tests to the SEAPATH project. For this, let me take a simple use case. Here are the power lines you can see in the countryside: after a storm, a tree falls on your power lines and two lines touch each other, which is a serious fault in your electricity system.
So you have protection systems that must cut the current very quickly to avoid any harm to people or to the infrastructure. How can you get all this safety equipment with SEAPATH? Here is a very simple representation of how it works. First we have a protection algorithm that decides whether or not there is a dangerous situation. This algorithm runs inside a virtual machine, and this is where the SEAPATH project comes in, because it runs inside a SEAPATH cluster, on a hypervisor, and so on. On the other side we have hardware that does the monitoring of our architecture, and the communication between the SEAPATH cluster and this hardware is done with a protocol you all know: IEC 61850. It is a protocol carried over Ethernet that generates packets we call Sampled Values, and this is the traffic between the SEAPATH cluster and our hardware. So why did we need functional tests? SEAPATH, as you see, is designed to work on very critical infrastructure, namely power distribution, and if something goes wrong there, people and infrastructure have to be protected, because electricity is dangerous. In case of a failure, the safety protection must react as soon as possible, so we need a very, very low latency for the Sampled Values transiting through the SEAPATH cluster. And the power distribution in your country runs all the time: you have electricity at home all the time, so we are in a 24/7 context, and we have to ensure this latency stays as low as possible at all times. So we are in a deterministic system, where determinism is the primary goal. Now, this is a big infrastructure with expensive equipment, so maybe you are wondering how, in our labs, at our desks, we can simulate this whole chain simply.
That is the work we have been doing. Here I show a very simple scheme of how we reproduce this protection chain in our lab. The first piece is what we call the publisher machine, whose job is to generate the IEC 61850 Sampled Values. Then we have the SEAPATH cluster, which is composed of two parts: the hypervisors, which run the virtual machines, and the virtual machines themselves, which run all the software: an SV client receiver that processes the Sampled Values sent by the publisher, and a protection algorithm that decides, based on these Sampled Values, whether there is an issue or not. I show here a setup with two hypervisors and three VMs, but it could be a totally different architecture. What tools did we use? First, on the publisher machine, we use the pcap format. It is an ideal format because we can reproduce captured network traffic, for example what would happen on the electrical infrastructure with a 50 Hz electrical signal, and then replay it with some tools; here I use tcpreplay to send these packets with the spacing we want. We can use PTP packets to synchronize all of this, but keep in mind that it is not required: PTP is only used on SEAPATH when you want features such as VM synchronization and VM migration, but it is not an obligation. Then, on the SEAPATH cluster, first on the hypervisor side, we need very low latency. We do CPU core isolation: we dedicate some cores only to the Linux system running on the hypervisor and isolate other cores only for the VMs, and we also do IRQ and process isolation inside the Linux kernel, to make sure certain applications get priority and so on.
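The replayed Sampled Value stream has very tight timing. Assuming the common IEC 61850-9-2 LE protection profile of 80 samples per cycle (an assumption; other profiles exist), a 50 Hz signal means 4000 SV frames per second, one every 250 µs, and checking a replay boils down to measuring how far the inter-frame gaps drift from that nominal spacing:

```python
SAMPLES_PER_CYCLE = 80   # IEC 61850-9-2 LE protection profile (assumed here)
GRID_FREQ_HZ = 50
NOMINAL_SPACING_US = 1_000_000 / (SAMPLES_PER_CYCLE * GRID_FREQ_HZ)  # 250 µs


def worst_jitter_us(timestamps_us):
    """Largest deviation of the inter-frame gaps from the nominal SV spacing.

    timestamps_us: arrival times of successive SV frames, in microseconds.
    """
    gaps = (b - a for a, b in zip(timestamps_us, timestamps_us[1:]))
    return max(abs(g - NOMINAL_SPACING_US) for g in gaps)
```

In a functional test, the measured worst-case jitter would be asserted against a latency budget; the isolation and tuning steps described here are what keep that number low.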
We also did some BIOS optimization, which is very hardware-dependent, but there is a lot to do there: there are features, like the multi-EPF feature, that are very bad for determinism, and you have to disable that kind of feature. Then on the virtual machines it is much the same work as on the hypervisor side: all the CPU and IRQ isolation and so on. We use what is called PCI passthrough, which is a very interesting feature because it allows the VM to take the packets directly from the network interface of the hypervisor, and this brought good performance. Finally, we can also use SR-IOV, which can be used if you have multiple virtual machines, but keep in mind that even if it gives better results, it is an optional feature. Thank you for your attention, and please let me know if you have questions; I will answer them.

Have you got any examples of real-world adoption of this? So, you are asking if we have concrete deployments of the SEAPATH project. Currently, no, we don't have any concrete deployment, because SEAPATH is still in early adoption, but we have good opportunities for the future. It is not in production yet, if that is your question, but that is the goal of the SEAPATH project. If you don't have a grandmaster clock, what is the source of time? About PTP: in production you have to have a grandmaster clock, but for testing we just use Linux PTP as the PTP clock. Thank you.
OpenSTEF: Open Source Short-Term Energy Forecasting
Hi everyone. Thank you for having some patience with me; computers are not my strong suit, although I am in IT. My name is Sunita Rijder, I am the community manager of OpenSTEF, and I work at Alliander. So let's get into a little bit of background. Alliander is a distribution grid operator: we are responsible for the distribution of energy, both electricity and gas, in about a third of the Netherlands. I think we all know these kinds of graphs: this is the energy consumption at some place in the Netherlands. However, we have no idea what is going to happen in the future, and this is where OpenSTEF comes in. OpenSTEF stands for Open Short-Term Energy Forecasting, so instead of our question mark, we actually know what is going to happen. After this very short introduction, let me tell you what I'm going to talk about today. First I'll start with the challenges on the grid and why we actually need OpenSTEF; then I'll talk about OpenSTEF itself, of course; and finally I really want to discuss our recent developments and collaborations. So, the challenges on the grid. When everything was still good and easy on the electricity grid, it looked like this: on the left you see one big producer, a one-directional energy flow, and then our consumers. Fairly easy. However, due to the energy transition, as I think you're all aware, it now looks like this: very chaotic. On the production side we have distributed production from solar and wind, both on medium and low voltage and at our consumers; and on the consumption side, consumption has increased enormously. We heard a lot about EV charging today: those electric vehicles need electricity through the grid. And this is where our capacity issues start. So this is a map of the Netherlands, and I think you can all guess that red is bad: in the red parts we actually have no capacity available.
So if you want to start a company in one of those areas, we cannot connect you: you get no power from us, because we simply have none to give. But of course, we are all very clever people, so we have some solutions. One of these solutions is to shave the peak when we expect grid limits to be exceeded. In the left image you see a forecast of the load on, for example, a transformer. We see a very clear peak, and that is where our grid limits are exceeded. Our solution is simply to shave the peak: for example, if this is production, we just ask one of our solar farms to shut off for a little while. Of course they get money for this, but that's another story. And then this is the result: our grid limits are no longer exceeded and nothing breaks. Great. But to be able to do this, we need that left image: we need accurate forecasts, and that is where we have OpenSTEF. So again, OpenSTEF stands for Open Short-Term Energy Forecasting; let me explain a little bit more about it. First of all, what is it? Well, it's a complete software stack to forecast the load on the electricity grid; and since it is an energy forecaster, it could also do it for heat. It is automated machine learning pipelines: a step-by-step process, which is automated, to make a forecast. The dark blue boxes show everything that OpenSTEF can do, and I'll talk a little more about that. So what does the software look like? First of all you need a database; that is one thing you have to provide yourself. But we do have OpenSTEF-DBC, the OpenSTEF database connector, which is able to get all of your data from your database. Then we get into OpenSTEF itself. I already talked about pipelines; these are in the software overview, and they are part of the task orchestration. Then we have data preprocessing, which includes data validation.
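The kind of validation check just mentioned, spotting suspicious flat stretches in the measured load, can be sketched like this (a hypothetical simplification of mine, not OpenSTEF's actual validation code):

```python
def flat_line_mask(load, min_run=4):
    """Mark timestamps that belong to a run of `min_run` or more identical
    values; such stretches often indicate a stuck meter rather than real load."""
    mask = [False] * len(load)
    start = 0
    for t in range(1, len(load) + 1):
        if t == len(load) or load[t] != load[start]:
            if t - start >= min_run:
                for k in range(start, t):
                    mask[k] = True
            start = t
    return mask
```

The masked samples would then be excluded from (or down-weighted in) the training data so the model does not learn from broken measurements.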
So, for example, if we see a little flat line in your input data, we are able to filter it out. And then something very interesting: feature engineering. In feature engineering we are, for example, able to calculate the wind speed at the height of a windmill from the wind speed at ground level, and we are also able to calculate the lagged load for each timestamp. Then, of course, the machine learning pipelines, so there is some machine learning in there: we use open source models such as XGBoost to build our machine learning models. We are able to train, optimize hyperparameters and, of course, make a forecast; we are also able to make a split forecast with our DAZLS model. And finally we are able to evaluate our forecasts, store our model, and do some post-processing. So let's look at the methodology at a really high level. On the left we have our target load: this is what we actually want to forecast. Then we have some external predictors: our weather forecast, market prices, and typical profiles of companies and households. From these external predictors we can calculate our derived features; this is the feature engineering I just talked about. So we are able to calculate lagged loads for each timestamp, and also derived weather features such as the wind speed at the height of a windmill. Furthermore there is calendar info, because it really matters whether you are forecasting on a Sunday or at Christmas compared to a Monday. And then we can train a single model for all our lead times. Here you can see what the data looks like, for example: a datetime index with increments of 15 minutes, our targets, and the external predictors; you can also see that we have the Dutch energy prices in there. If we have multiple training horizons, we simply duplicate our data and use it for each training horizon.
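The two feature-engineering examples above can be sketched in a few lines. These are illustrative stand-ins: the power-law exponent is a textbook value for open terrain, and the exact formulas OpenSTEF uses may differ:

```python
def wind_at_hub_height(v_ref, hub_h_m, ref_h_m=10.0, alpha=0.143):
    """Power-law extrapolation of wind speed from a reference height
    (typically 10 m) to turbine hub height; alpha ~0.143 is a common
    assumption for open terrain."""
    return v_ref * (hub_h_m / ref_h_m) ** alpha


def add_lag_features(load, lags):
    """For each timestamp, collect the load `lag` steps earlier
    (None where no history is available yet)."""
    return [{f"T-{lag}": load[t - lag] if t - lag >= 0 else None for lag in lags}
            for t in range(len(load))]
```

Derived features like these, together with calendar information, form the columns of the training table alongside the raw external predictors.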
If there are questions about that, please ask me in the break; I don't have time to go into it in 15 minutes. With our trained model we can now actually make the forecast. And of course we want it to look nice, so we have this beautiful Grafana dashboard, which summarizes all the information you need for your forecast. Let's look into it. First and foremost, our forecast: the red line on the left is the load that has been historically measured, and the yellow lines are our forecast. Now, you see there are a lot of yellow lines; what do they mean? Those are actually the quantiles, so you have a certainty attached to your forecast. This can be useful: for a location where you are quite sure what your forecast is going to be, you can use one quantile, and for a location with a lot of factors you don't know anything about, another. Also very nice is our feature importance plot. Here we can see our lagged loads and some other features, and this is genuinely useful: for every location you can see which features are important for its forecast. For example, here we see radiation; I don't think it's readable for you, but it says radiation, so you know there are quite a few solar parks or solar panels behind, for example, this substation. Wind speed is nowhere to be seen, so probably no windmills in that area. So that was a really short tour of OpenSTEF. Let me see how much time I have left; six minutes, perfect. Okay: community and upcoming events. One of the main things that has really changed in OpenSTEF this last year is our community. Before, it was just Alliander, who created it, together with RTE working on OpenSTEF; and now it looks like this. Let me go over every company really quickly. Alliander, that's where I'm from; I've talked about that enough.
RTE has actually been working on OpenSTEF for quite a while, and they're ready to implement it very soon. RTE International just joined us this year; they have a very nice proof of concept and they're going to work on it further. Fudura has actually been using OpenSTEF for quite a long time. I've heard the term "leeches" a few times today; well, that was Fudura up to about a month ago. So we contacted them and they were like, oh yeah, we found some bugs, we fixed them, we can implement this. So they actually joined our community as of this year. Sigelman is still working on a proof of concept and seeing if they want to replace their own forecasting model with OpenSTEF. And Shell is working on OpenSTEF-dbc and seeing if they can use its method of data import. Now, I hope everyone feels like they want to try OpenSTEF. Well, you're in luck, because we are organizing a workshop. On Friday the first of March, from two to four, we are organizing a workshop, and I would like everyone who's interested to join. You'll get a better introduction to OpenSTEF and also a little bit more of the technical details. It will be virtual, and you will get a really hands-on experience: you get some example notebooks from us with exercises, and you can actually make your own forecast with OpenSTEF and see how easy it is. If you want to sign up, just scan the QR code over here. It will be very nice. I also have it on the next slide for people who are too slow. So if you want to know more about OpenSTEF, maybe even before you sign up for the workshop, we of course have our GitHub, website, documentation, etc. You're only one command away from using OpenSTEF. And if there's anything you want to ask, or you want to give some comments or anything, you can just send me an email or send me a message on LinkedIn. So thank you for your time and I welcome any questions. Who's running the microphone? I'll try to do my best; please be kind as I find the best path.
Hello. First of all, thank you so much, this was very interesting. I have no experience; I had never heard of OpenSTEF before reading about it on the FOSDEM website. I have one question about the data collection. Do you provide some examples or standards on how and where to fetch data? Because finding the data sources is very hard; I tried, I looked. Very good question; I think this is something that the community indeed struggles with. For the Netherlands we actually do have those sources, because we are using them ourselves; for other countries we are working to see if we can find some open data for everyone. But if you're interested, you can always send me an email and I'll see what we have. Yeah, great. Hi, it's Miné, I'm from Red Hat. So obviously I will ask the question about scaling this, right? How will you standardize and scale this? Because as a project it sounds super interesting, but how are we going to scale this to 49,000 substations or millions of smart meters at home? Very good question. This is actually something we're working on right now. We are deploying our OpenSTEF stack on Dexter probably any time soon and seeing if we can actually scale from that. Currently we have it scaled up to, I think, 100 substations. And if you're curious, we have a reference implementation on our GitHub and you can see all the information there on how we deploy this. Thanks. Yeah, yeah, sure. I have a question about the data sources. Is there any thought given to adding geographical information systems data into the system for the forecasting models? Because especially things like wind and solar radiation depend not just on the time of day and the wind speeds, but on the location itself. Great question. Yeah, actually our system just connects to the closest KNMI station; KNMI is the Royal Netherlands Meteorological Institute. So it's able to find the closest station to where you actually want to forecast. So it definitely takes the location into account.
We have a prediction job class where you can put in all of the information for your forecast, and in there you also put the latitude and longitude of your location. So it does take that into account. Question over there. Thanks for the answer about the geographic data, because I was thinking about an approach of just using cheap Raspberry Pi weather stations in Austria and distributing them across some locations to fetch the data, because I have the Google Weather API and the OpenWeather API or whatever as comparison values. And for the geographic thing, thanks for the answer. How would you connect that? Is this a plan of OpenSTEF? Did I miss this? Yeah, thanks for the kind of difficult question, because I don't know the answer. So I'll ask my colleagues who actually made this part of OpenSTEF, and I'll get back to you if we connect afterwards, so then you'll know. But it's very interesting to do with the Raspberry Pi things. Thanks.
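The nearest-weather-station lookup described in the last two answers amounts to a great-circle distance search over station coordinates. A sketch with a hypothetical station list (the names and approximate coordinates below are illustrative, not OpenSTEF's actual code):

```python
import math

# Hypothetical KNMI-like stations: name -> (lat, lon), approximate values.
STATIONS = {
    "De Bilt": (52.10, 5.18),
    "Schiphol": (52.32, 4.79),
    "Eelde": (53.12, 6.59),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def closest_station(lat, lon, stations=STATIONS):
    """Name of the station nearest to the forecast location."""
    return min(stations, key=lambda s: haversine_km(lat, lon, *stations[s]))
```

Given the latitude and longitude from a prediction job, this picks the station whose weather data feeds the forecast.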
OpenSCD: Everything Everywhere All at Once
So, yes, you can hear me. Hello, everybody, and welcome to my talk: OpenSCD, Everything Everywhere All at Once. My name is Tamar Schuss. I'm a software engineer and the lead of domain development at SprintEins, a software company in Stuttgart, in the southwest of Germany. And I would like to talk about... Hello, hello. Yes. Okay. First of all, what is OpenSCD? Just to give you context, a brief introduction to the history, how it came to be, and where it is today. Then the goal of the talk is to talk about the challenges we have as a community, which approaches we took, and which approaches we are thinking about. I'm just going to talk about the technical approaches today. So let's start. What is OpenSCD? It's an open substation communication designer; we're going to get into later what that really means. And it's also an IEC 61850 tool. I don't know if everybody knows that standard, but I'm going to explain in a few sentences what it is. It's a progressive web app, so it's browser-based, and it's also a platform; we think about it as a platform rather than an app. Okay. So, just again for quick context, probably everybody here knows, but for the recording: this is a substation, an electrical substation, and it converts high voltage to low voltage and vice versa. And IEDs, intelligent electronic devices, monitor and control the substation so everything works as it should. And the IEC 61850 I mentioned before is a communication standard that describes or specifies how these devices should communicate, or how you should design the communication between these devices.
So OpenSCD does something with this. How it came to be is that at OMICRON, our good friend Jakob Vogelsang first created a Java app, because he wanted to help his colleagues and his team create multi-vendor projects: every vendor had its own tool, and each interpreted the standard a bit differently. So Jakob tried to create something where you can agree on a software level, so not just on the specification but also on the implementation. Later on, Christian and Dinka joined the team and they restarted the project as a progressive web app, because they saw how hard it would have been to deploy and distribute a Java app to everybody. They saw the web platform as a nice way to distribute the software. Then the project started to grow: Alliander and RTE joined, then Transpower from New Zealand and TransnetBW, also from southwest Germany, and we joined with them and created a few plugins for them. And now I'd like to think that we are at the scaling phase. Just last year, a colleague from Alliander, Pascal Wilbrank, and I took over the maintenance of OpenSCD, and just last week we were accepted into LF Energy. So we are very happy about it, and we are looking forward to the onboarding process and getting to know all the other projects too. With scaling, of course, come the scaling problems I think everybody has: we have more interest in and more usage of the project, and we face a few challenges. First, to get back to the title: everything. What we see is that if a tool doesn't provide all the tooling to design substations, then people are going to just use other ones. And then we are right back where we were at the beginning: the specification may be interpreted differently by each tool, and these designs, these files, are not going to be as exchangeable as we would like them to be.
So what we see is that in order to be successful, we need to provide all the tooling, all the features that the users need. The other part is that we have to provide it everywhere, otherwise a standard couldn't really work. It's already bad enough that this IEC standard is not accessible to everybody for free; if even the software that uses it isn't accessible to everybody, then it's never going to work. So at least we are trying to change what we can: we would like to really make it available for everybody. And "all at once" means that, as you may know, in a multi-stakeholder project where everybody has their own deadlines, roadmaps and timelines, everybody tries to prioritize their own needs over the others', because it makes sense. This is also what we are facing with all the TSOs: everybody just has a different need. Not every problem is solvable with technical solutions, of course; we try everything out, but today I would like to talk about just the technical ones, because otherwise we would be here too long. One is web standards; it's really important to use them. We depend on them for flexibility and performance and, of course, long-term maintainability. Then the plugins: we have a plugin system. I'll get into these topics more deeply in a bit. The plugin system helps you customize for every use case you would like. And also the distribution; it's just one step further, so that you can have your own version of the whole system. So, web standards: how do they help us? As I mentioned, OpenSCD is a progressive web app, it's browser-based, and what we also need is offline usage capability, because not every engineer has an internet connection on site, or they would like to browse or design the digital substations on the go. So this is a really big point for us. And also, as mentioned, installing an app is not really possible,
especially at utilities and TSOs, because the IT departments just don't like to install apps. So providing it in the browser is a nice way: if you have an internet connection, you have it. And because it's a progressive web app, you only have to visit it once and then you have it and can use it at any time. The next one is custom elements, as a web standard. We use them for the plugin system and for a few other things. Why is this important for us? Because, again, it's a standard: if you can compile to custom elements, then we are fine, then you can create your own plugins. That leads to technology independence, because we don't really mind what you are using. For example, OpenSCD is mainly Lit-based, but at SprintEins we created Svelte-based plugins for TransnetBW; we just compiled them to custom elements and everything works fine. This is also really nice to broaden our perspective and, let's say, the developer pool, because no company has to stick to one technology. Every company can pick their own, or whatever they are best at or have knowledge of, and they can just use it. I'm going to show in a bit how easy it is. So let's dive into the plugin system. This is OpenSCD, and almost everything is a plugin. The menu points, for example, are all plugins. As an example, the Open Project plugin by default opens a file locally from your PC. But, for example, our sister project CoMPAS, also an LF Energy project, has re-implemented the Open Project plugin so that it opens files from a server. You can do this with everything else; of course, saving makes sense too. Then the next one is the editor plugin. This is basically the main content that you see in the middle, and also in the tab bar at the top where you can switch between the plugins. And the editor plugins are the plugins that can really manage the...
Yes? Oh yeah. Yeah. Thanks. So editor plugins can really manage and modify the design. And what you don't see, which is a good thing, is the validator plugins. By default we have the standard XML validators for the standard, but you can of course create validators that check for some semantic meaning. That means if you have, for example, a naming convention at your company, you can create a validator for it, and then it's going to tell you if the naming of a device is not correct. Right. So how can you create plugins? It's really simple, I think. It's just an unregistered custom element. That means, as you can hopefully see, it's just the standard way of creating custom elements. That's everything we need; we don't really need anything more, because this we can load and use. And basically in this function you can see almost everything we need. At the top, highlighted, you can see... okay, maybe it's too small, but at the top we create a custom plugin tag, a custom HTML tag name for every plugin. This is just to make sure that no plugin tags collide, and we do this by hashing the source. So you can have as many instances of your plugin as you want, if necessary; only the source URL has to be different. In the next step, we just load the custom element, the JavaScript file, and define the custom element with the already-generated tag name. And then we render the element, put it in the DOM, and give it a few props, a few attributes, so it has something to do. And the result is going to look something like this: you have OpenSCD and inside it this plugin with a hash-generated HTML tag. So this is another example. This is one of the plugins we created; on the left, it's just a small Svelte component that wraps around another component.
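The hash-the-source tag generation described here can be illustrated language-agnostically; this Python sketch mirrors the idea, though the exact scheme OpenSCD uses in JavaScript may differ:

```python
import hashlib

def plugin_tag(src_url):
    """Derive a custom-element tag name from a plugin's source URL.

    Hashing the URL guarantees that distinct plugin sources get distinct
    tag names (so customElements.define registrations never collide),
    while the same source always maps to the same tag. The prefix keeps
    it a valid custom-element name: lowercase and containing a hyphen.
    """
    digest = hashlib.sha256(src_url.encode("utf-8")).hexdigest()[:12]
    return f"oscd-plugin-{digest}"
```

Two instances of the same plugin share one tag; hosting the file at a different URL yields a fresh tag, which matches the "only the source URL has to be different" rule from the talk.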
And on the left we have this relatively small wrapper custom element, and the main thing it does here is basically bootstrap this Svelte component. And why is Svelte pretty good for this use case? Because it doesn't really have a runtime. So even if you have Svelte, so to say, in every plugin, you are not going to have anything too big, because it just compiles down to basic JavaScript. In principle something similar would also be possible with React, because React also bootstraps similarly to this. The only thing is that then with every plugin you would load React, actually the whole library with it, which sounds like a problem; but to be honest, once you load the plugin it is cached and you're not going to load it every time. So even with React it would be fine. So the last thing, I think: the distributions. One of the solutions we are trying out, and which already works, is that you can already deploy OpenSCD. You can just take it as it is today and deploy it on your own infrastructure. It is just a web app, so it's pretty easy. And it's yours. The other one is add-ons. We are currently working on providing building blocks so you don't have to use everything; you can use just some of it, and it's easier to recreate and modify. For example, the plugin system; there is a history system where you can undo and redo your actions; and also saving the project and editing. All of these you could replace yourself, and make it so that, for example, the editing doesn't happen in the browser but gets sent to the server, to the backend, and everything happens there. So this is what we are working on to increase the flexibility again. For the CoMPAS project it is necessary to create new add-ons; right now they use a fork of OpenSCD, but that's not the best solution, so we would like to provide building blocks instead, from which you can put together your own platform.
And what we saw is creating your own plugins: you can do it today at any time, and the nice thing is that you can load the plugins from your local PC, so nobody else can access them. Of course that's not the nicest thing to do, so you can also deploy them anywhere and install them in every distribution. So we already have a few distributions, and we already have a few plugins that we use everywhere, not even developed by the same teams. So it's always a nice way to use the work of others. Yes, I was a bit quicker than I thought. Maybe we have a few questions, but if you want to get in contact, we of course have the OpenSCD organization on GitHub, we are part of LF Energy, we have a website, and you can try out OpenSCD at opensid.github.io. Thank you. Is there a question in the room? We have plenty of time, 10 minutes, so if you want to ask a question, of course address it to Tamas, but other speakers are still in the room too; feel free to ask them an energy question here. But of course, priority to OpenSCD. It's post break, that's why. Everyone's a bit tired; we can understand perfectly. We can ask questions to the audience. Oh, that's nice. Let's jump in. Okay, IEC 61850, right? You said that. Apart from my explaining it, does anybody have any experience with that? Is that something you know of or not? Raise your hand. Okay. So who works in the energy industry? Okay, about half, I guess. Are you doing something with energy at home? Home automation, maybe? Okay, yes, of course, you said. Okay, yeah. And who teaches energy? Not industry: teaching. Oh, education. Higher education, primary schools, I don't know. Yeah, so many things. We had the OwnTech talk, of course, also. Has anybody thought of a question by now? Ah, there's a question. Great. You're a hero. I think it's a follow-on from your comments. I'm coming from the telecoms industry.
I think I see a similar problem: the community is not big enough. The energy community is not big enough to sustain these types of projects, and I think the telecoms industry sees the same. So is there a way that the projects can be widened so that their scope is even bigger, so there's a much bigger chance of getting a more sustainable community? Do you have an opinion on that? Yeah, for sure. As I mentioned in the beginning, what I talked about today is the technical approach. One point was that basically having a desktop app is not going to cut it, so you need a new solution like the web platform, where you can really distribute your software everywhere. And also, LF Energy and Alliander already do a great job of supporting the open source communities and using their projects; that's already really big. Beyond that, I think it's really hard to get amateurs, so to say, or hobbyists, into these projects, because the features we develop are not for a week or a month; the results are really long term. So until we really reach them, it could be years, even, in the energy industry, and in telecommunications I think it's similar. The technology in these industries moves quite slowly, so to say, or more slowly. So how can you maintain such communities? I think you have to get through the chasm. If enough people get to use the project... for example, I think we are just before the chasm, because Alliander uses it, RTE uses it and TransnetBW uses it. If you can get a few other TSOs on board, then probably we're going to get over the chasm, and the rest of the TSOs are going to see this is a nice project and maybe want to get involved too. So that's one way you can maybe grow the project; and how to maintain the project is of course through foundations and through the companies.
Indeed, I don't see it in this industry, because it's so specialized, and the closed source, or the closed nature, of this standard doesn't make it easier. I'm Dan Brown from Linux Foundation Energy, and you're exactly right: there are so many parallels between networking and energy. I would say networking in telecom is actually about ten years ahead of where energy is right now, believe it or not. Ten years ago nothing was software-defined and that sort of thing in the telecom space, and now it largely is. So we need to go through exactly that same transition. I'm not saying telecom is perfect by any means, and there definitely are not enough people in energy. So it's a matter of getting all of these traditional old-school suppliers on board as well, the vendors who have been selling proprietary black-box systems to the energy industry, to utilities, for years. They need to basically stop doing that and come to it with an open source approach, and so they need to bring in the resources; but we also need universities, we need researchers, we need government, we need the utilities themselves. So it's really a matter of community building and scaling, and it's, you know, not an easy task by any means. But that's why we're here, in the hope that some of you who may not currently be involved, who may be developers in other vertical markets or horizontal technology areas, may find this interesting and be inspired to, you know, come and join and start contributing to these sorts of projects. There's not, you know, an easy solution, unfortunately, but we're just doing everything that we can to keep building capacity. About the IEC 61850 market share, in terms of number of items: what part of the substation market does it represent? Meaning, of the intelligent electronic devices that are deployed, how many are compatible with this protocol? So I'm not the best person to answer that; I'm not an electrical engineer, right? I'm not sure.
So far what I get is that they are capable of it, so the IEDs, the intelligent electronic devices, are capable of it. I'm pretty sure, at least in the EU; I haven't heard that they wouldn't be, so yes. Any other questions? Maybe to complete what you asked about: over the last two days some of us were at the Policy Summit, organized by the European Commission, and we thought it was very important to make a big announcement on energy and the open source opportunity, because we all rely on energy; our future, our business, everything relies on energy. So if we can have funding, and if we organize through a foundation to coordinate the effort instead of scattering efforts here and there, I think we will find a great path to more and more contributors. Yes, you have a question. Can you please give them the mic? I just want to complement that; sorry if I stopped you abruptly. In my experience, in my research in software-defined power electronics, software-defined energy is much harder to achieve than software-defined data and signal, because there's a lot of current, a lot of power, a lot of issues with that, and different use cases require different types of converters and all that. So for me, one of the hurdles that we have as a community is that we need more open hardware as well. I mean, let's try to do some, you know, no-code with no computer; it's not possible. If we want to do software, we need a computer. And if we want to do power, we need a power converter. We abstract the hardware because eventually we want to, but there is a lack of hardware, and I think that's a very big brake on the process, because hardware is not only hard but difficult to abstract as well. We're going to get there. Thank you.
Sharing the operational cost of Europe's electricity grid: optimization and transparency through open source
Hello everyone, I'm Peter Mitri, I'm a software developer at RTE, the French TSO. Today I'm going to speak to you about two open source software tools that help us optimize and share the operational cost of the European grid. In the first part of the presentation I will focus on optimization: I will talk about what we call regional operational security coordination and remedial action optimization, and I will introduce the open source software called OpenRAO. In the second part I will talk about cost sharing through flow decomposition, and there I will talk about the open source software called Flow Decomposition. I'll try to keep as much time as possible at the end for questions; so yeah, I hope you have some. Great. So let's first talk about why we need to optimize the grid. I understood that many of you work in the energy sector, but some don't. We talked a lot about congestion management in the previous presentation, so here I'm going to try to set the scene and explain what a congestion is. As you may know, electrical equipment in the grid has physical limits; outside of these limits the equipment is not safe to operate. For example, a power line which transports electricity from point A to point B has a thermal limit. If we exceed this limit, if we transport too much power on this line, the line may heat up, it may deform, it may even catch fire, and of course that's pretty dangerous. So to help set the scene, imagine here that you have a small grid, or a small part of the grid, represented as three nodes. The nodes would be sites where consumers and producers are connected to the network, and between these nodes you have power lines, which are in black here. And let's imagine that you have most of the power production on the left side and most of the power consumption on the right side, so most of the power will flow from left to right.
Let's say that we have a consumption increase at the node here to the right. Then of course the flow will increase from left to right, and depending on the network's topology it may very well be asymmetrical: we may have more of an increase of the flow on the bottom part here. And we may find that the new flow on this line exceeds its limit. This is what we call a congestion. Of course it's not just a question of consumption and production; there are also incidents that can happen on the grid and lead to congestions. Here you have an example: if we lose the line that transports electricity from here to here, then most of the power will flow through this line, and this can lead to a congestion on the upper line. As a TSO, RTE has the responsibility to be robust to any potential incident on the network, so we have to do something about these congestions. What can we do? Fortunately we have what we call remedial actions. These are actions on the network that can serve one of two purposes. The first purpose is to redirect the flows on the lines; for those of you who work in the electricity sector, you may know them as topological actions, HVDC actions or phase-shift transformers. I'll talk about them in an example on the slide that follows this one. There is also another type of remedial action which acts on the injections; we call that either redispatching or countertrading. These are actions that change the power production plan of the producers. In general, the first kind of remedial actions, which redirect the flows, are called non-costly, because the only cost of operating them is the aging of the equipment; the TSO has power over these remedial actions. The second type of remedial actions is costly, because when we ask consumers or producers to change their injections, we pay them for their service. So to help set the scene, here is an example of non-costly remedial actions.
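The asymmetric flow split and the outage effect just described can be made concrete on a toy triangle grid with three equal-reactance lines A-B, B-C and A-C. This is a textbook DC-power-flow illustration, not the speaker's actual network:

```python
def triangle_flows(p_inject):
    """Inject p at node A, withdraw p at node C; all lines have reactance x.

    Power splits over the direct path A-C (reactance x) and the detour
    A-B-C (reactance 2x) inversely to reactance: 2/3 direct, 1/3 via B.
    """
    return {"A-C": 2 * p_inject / 3, "A-B-C": p_inject / 3}

def flow_after_ac_outage(p_inject):
    """If line A-C trips, the full injection takes the A-B-C path."""
    return p_inject

# A detour line rated at 80 MW is fine before the outage (33 MW for a
# 100 MW injection) but congested after it (100 MW) -- exactly the
# N-1 incident case a TSO must be robust against.
```

The same injection that was harmless in the intact network thus becomes a congestion once a single line is lost.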
Here in the example above we have the base case, where no remedial action is applied. Let's say that you have a congestion on this line. A first type of remedial action is the topological action: let's say that you can split this node here into two nodes. This will make the power flow equal on both lines, this one and this one, and then it will relieve this line here, and we would have relieved the overload, the congestion, on the network. Another type of remedial action is the phase-shift transformer. Let's say that we equip this line with a phase-shift transformer. This kind of equipment is able to shift the phase of the current on the line and so act on the active power flow, and so it can relieve the congestion on the line. The second family of remedial actions, the costly remedial actions, is maybe actually easier to understand. What we can easily do is call a producer at this node, a power plant, and ask them to decrease their production, and ask a power plant that is here to increase their production. Naturally this brings the power production closer to the consumption site, it reduces the overall flows on the network, and by consequence it relieves the congestion on the line. The key difference here is that power plants 1 and 2 get paid for their balancing service. The fact is that Europe's electricity grid is highly meshed, interconnected and synchronous: for example, if you have an incident in France, it is instantly measured in Romania. Thus the security of the network is no longer a national matter; it's a European one, a global one. So TSOs have to conduct coordinated computations to ensure that the European network is secure. This is why ACER, the Agency for the Cooperation of Energy Regulators, imposes on TSOs to conduct what we call regional operational security coordination.
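The redispatching arithmetic behind the costly remedial action above can be sketched with flow sensitivities (PTDFs): shifting production between two plants changes the congested line's flow in proportion to the difference of their sensitivities. The numbers and function names below are illustrative, not from the talk:

```python
def redispatch_volume(overload_mw, ptdf_decrease, ptdf_increase):
    """MW to shift between two plants to cancel a line overload.

    ptdf_decrease / ptdf_increase: flow sensitivity (MW on the congested
    line per MW injected) of the plant ramped down / ramped up. Shifting
    delta MW changes the flow by delta * (ptdf_increase - ptdf_decrease),
    so cancelling the overload requires
    delta * (ptdf_decrease - ptdf_increase) = overload_mw.
    """
    return overload_mw / (ptdf_decrease - ptdf_increase)

def redispatch_cost(delta_mw, price_eur_per_mw):
    """What the TSO pays the plants for their balancing service."""
    return delta_mw * price_eur_per_mw
```

For example, a 20 MW overload with PTDFs of 0.8 (plant ramped down) and 0.2 (plant ramped up) needs a shift of 20 / 0.6 ≈ 33.3 MW, and that paid volume is exactly what makes this family of remedial actions costly.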
In this process, TSOs must choose the best remedial actions on the European scale to implement in the network in order to ensure that it is secure. Of course it's a large-scale problem, so we can hardly do it by hand. That's why we need an automatic tool, called the RAO, or remedial action optimizer. The RAO has to choose the most optimal remedial actions in a given perimeter, and it also has to do so while minimizing the costs incurred by costly remedial actions. Using an open source RAO has many benefits. First of all, transparency, because we are in a European perimeter: what better way to be transparent about what the RAO does, and which costly remedial actions it selects, than to put its code in open source, given of course that it's well documented. It also serves the purpose of coordination, because when we put a tool in open source, different TSOs and different vendors from different countries can cooperate more easily. It also serves robustness, interoperability, reusability and time to market, because when a tool is used in many business contexts it becomes more versatile, more robust and quicker to deploy. At RTE we have developed an open source remedial action optimizer called PowSyBl OpenRAO. For those of you who may know it, it was called FARAO in the past. The journey started in 2019, but two weeks ago we made the move to PowSyBl OpenRAO, and we did this because we wanted to join the Linux Foundation Energy adventure: LFE provides a clear governance by which all contributors accept to abide, and it also provides a clear methodology to work more efficiently and in better harmony. OpenRAO is actually used internally at RTE, but also in many European processes. I talked about regional operational security coordination, or ROSC; OpenRAO is being implemented for the SWE region here, which covers France, Spain and Portugal.
It is already in operation for another process, which is called capacity calculation, on the Italy North region and on the Core region, which is actually the largest region in Europe conducting coordinated computations; it covers around a dozen countries. A few words about what our RAO can do. It's an optimizer, so of course it has to have an objective function: it can either minimize the worst congestion or remove all congestions in the network. About congestions, we can model flow constraints and we can optimize them; this is the example I talked about in the previous slides. We can also model voltage magnitude constraints and voltage angle constraints, but for now the RAO cannot optimize them, it can only monitor them. For remedial actions, we can optimize phase-shift transformers in a given range: if you give the RAO a range of possible tap positions for the phase-shift transformer, it will choose the most optimal one, the one that best reduces congestions over the whole network. It can optimize an HVDC set point, so it can change the set point of the HVDC to reduce constraints. It can also choose to activate or not activate some topological actions, for example closing a switch or opening a switch. It can optimize a subset of redispatching remedial actions; redispatching remedial actions are actually pretty complex, and OpenRAO only handles a subset, with strong limitations. It can also optimize a subset of shunt compensator actions, and for now it can only model countertrading remedial actions; we do not support optimizing them in the RAO. So, like I said, OpenRAO is used in a multiplicity of business contexts, so it is very versatile. There are a lot of ways you can use it by changing the input data or by changing its parameters, so if you need more information you can look on our website for all the ways it can be used. Under the hood, the OpenRAO software is licensed under Mozilla Public License 2.0.
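The phase-shift transformer optimization described above can be illustrated with a toy sketch (this is not OpenRAO's algorithm, just the flavor of the choice it makes): given a range of allowed tap positions and a linear sensitivity of each monitored line's flow to the tap, pick the tap that minimizes the worst margin violation across the network:

```python
# Toy PST tap choice; all flows, limits and sensitivities are invented.
base_flows = {"line_a": 120.0, "line_b": 80.0}   # MW at neutral tap
limits     = {"line_a": 100.0, "line_b": 100.0}  # thermal limits (MW)
# Sensitivity of each line's flow to one tap step of the PST (MW/tap).
# Note the tap relieves line_a but loads line_b: a trade-off to optimize.
tap_sens   = {"line_a": -5.0, "line_b": 3.0}
tap_range  = range(-10, 11)                      # allowed tap positions

def worst_margin(tap):
    """Largest (flow - limit) over all lines; negative means no overload."""
    return max(base_flows[l] + tap_sens[l] * tap - limits[l]
               for l in base_flows)

best_tap = min(tap_range, key=worst_margin)
print(best_tap, worst_margin(best_tap))   # tap 5 leaves a 5 MW margin
```

A real RAO solves this jointly for many PSTs, HVDCs and contingencies via a linear program rather than enumeration, but the objective, minimizing the worst constraint violation, is the same idea.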
It's hosted on GitHub and the code is written in Java 17. We use JUnit for unit testing, of course, and Maven for dependency management. We monitor the quality of the code on SonarCloud, and we're pretty happy with our figures. We publish the code through Sonatype OSS, and we rely closely on the PowSyBl libraries to be able to model the network and to simulate it, in particular for sensitivity computations and load flow computations. A specificity of the RAO is that we also use Google OR-Tools. I don't know if you know it, but it's an open source modeling library for linear problems developed by Google, and through it we can support a multiplicity of linear solvers. For now, for example, we have SCIP, which is an open source solver, and also CBC, but we can also support Xpress, Gurobi and CPLEX, which are commercial ones. As a side note, we have tested that OpenRAO is compatible with Docker, Jenkins, Kubernetes and Cucumber testing. So in conclusion, I'd be more than happy for you to participate in our RAO adventure, either by using it and giving feedback or by contributing to the project. The best way to join the adventure is to join the PowSyBl Slack team and then join the RAO channel. There is also a quick tutorial in Java on our website if you want to play around with the RAO. And if you want to know what the future of the RAO looks like, the roadmap is updated once per month and is discussed during the PowSyBl TSC, which you are free to join. I'm moving on to the next subject, which is flow decomposition and cost sharing. I'm going to set the scene with a small example here. Imagine that you have three zones; let's say there are three countries, A, B and C. Imagine that you have big power production in the north of A and big power consumption in the south of A. Then naturally you'd expect the power to flow from north to south, from producer to consumer, but in reality it's not so simple.
Only part of this commercial exchange, the power that is sold to the consumer, will transit through the internal lines of zone A; the other part will go through zone B, then through zone C, and then back into zone A to the consumer. So of course the consumer got the power they needed, but some of the power went through zones B and C. We call these loop flows, or polluting flows. The commercial exchange is simply the sum of internal flows plus loop flows. And we say that they are polluting because they transit through zones in which they are not consumed. As you can imagine, more loop flows in the polluted zone means more load on that zone's internal grid. It eventually means more remedial actions to implement, possibly costly ones, and this leads to more costs for redispatching and countertrading. In the Core region alone we have up to 3.7 billion euros per year of redispatching and countertrading. And of course loop flows are a reality: they are a consequence of the topology of the network. We can do nothing about them; we cannot eliminate them. However, we can compute them, and we can better share costs when we know where they come from. So ACER, again, the European regulator, defined a clear methodology for computing loop flows in the Core region, and this methodology is followed by a methodology to better share costs between TSOs. Of course, using an open source tool has all the same benefits here, and most of all transparency, because when we talk about sharing costs, we talk about TSOs having to share the bill, and being transparent is very important. At RTE we have developed a tool which is called PowSyBl Flow Decomposition. It follows the ACER methodology, and you have the documentation for it here. It has both a Java and a Python API. Under the hood it's almost the same stack as the RAO: MPL 2.0, developed in Java, using Maven, hosted on GitHub. It relies on PowSyBl for a lot of computations, notably load flow computations.
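The arithmetic behind the decomposition and cost sharing can be sketched with invented numbers (the actual ACER methodology is considerably more involved, and the pro-rata sharing key below is only an illustration): a line's flow is split into an internal part and loop flows from other zones, and the relieving costs are then shared in proportion to each contribution:

```python
# Decomposed flow on a congested line inside zone C (MW; invented values).
flow_parts = {"internal_C": 50.0, "loop_from_A": 30.0, "loop_from_B": 20.0}
total_flow = sum(flow_parts.values())                       # 100.0 MW

redispatch_cost = 1_000_000.0   # euros spent by zone C to relieve the line

# Share the bill pro rata to each component's contribution to the flow:
# zones A and B pay for their loop flows, C for its own internal flow.
shares = {zone: redispatch_cost * mw / total_flow
          for zone, mw in flow_parts.items()}
print(shares)   # A pays 300k, B pays 200k, C keeps 500k of the bill
```

This is why computing loop flows matters: without the decomposition, zone C would have no transparent basis for billing its neighbors for the congestion they cause.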
And most importantly, it's already supported in our PyPowSyBl API. That's it from me. Do you have any questions? So maybe I wasn't paying enough attention: the purpose of your system is to allow, if something happens, like that incident on the Pyrenees a couple of years ago, the whole system to react appropriately. But you were showing that you're doing subsets of the computations. I didn't understand: in an emergency, presumably everyone needs to do something right at the same time, the whole network, however far the effect propagates. So what happens in an emergency, versus whatever you were showing on the screen with computations on various regions? This is not really an engine that is supposed to help decision making in real time; it's supposed to be used as an optimizer for the grid. For example, in regional operational security coordination, TSOs have a photo of the grid in the day ahead, so 24 hours before real time. We merge the whole grid models of the different TSOs, we conduct load flows, and then we see if there are any congestions. If there are, then we run a remedial action optimization, and the optimizer tells us: okay, I found these non-costly remedial actions and these costly remedial actions that will make the network secure. 24 hours ahead? 24 hours ahead, and also during the day, but it's not supposed to tell the operator which remedial action to choose in real time. And this is really separate from balancing. If we go back to the example where I showed something that resembles balancing: every time we change production somewhere, so if we decrease the production here, we have to increase the production there, because when we handle congestions we cannot change the balance of the network. The balance between supply and demand is handled in another process.
Hello, I have a question about how much resolution you need to see into each of the grids in order to actually do some of this. Could you talk a little bit about the visibility that's required at the TSO level or beneath it, for example? It depends on the process. In regional security coordination, we look at the high voltage levels, so 200 kilovolts and 400 kilovolts, and basically all big production hubs are on these voltage levels. But this is a really generic remedial action optimizer, so we can generalize it to whichever resolution we need. Any other questions? Are there ideas to adapt the software for real-time congestion management, for DSOs or for other systems? Yes, some experimentation is underway for balancing, in order to be able to find curative remedial actions in real time. For now it's not in operation, but it's being experimented with. My question is about impact. Have you noticed that other European TSOs are using your software as well? Is that the goal in the end, to share among different TSOs at the European scale? For now we are the only TSO using the RAO internally. However, it is Coreso, the regional coordination centre, that is using OpenRAO for these three regions. And the idea of joining the PowSyBl project is also to be able to develop a Python API pretty quickly and to have more users among different TSOs. What kind of algorithm is used in OpenRAO? We have an optimization algorithm, a linear optimization algorithm. I have a few slides in the appendix for this; we can talk about it later if you want. But basically it's a search tree in which we optimize the topological actions, and after every topological optimization we run a linear program to optimize the linear remedial actions, the remedial actions that have a linear effect on flows, for example PSTs and HVDCs. How do you test it? How do you ensure that there isn't a bug that affects all OpenRAO instances running simultaneously?
With this, if it answers your question: we have a lot of input files and expected output files, and we use that stack with Docker, Jenkins and Cucumber. Cucumber is a framework for functional testing, so you write scenarios in the Gherkin language. You say, for example: given this input file for the RAO, then I expect that there is no congestion at the end and that this remedial action is activated. You write it in a very natural language, and of course there is code behind to run these things. Then we put that in Docker and Jenkins, and we run this every night on almost 500 scenarios. So every night we are sure that our main branch on GitHub is still solid.
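The search-tree-plus-linear-program structure described in the Q&A can be sketched in a few lines (illustrative only, not OpenRAO's implementation): the outer loop explores combinations of discrete topological actions, and for each candidate topology an inner step optimizes the continuous remedial actions, here a single PST tap chosen by enumeration in place of a real linear program:

```python
from itertools import combinations

# Invented single-line example: base flow, limit, and the effect (MW)
# of each discrete topological action on that flow.
base_flow, limit = 130.0, 100.0
topo_actions = {"open_switch_1": -15.0, "close_coupler_2": -10.0}
pst_sens, tap_range = -2.0, range(-5, 6)   # MW per tap step, allowed taps

def best_tap(flow):
    """Inner 'linear' step: tap that minimizes the remaining overload."""
    return min(tap_range, key=lambda t: max(0.0, flow + pst_sens * t - limit))

best = None   # (overload, chosen topo actions, chosen tap)
for k in range(len(topo_actions) + 1):        # outer search tree
    for combo in combinations(topo_actions, k):
        flow = base_flow + sum(topo_actions[a] for a in combo)
        tap = best_tap(flow)
        overload = max(0.0, flow + pst_sens * tap - limit)
        if best is None or overload < best[0]:
            best = (overload, combo, tap)

print(best)   # both topological actions plus a tap remove the overload
```

The real tool prunes the tree with sensitivity computations instead of enumerating every combination, since the number of topological combinations explodes on a real grid.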
Quartz Solar OS: Building an open source AI solar forecast for everyone
Welcome. I'm Rachel Tipton and this is my colleague, Zach Watts. Can everybody hear me in the back? I'm not used to talking into a microphone. Yes, it's good, okay. So today we're going to be presenting Open Quartz: building an open source AI solar forecast for everyone. I'm a full stack developer; I work for Open Climate Fix. I'm going to introduce myself and then Zach will introduce himself. I'm a career-change developer. Before working in climate tech, I was teaching English in France. I got a little bit tired of teaching 18-year-old French students the present perfect. So I decided... the French people, huh? Because it isn't. Yeah, because it isn't. It's not perfect. And I'm not a perfectionist. So I decided that I was going to channel my love for languages into learning code languages. I completed a boot camp about a year and a half ago, and that's how I landed in climate tech, and I'm quite happy. Zach? Thank you, Rachel. Yeah, I'm Zach Watts. If anyone's noticed, my last name is Watts, so I think I was destined to work in power or energy of some sort. I finished my masters in physics two years ago, where I was trying to make cells dance using acoustic sound waves. Then I kind of fell in love a bit with AI and joined Open Climate Fix about a year ago, where I do some of our machine learning implementation and data science. All right. So what to expect from our talk: I'll introduce Open Climate Fix, we'll talk a little bit about why solar forecasting is important to balancing a power grid and some of the use cases we use it for. We have a live solar forecasting service called Quartz Solar, and derived from that is the open source Quartz Solar model that we'll be talking about; Zach is going to present that today because he's worked on that model. And then hopefully we'll have time for questions at the end. And this is a sneak preview of the code that we'll have you run at the end of the presentation.
And we're hoping that the demo works, but we'll see. Open Climate Fix was founded in 2019. We're a London-based company; I'm based in the north of France, so getting to be in Brussels is kind of more my home territory. This photo is from the Sustainable Ventures office in London where we work. We're a nonprofit product lab developing open source solutions to decarbonize the power grid, and generating solar forecasts is part of that work. All right. So we see ourselves as, I'd say, a middle man, a bridge between ML researchers and the energy industry: we want to make our data available to researchers, and we want to make the research ML researchers are doing available to the energy industry. And how do we do that? All of our code is available on GitHub. We also have models and datasets that are available on Hugging Face. Does everybody know what Hugging Face is? I'm assuming this crowd does. Yes. Okay, we know what this is. A lot of the datasets are NWP data, numerical weather predictions, and to date, 500 people have signed up to download those datasets. So we like to say that we're making an impact in that way. We also make available the EUMETSAT data that we collect: we're connected to a live service, we get data from the satellite itself while we're generating our forecasts, and then we're putting that data into the Zarr file format and making it available to ML researchers. That data has been downloaded 16,000 times so far from the Google public datasets site. So that's a way in which we're having an impact. The data has also been used to forecast rain, to do rain predictions in Sweden, storm evolution in Taiwan; it's been used for a lot of different purposes. And most recently there was a graduate paper published on, I think it was day-ahead PV forecasting. All right, so moving on to why solar forecasting is important.
The weather is unpredictable. The sun doesn't always shine; the wind doesn't always blow. If any of you have listened to a podcast on decarbonization, you've probably heard that phrase before. Moving into the future, our power generation is going to be dependent on weather-dependent energy sources like solar and wind. In this chart you can see that by 2050 about 75% of the world's primary energy is going to come from renewable resources. The resources at the bottom are gas and coal; these are what are called dispatchable resources. You can burn X amount of coal and get X amount of electricity; you burn X amount of gas and get X amount of electricity. This is a basic concept that I'm presenting, but it's important to think about, because you don't have that predictability with solar or with wind, and that's where our predictions come in. So, does anybody know what this is, this image on the screen? I'm sure there's somebody who knows more about it than I do. Peter, would you? No? Huh? Somebody else? Anybody? Yeah, it's a gas-powered turbine, thank you. So again, this is a gas-powered turbine. I'm using it to introduce the idea of spinning reserves. A power grid, as we've seen, involves a lot of calculations; it's complex to balance a power grid. What we're doing with our work is helping power grid operators balance the grid by providing them with a PV solar forecast that indicates how much solar energy is going to be on the grid. If they don't have that information, what ends up happening is they have something called spinning reserves that they keep running. That spinning reserve is running at 50% capacity, and so at 50% efficiency, so you're actually burning fossil fuels just to ensure that there is electricity that could be generated onto the grid.
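The spinning-reserve argument can be put in back-of-envelope numbers. In this sketch (all figures invented for illustration, this is not how an actual system operator dimensions reserves), the operator only counts the solar output it can be confident about, so a tighter forecast directly shrinks the fossil plant that must be kept spinning:

```python
# Invented illustrative figures, in gigawatts.
demand_gw = 30.0
solar_forecast_gw = 4.0
forecast_error_gw = 0.5     # uncertainty band around the forecast

# The operator can only count on solar it is confident will show up.
guaranteed_solar_gw = solar_forecast_gw - forecast_error_gw   # 3.5 GW

reserve_without_forecast = demand_gw                    # assume zero solar
reserve_with_forecast = demand_gw - guaranteed_solar_gw

saving_gw = reserve_without_forecast - reserve_with_forecast
print(saving_gw)   # 3.5 GW of spinning reserve no longer needed
```

A better forecast shrinks `forecast_error_gw`, raises the guaranteed figure, and so cuts both balancing costs and the emissions from partially loaded gas turbines.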
If you don't know how much solar energy is going to be on the grid, it's more likely that you're going to have a greater amount of spinning reserves running at a given time. I'm just introducing this to explain how our solar forecasts are actually decreasing carbon emissions today, through our work with National Grid. Our main solar forecast is a national forecast run for National Grid ESO, the electricity system operator in the UK. This is a picture of the control room; if you've never seen one, this is what the National Grid control room looks like, and our national forecast is in operation there. So this is what a solar forecast looks like. You have the dotted line here: that's your forecast. And the solid line behind, where it says 11:30, is basically the history of the forecast itself. I'm just using this to show you the information National Grid is given; they're able to make balancing decisions based on this information. If they see that there's 3.5 gigawatts of energy that is guaranteed to be on the grid, then they can reduce the spinning reserves running at that time, and therefore decrease balancing costs for themselves, and diminish carbon emissions at the same time. The other model that we have in production is a sites model, and this is what the Open Quartz model is based on. This is a model that's not generating a solar forecast for the power grid itself or for an entire country; it could be for a solar farm or for a smart home operator. And Zach is going to tell us how it all works. Great, thank you very much, Rachel. So as said, we've taken a lot of what we've learned from building these larger, more complex models and distilled it down into a site model.
But essentially, when we're tackling a forecasting problem in general, we want to start by providing as much information as we can about the problem we're trying to solve. We start by providing a diverse set of historic solar generation data; that means we can capture all sorts of conditions that might occur across different locations. We then provide multiple numerical weather predictions. These are forecasts made by the large supercomputers of different countries, forecasting things such as cloud cover, temperature, rain and irradiance. Not all of these numerical weather predictions are equal; some of them have slightly different biases, so we try to incorporate as many as possible to capture that information. We also utilize satellite imagery. As Rachel said earlier, we've made that dataset public on Google datasets. It's really useful for near-term cloud formation; because it's a satellite up in space, it can take a picture every five, ten, fifteen minutes, so you have a higher temporal resolution of data going into the model, whereas the numerical weather predictions are run on quite resource-intensive, quite slow supercomputers with much lower resolutions. We also provide some topographic data about the terrain in which we're forecasting. And we feed all of this data into a machine learning model. If you've dealt with data on this order of magnitude, 60 terabytes of satellite imagery, you'll know some of the pains in creating batches and the slow processing times involved. Out of this, we're able to create a national, a regional, and an individual site-level forecast, which I'll be talking about today. So, as we said earlier, we've been doing some work with National Grid ESO, which started a couple of years ago; they were our first pilot project with our forecasts.
And we managed to generate a forecast which was three times better than their existing in-house forecast. That gives you a sense of the bar that was set when we started this: getting an error three times better. In this chart on the right, from one of our latest models, which we call PVNet 2, you're looking at mean absolute error as a percentage per forecast horizon. I've used this to demonstrate the value of using satellite imagery combined with these numerical weather predictions. The light blue line is what you get if we train the model just using the satellite imagery: you can see it's quite good early on, but the relative error increases quite a bit. Whereas just using the NWPs, which is this dark green line here, gives a very horizontal, consistent error. And by combining the two data sources, we get what I find a quite satisfying convergence, where the model learns to take the information it needs from both data sources. Moving on to our site-level forecast: just curious here, if you have solar panels, could you raise your hands now? All right, now keep your hand raised if you also have a battery pack in your house. Now, are any of you using solar forecasts in any way at all at the moment? You are, nice. So this is where we see the site-level forecast that we've open sourced being really useful. There's a bit of a shift going on: in the past couple of years, consumers and households are realizing that there are technologies available that can help them optimize their energy consumption. And it's not just the consumers; it's the smart home operators who are looking to participate in these energy flexibility markets. Now, as we've heard, there have been lots of really great presentations today about how to manage a grid.
The electricity grids really need a lot more infrastructure to be built onto them to meet electricity demand going forward, and one way operators are trying to tackle this is by increasing flexibility through things like smart home management. So one way this could be used: when a smart home operator has access to many, many households, they can incentivize households to turn electricity up or down during different times, and this provides flexibility to the grid. From a consumer perspective, you might have an electric vehicle and want to charge your EV at the times of lowest cost to you, which is when you have solar generation. So you can look at a forecast and say: I want to drive my EV tomorrow; it's really sunny today and really cloudy tomorrow, so I'm going to charge my car up fully today, and then I can drive it tomorrow at the lowest cost to me. So we see this being used by smart home operators; we're already speaking to a few startups in this space who are trying to integrate this into their smart home optimization systems. Also experts in battery optimization, researchers and academics, and general hobbyists who might want to incorporate solar forecasts into their own setups. To create this model, we've used a dataset of over a thousand UK household sites, which you can see on the right here. And we've trained quite a simple model, just a gradient boosted tree, which essentially tries to separate the data into different buckets. This is quite a crude example, but say the cloud cover is less than 25%: you might predict 100% PV. If not, then you try to create another branch that will split the data up further. And by using a wide range of different sites spread out all across the UK, we're able to forecast anywhere in the UK.
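The bucket-splitting idea the speaker describes can be written out by hand (thresholds and predicted values below are made up for illustration; a real gradient boosted tree learns hundreds of such trees from the data rather than one hard-coded one):

```python
# A single hand-rolled "tree" in the spirit of the example above:
# thresholds on features split the data into buckets, each bucket
# carrying its own PV prediction (as a fraction of site capacity).
def predict_pv_fraction(cloud_cover_pct, hour):
    if cloud_cover_pct < 25:
        # mostly clear sky: split further on time of day
        return 1.0 if 10 <= hour <= 14 else 0.6
    else:
        # cloudy: split further on how overcast it is
        return 0.3 if cloud_cover_pct < 75 else 0.05

print(predict_pv_fraction(10, 12))   # clear sky at midday -> 1.0
print(predict_pv_fraction(90, 12))   # heavily overcast    -> 0.05
```

Gradient boosting then stacks many small trees like this, each one correcting the residual errors of the previous ones, which is why such a simple base learner can still give a competitive forecast.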
So we can now plug in the specific latitude and longitude information about the site we want to forecast for, and forecast anywhere, hypothetically globally as well, depending on what data we have available. This brings us to Open Quartz, which is the open source solar forecast we're presenting here today. It uses open NWPs. There are two primary open ones (there are a few): the GFS, which is the American Global Forecast System, and ICON, which is created by the German weather service, DWD, and is widely regarded as the most accurate free-to-use weather model. We take things such as cloud cover, temperature and visibility, and we pull this data from Open-Meteo, using the pre-trained model that we previously showed. By doing this, we're able to create a forecast up to 48 hours ahead at a 15-minute resolution, and do all this in four lines of code. And we're able to get a pretty good error doing this: in comparison to some of our other models, which use slightly more up-to-date information, the error is not too much worse. Now, you might notice that there isn't satellite imagery involved here, and that's because this model is something that you can run on your own computer using our pre-trained model and by pulling the data yourself in just a couple of lines of code. When you involve satellite imagery, you need licenses and so on to have that data live; the public data store that we keep has a two-day lag, I think, on live real-time data. So we were going to do a demo, but we've had to do a last-minute swap of computers, so instead I'm just going to talk through this with everyone. But if you do want to do the demo, you can follow along. If you head over to our GitHub organization, which is github.com/openclimatefix, I've pinned the repo, Open-Source-Quartz-Solar-Forecast, so you won't have to type in that mouthful of a repo name.
And if you head to the example folder, there is an example notebook you can follow, which will lead you through creating a solar forecast. But essentially, all you need to do is pip install quartz-solar-forecast, as we have here. And then once you have that installed, these are the four lines of code we tempted you with at the beginning: first you import the function which we'll be using to run the forecast; next, we import the PVSite class that we use. We then create the site: in this case, we specify the latitude and the longitude of the specific house or site we want to forecast for, and then the capacity of our solar panels. Finally, we use that run-forecast function, passing in our site as an object and specifying a time for the forecast to start from. Using this time here, it would create a forecast starting at midnight that night, going out 48 hours from that point onwards. And what do the results look like? Well, this is where, had the demo worked, I would have clicked and a nice smooth graph would have appeared, but this is the kind of result we get anyway. We get our solar forecast, which looks as we might expect, peaking around midday. There are some bumps in the road here; this could be due to some clouds coming over, or a storm. And we've got our forecast from midnight out to 48 hours ahead. So hypothetically speaking, with the demo running, I could have shown you what it looked like exactly at this location here today, looking out over the next two days. But running it on my computer, it didn't look too great, and that's kind of reflective of, if you look outside the window today, it's a bit cloudy and not the nicest. So I'm going to pass back to Rachel now to talk about the roadmap. All right. So moving forward, the idea for the Quartz open source forecast is that other people can use it.
You could potentially input different types of data, so different NWP data or PV data could be fed in. And for anybody who wants to do a bit of ML experimentation, this would be a place to start. As a company, we're looking to build our community as an open source company; it's something that we're trying to put in place. So if people use the model, hook it up to an API or a database, and actually start generating a regular forecast for themselves, we'd love to know about it. So I don't know if we have any time left for questions, but yeah. Two questions: on the prediction, you can specify the capacity, but can you specify things like south-facing versus east-west-facing, that kind of stuff? And how does this contrast with forecast.solar, which provides a similar API for home users? Sure, thank you very much for the question. So, on providing features like tilt and orientation: that's something that we have built into the model, and it needs a little bit of a tweak to get it working. Originally, this model was based off a model that we have in production, which we run for a thousand household sites in the UK, and we found that the tilt and orientation data that is generally provided is not always that good or that accurate, because often with a solar installation the installer might have noted it down, but not very precisely. And when we ran experiments hard-coding the tilt and orientation versus letting a user specify exactly, we got slightly better results if we assumed it was perfectly south-facing and at 30 degrees. But that is a little tweak, and I think one of our open issues to work on. And your next question, about using another provider, what was it again, forecast.solar: I think what differentiates what we're doing is that this is something that you can run locally on your computer and do it yourself.
And we're also forecasting generation. I think a lot of these other APIs forecast things like solar irradiance, and then it's down to the user to interpret that irradiance value into a generation value. Maybe forecast.solar is different, but I think that's how what we do differs slightly, if that makes sense. How do you handle long-term solar weather and rare critical events, like volcanoes or dust storms, which can affect the yield for the solar panels? Yeah, so things like volcanic eruptions definitely do affect the solar yield, and a lot of the time I think that information is generally left out. The numerical weather predictions that we use sometimes try to capture that information; I did see some research papers on how they actually don't capture things like volcanic eruptions, and the researchers were saying we need to improve these models to capture things like that. One other dataset that we're looking to incorporate is aerosol datasets, which do include information like that. That is something which I think we're doing with some of our other models, and at some point, I guess, we'd like to do with this model as well, which should help capture extra information like that. Hi, thanks for the talk. I wanted to ask, what is the geographic extent of this? You're using models which might cover more than, say, the UK or Europe; or, if it's confined to the UK or Europe, do you have plans to expand it to a wider region in the future? Thanks. Hi, thanks for the question. This model in particular is dependent on the weather data that you have available. We're using ICON's global weather forecast, which essentially means that this model can be used anywhere in the world, because that forecast is a global forecast.
The only issue you might encounter is that, because the training set we've used is just for the UK, there might be some sort of bias towards UK household sites that we've not really looked into yet. So one of the things we want to do, to create a more robust global model, is to have a PV dataset which covers the whole world. We pushed this out very recently, and since then someone reached out to us from Indonesia who was testing it there; I think they got it working. So it does have global coverage. Some of our other models, which we provide as a product and service, are quite specific to the UK, but we're expanding to India at the moment and some other European regions, and that's mainly down to the satellite imagery data we have access to, because we're using the European geostationary satellite. So it's easier for us to build on that as it is at the moment. Thank you, everyone.
Can open source development drive energy transition? PyPSA-Earth experience
So, we have stopped somewhere between the regional and the global perspective. Let's go global. The energy transition implies that thousands of power systems around the world should be transformed at a pace which has never been seen before. While we know what the picture should look like at the global level, it is still a question how it should be translated to the regional level. What is special about this global-scale energy planning problem is that we should plan decades ahead under deep uncertainty. And we have quite an experience of energy policy failures: there have been quite a few cases when energy policy measures looked reasonable in advance but resulted in failures, didn't lead to the results which had been expected, and the programs had to be stopped. That is why we actually need large-scale energy modeling: we can replace this painful real-world experience of failures by playing with energy models. The obvious advantages of open source, open modeling and open data for energy planning have led to a rapid increase in interest in open energy modeling, and currently we have dozens of open energy models and a lot of open data sets relevant for energy modeling. But the picture is very incomplete and patchy, and there are regions of the world where we do not have even a net-zero plan, let alone an open net-zero plan. That is exactly the gap which we are addressing as an independent research initiative. PyPSA-Earth aims to provide every part of the world with an open, reproducible and accessible energy systems model. What we are doing can be divided into three blocks: we are doing open coding, we are working with open data, and we support the open energy modeling community. So, just a reminder about energy systems models. There are, I would say, power engineering models, that is, the tools which we have mainly discussed today, and there are also academic integrated assessment models.
Academic integrated assessment models relate to the whole world and model global-scale interconnections between economics, environment and energy. An energy systems model is the kind of tool which translates the results of such a global assessment into a plan of action on the regional scale, and obviously an energy systems model should reproduce the behavior of power systems in a realistic way. So this is what our workflow, our architecture, looks like. We have a data block, a modeling block and an optimization block. Processing is orchestrated by Snakemake, and probably the most trying part of the whole picture is the work with data. There are different groups of data which affect the operation of a power system, and there is also a quite trivial but very impactful point which relates directly to open data licensing. We provide a starter data kit with the model to facilitate getting started with modeling, and I think the most frequent how-to-start request is about loading this starter kit data. Many troubles are created by the fact that some licenses of open data sets do not allow redistribution or hosting of the data. So for some data we can collect the data set and transform it into the form which is needed for the energy system model to run, while for others we do not have the right to redistribute, and have to provide links to the sources and connect them with scripts to clean the data and prepare it in a format which can be used for energy modeling. That is exactly the link of the whole chain which breaks most frequently. So, open data in action. Environmental and climate data is the part of the data workflow where we are truly grateful to the open science community and to the geophysical community. That is the most unproblematic part of the whole workflow: we have a package which translates geophysics into energy-related parameters, and basically that's it.
Mainly it just works. As for electricity demand, here the biggest problem is data availability. What we need are hourly demand profiles for every country of the world, at least at the aggregated national level. The data exist, but they are not openly available, so we have a machine learning model which reproduces synthetic load profiles. We would be very happy to improve the flexibility and geographic coverage of this approach, and access to the original load profiles is currently the bottleneck for this group of data. Another part which is crucial if you are interested in modeling the power system of some arbitrary country is data on power infrastructure, especially on the grid. Here we have used the OpenStreetMap database and developed a dedicated package which extracts power features and allows us to prepare a model of the grid topology. Apart from that, we have packages from the PyPSA ecosystem which provide data on power plants and installed generation, and a data set which collects and curates data on technology costs, including forecasts for the development of technologies. So this is what the modeling workflow looks like. We take the preprocessed data for power infrastructure and simplify the topology, preserving the electrical properties of the original power grid, then cluster it to make the problem tractable. The next point is the most challenging from the perspective of open source, because open solvers are still outperformed by commercial solutions. There is some room for improvement here, and we are collaborating with the developers of open solvers to improve the situation. Now, once the workflow had been established, we had to ensure that it is actually possible to apply our model to every country in the world, in the most literal sense.
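The "cluster the network to make the problem tractable" step can be illustrated with a toy aggregation. PyPSA-Earth uses proper clustering algorithms on the real grid; this hand-made four-bus example with an assumed cluster assignment only shows the idea that intra-cluster lines vanish and parallel inter-cluster corridors are merged.

```python
# Toy sketch of network clustering: aggregate a 4-bus grid into 2 clusters
# and merge the capacities of parallel inter-cluster lines. All bus names,
# capacities and the cluster assignment are illustrative assumptions.

from collections import defaultdict

lines = [  # (from_bus, to_bus, capacity_MW)
    ("A", "B", 100), ("B", "C", 80), ("A", "C", 50), ("C", "D", 120),
]
cluster_of = {"A": 0, "B": 0, "C": 1, "D": 1}  # assumed assignment

agg = defaultdict(float)
for u, v, cap in lines:
    cu, cv = cluster_of[u], cluster_of[v]
    if cu != cv:                        # lines inside a cluster disappear
        agg[tuple(sorted((cu, cv)))] += cap  # parallel corridors merge

print(dict(agg))  # one equivalent corridor between clusters 0 and 1
```

The real clustering additionally has to preserve electrical properties such as impedances, which is what makes this step much harder than simple capacity summing.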
It took almost a year of work to introduce all the necessary fixes which account for different special features, and now it is done. There is a link to a report which contains schemes for the power systems of every country of the world, for all 193 United Nations countries, and we also have the source code which we have used to produce these schemes as images. If you are interested in modeling any country of the world, please feel free to do that. Now let's look at what we can actually obtain if we apply this approach. This is a net-zero study for Nigeria which we have used in the course of developing the model, as a kind of proof of concept. The most interesting output of this study has been that a net-zero power system for Nigeria can actually be a little cheaper compared with the status quo. Indeed, we haven't accounted properly for the uncertainties which exist in the energy demand for Nigeria, and this work should certainly be continued and applied to every country of the African continent, but it may be helpful to shift a paradigm, and that is actually what it is all about. Next is a study which has been done in collaboration between PyPSA-Earth, Open Energy Transition and the German think tank Agora Energiewende. They have considered the Kazakhstan power system, and the question is whether it is feasible to implement solar and wind faster compared with the current Kazakhstan national development plans. The results are quite encouraging and are currently being discussed at the policy level. And this is the output of a master's study for Saudi Arabia, a country where 99% of the energy mix relies on fossil fuels. The study, which the author has done using PyPSA-Earth, has shown that wind and solar can actually have quite a place in the power system of Saudi Arabia, and it isn't as expensive as could be
expected. That is a case where data accessibility and data availability are a big issue, so these results are quite preliminary, because more advanced optimization methods are needed to account for this uncertainty and also to account for the whole transformation pathway. But what is important, what is the impact of this study, is translating the conversation about possible futures for fossil-fuel-reliant countries from a purely hypothetical level to a level of numbers. Next is a case for Bolivia, where the networks of South America are considered. That is a region where the OpenStreetMap data are of not so good quality, so quite some tricks were needed to restore the topology, and the resulting model has been successfully validated for dispatch on the national level. So it works even if you don't have data of such excellent quality in OpenStreetMap. And here is a case for Malaysia, where we have considered decarbonization of industry. The local feature in Malaysia is that the renewable potential is not so excellent, so we have shown that it is basically possible to decarbonize one branch of the energy sector, but if we speak about the whole national economy, it looks like it makes absolute sense to include in the modeling and the discussion not only traditional onshore wind, offshore wind and photovoltaics, but also something more exotic like floating solar, or perhaps to consider cross-country interconnections. And last but not least, community is an essential part of the whole story. We have different channels of communication, and it is essential for us to build a global community. As we have seen, there are some countries of the world on which the efforts of researchers and developers are focused, while for others there is still little modeling evidence available, but the energy transition is a global thing, and if we want it to work we need to provide the tools, we
need to involve people around the whole world. We can unfortunately confirm that there is definitely a geographic gap in the free and open source software community. Tobias talked about that during the previous day, and now I think we have some understanding of the reasons behind this gap. It is basically quite simple: people in different regions just have different patterns of communication, and that should be accounted for if you want to build an inclusive community. Another part of the story is that many things which we take for granted, like education or even a stable internet connection, cannot be taken for granted in too many parts of the world. But the good news is that problems which cannot be solved alone can perfectly well be solved if we join efforts, and we are doing it, we are solving them. We still have a lot to do. There are research tasks and there are validation tasks, because we can build a power system model for every country of the world, but it would be nice to understand how close we are to reality: what are the modeling errors for each of the components, for the power grid model, for installed capacity, and how far we are from reality in the demand profiles. That validation task is huge, and if you're interested in joining, please feel absolutely free, we would be happy to accommodate you. Another big task is to increase usability; in particular, conda environments and version conflicts inside our whole Pythonic stack are still big questions, and we would be very happy to improve that somehow. Another part relates to capacity building, to improvement of the documentation and to spreading the word, spreading knowledge. So again, we are very happy to accommodate any suggestions and we are inviting contributions. If you are interested, please do not hesitate to ping us using any of our communication channels. So, just a reminder that the energy transition is
a global thing and can be tackled effectively only together. Thank you very much, and I am very happy to take your questions. What's the role of Earth observation for these models? Do you use satellite data to track transmission lines, or do you look for wind turbines or solar cells, or do you just use official data sets for your modeling? Thank you. We do not use satellite observations directly. For the power grid we are using OpenStreetMap data only. While it would be great to supplement them with satellite images, we had a team which was focused on adding satellite-derived data to OpenStreetMap, but this team is currently not very active. That would definitely work, but we just don't have the capacity to do it right now, although we would be happy to revive it. As for installed capacities, we are using a fusion, a merging of a number of open data sets on power plants. I am not sure if satellite observations have been used in any of these data sets, but at least we don't do satellite processing ourselves yet, and we do not use them directly. I also agree with you that it would be a very interesting idea, and it would also be a perfect academic topic. There are some countries which really don't want to collaborate, like North Korea, or other countries where we don't get any data. Well, to answer that directly, we do have data for North Korea, but we would be very careful about using them, because when you are modeling specific countries, I would be very much concerned about the safety of people who are affiliated with those countries. That also goes for China, for example, because for China there are some local regulations which basically forbid going into too much detail about the power system for people who are not approved by the national government. So I would be very careful about delicate areas of the world, but technically yes, it
is possible. So my feeling is that the correct approach would be to try to build collaboration in a more or less safe way, providing tools to people who are safe using these tools. For example, if there is some group in China which is approved by the national authorities as experts in power systems, as people whom they trust, then we may provide the tool and support them in using it in the right way. Although I agree that it is a complex question, and it may get a little bit complicated. First, let me remark that Agora Energiewende is a very good name in Germany, so congratulations on getting them to use this. And my question now: are you also doing storage, like water reservoirs or millions of distributed batteries? Well, I agree that storage is one of the key questions when we are speaking about the energy transition, and we include a number of different storage technologies. It is one of the key points of the energy transition, and we are able to capture them. If you're interested, please feel free to investigate the details; we would be very happy to get your feedback, suggestions and contributions if you see that something can be improved. Actually, we have a huge pull request which should provide an interface to a big list of different storage technologies, and it would be perfect if you could revive it. I was just interested: I've got a friend who's a researcher doing geothermal in Nigeria. Do you have geothermal resources in there as well? Okay, all right, good, thank you. Yes, we have geothermal, and we have quite a recent request from Kenya, where people are interested in including geothermal in a more sophisticated way. Thank you.
Carbon measurement and energy attribution for processes and hardware devices in the Linux kernel
All right, everyone. I hope the mic is working. It's great to be here. This is my first FOSDEM, by the way. And I'm very happy to talk to you all about carbon measurement and energy attribution for processes and hardware devices in Linux. My name's Aditya, but you can call me Adi; that's the first three letters. I'm a grad student, and that's my contact. I'm always very happy to talk to people before, during, and after my talk. Please reach out; I would love to hear from you. So, a bit of background. I'm a graduate student at ETH Zürich in Switzerland, and I do research at the intersection of computer architecture and operating systems. I love this stuff very much. Great. What do we want to talk about? Let's get a bit of brief background to bring everyone onto the same page. Now, when we talk about energy sources in computing systems, you have a bunch of options. You can have direct DC input, power over USB, battery-powered systems, and if you're really exotic, you can even have energy-harvesting devices. Okay? Now, we want to use the minimum amount of energy to perform our task. Why do we want to use the minimum amount of energy? Because energy consumption correlates with battery capacity, and battery capacity is a significant design constraint for consumer devices. All of us have cell phones; we have the recent buzz around Apple Vision Pro and AR devices. These devices are significantly restricted by their battery capacity. So we want to minimize the energy that we use to get the job done. Okay? Now, what is the problem here? What do we want to solve? Let's flesh it out. Energy consumption is defined as power times latency. Power is determined by your hardware; latency is determined by your software. Now, how do we measure this? How do we get this data? Programmers often measure latency using well-established tools.
I'm guessing many of you are familiar with Linux perf, or you have timed your own software using wall-clock time or CPU clock cycles, right? These are well-established metrics and well-established tools to quantify latency. But what if I ask you: do you know of any tools to calculate your application's energy? What comes to mind when I pose this question to you? How would you calculate your application's energy consumption? You would say, okay, Adi, I know, this is very simple, right? Energy is power times latency; we just talked about this. I'll get the power from the CPU. My CPU has this magical interface called RAPL, which stands for Running Average Power Limit. I'll read the value and, voila, my CPU says 15 watts right now. Great. Then I'll time my application, and it turns out to be, let's say, five milliseconds. We put these values into the formula and, great, we have 75 mJ of energy consumption. Job done, let's go home. Unfortunately, this is too simplistic. Let's try to dive into what we missed here. This does not reflect the ground reality, and now I'm going to deconstruct what happened and what we missed. In the first step, we saw that the power was 15 watts. Unfortunately, this model assumes a constant power draw over time. That is not the case. If you actually look at the system, this is what it looks like: you have these valleys and you have these peaks, and if you measure your power at the wrong time, you will end up with a significantly different number than what you should have. On the x-axis you have time; on the y-axis you have the CPU power. Power consumption is not constant, so the assumption of a flat power draw is incorrect. Second, we got the power value from RAPL, the Running Average Power Limit. It turns out that RAPL is only available on Intel, or sometimes on AMD.
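The back-of-envelope calculation the speaker deconstructs can be written out explicitly, if only to make its hidden assumption visible: it treats one power sample as the power for the whole run.

```python
# The naive "energy = power x latency" estimate from the talk, shown only
# to make its flat-power-draw assumption explicit.

def naive_energy_millijoules(power_watts, latency_ms):
    # energy [mJ] = power [W] * latency [ms], assuming constant power
    return power_watts * latency_ms

print(naive_energy_millijoules(15, 5))  # the 75 mJ from the talk
```

As the talk notes, real power traces are spiky, so a single sample multiplied by latency can be far from the true energy.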
ARM, for example, has a very different interface to report power. I would love to share a story. I was doing energy profiling on a server-class system back in university, and I said, oh, I've built this great infrastructure on my Intel platform; let me just run it on ARM and see what happens. The moment I ran it on ARM, Linux perf said, I'm sorry, I don't recognize the CPU, I can't give you any numbers, and it just crashed. So all of these interfaces are really different, and you need a significant amount of engineering to make sense of them across different platforms. The second limitation is that we do not have uniform interfaces or formats to measure power reliably. All right, let's try to go deeper, let's try to get closer to the ground truth. Our model got the power value from the CPU. What about the other devices? I'm right now broadcasting from this device and... oh my God. I'm sorry for this. I hope not. Give me a sec. Beautiful. Okay, back to the presentation. We were talking about the impact of devices like the screen, the memory, the network cards. We don't know how to quantify them. So we did a lot of experiments, and it turns out that these devices very often dominate your power consumption, and our findings are also corroborated by similar observations at Google. What Google did, while trying to optimize their data centers, was a huge amount of profiling on their server-class CPUs, the heaviest CPUs you can get on the market. They observed that DRAM dominates their power, because DRAM is burning power all the time. The CPU turns on and off, but DRAM you cannot turn off; remember, it is volatile. So you need to break out of the mindset that the CPU is the be-all and end-all. Let me try to summarize everything: we are inaccurately calculating only a fraction of the system's actual energy consumption.
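The vendor-specific interface problem is visible even from userspace. On Linux, RAPL-backed energy counters are exported through the standard powercap sysfs tree; on platforms without it (most ARM systems), that tree is simply absent. A sketch that probes for it and degrades gracefully, instead of crashing like the perf run in the anecdote:

```python
# Probe the stock Linux powercap sysfs tree for cumulative energy
# counters (microjoule files backed by RAPL on Intel/AMD), returning an
# empty list rather than failing on platforms that lack it.

import glob
import os

def find_energy_counters(root="/sys/class/powercap"):
    """Return paths of all cumulative energy_uj counters, or []."""
    if not os.path.isdir(root):
        return []  # platform without powercap support
    return sorted(glob.glob(os.path.join(root, "*", "energy_uj")))

counters = find_energy_counters()
print(counters or "no RAPL/powercap counters on this platform")
```

Note that even where the counters exist, they only cover the CPU package domains, which is exactly the talk's point about screens, DRAM and network cards being unaccounted for.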
And I would love to put this in the form of a quote for you. This is not from me, but I like it very much: we cannot improve what we cannot measure. So we first need to understand how to measure energy correctly, and that's what my project is all about. That's what I love to do. What is the goal of what I'm trying to do here? My goal is to develop a framework to accurately and reliably measure the energy consumption of processes in the kernel. All right, say we can get this data. What is it for? Because data without a use does not get used. Once we have this data, we want to report it to end users in an easy-to-understand format; end users should be able to make sense of the number: what does this number mean for me? We want to report it to programmers in a way that improves their actionability, that enables them to change their code to move the numbers. And we want to report it to system designers, to enable them to iterate much faster over low-energy, low-carbon designs. Okay, so let's dive deeper. What do we mean by a framework? What are we trying to do? Let's flesh it out. A framework comprises models and tools. Let's break down these two words. A power model is how we think about a device. When I say that I want to measure power, a power model is the mental model that I will use to get the value. It turns out that these power models are often very poorly understood for a number of devices. For example, DRAM power models are often not available to the public or to academia; they're, let's say, a proprietary trade secret. Don't quote me on that. And once we have these power models, we can build tools which accurately calculate power based on these models. One tool that I would like to mention is the nvidia-smi utility, which allows you to read the power of a GPU. It's a good tool.
So, let's pull it all together. What I would like you to take away is that we need accurate models, first and foremost, and second, reliable tools to calculate energy consumption correctly. We have defined our problem and our goalposts, where we want to go, and now let's see how we are going to get from point A to point B. Great. Before I dive into the mechanism, I would like to cover what has been done before. All of us have been here the entire day, right? We love energy and we love efficiency. If this is such an important problem, why didn't people solve it before? People did try to solve it before, and I'm going to describe what they did and why that is insufficient, why we need to do better. Okay, on the screen you can see a screenshot of a tool from Intel known as PowerTOP. The first column reports a power estimate, and on the right side you have the description of the particular device, interrupt or process for which that power estimate is calculated. Now, what are the challenges? Well, first of all, I believe in energy, not power. Power is an instantaneous quantity. What do we mean by that? Let's break it down. If you have a graph of power over time, power is a single point on that graph; energy is the area under the graph. We want to calculate energy, because energy is what correlates with your battery drain: your battery supplies you energy, and power is just one particular instant in time. Second, PowerTOP has a vendor-specific implementation. I hope that is clear. Third, what is the actionability? I just showed you this data, this screenshot. It says, oh, my display backlight is taking 350 milliwatts; great, this particular process is consuming 292 milliwatts. Okay, fine. The question that comes to mind is: what is the use for me? What is the actionability of this data for the programmer?
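The point-versus-area distinction above is just numerical integration: given sampled power over time, energy is the area under the curve, which the trapezoidal rule approximates. The sample values below are made up for illustration.

```python
# Energy as the area under the power curve: trapezoidal integration of
# sampled power. Timestamps and power samples are illustrative only.

def energy_joules(timestamps_s, power_watts):
    """Trapezoidal integral of power samples -> energy in joules."""
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_watts[i] + power_watts[i - 1]) * dt
    return total

t = [0.0, 0.1, 0.2, 0.3]       # seconds
p = [10.0, 30.0, 12.0, 11.0]   # watts, spiky like the trace on the slide
print(energy_joules(t, p))     # roughly 5.25 J over 0.3 s
```

Picking any single sample from `p` and multiplying by 0.3 s would give anywhere from 3 J to 9 J, which is exactly the wrong-time-sampling problem the talk describes.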
How does the programmer change the code to move this number? I don't know. How do I fix something that I don't know how to fix? That is the gap I would like to bridge, right? So let me dive into the guts of the system. This is the system design. On the screen you can see an elementary flow chart which summarizes the system at a very high level. It is a regression-based system, which has two kinds of inputs: the parameters, and the inputs to those parameters. First we calculate the parameters, and then we calculate the inputs to the parameters. Great, we have time, so I will go into details now; please bear with me. Okay, let's first look into the parameters. How do we determine the regression model's parameters? There's an algorithm for this. First of all, we turn off everything we can turn off in the system and measure the baseline draw. This is what we refer to as minimizing the system load. Then we pick each device one by one, isolate the impact of the device on the baseline load, and measure the drain multiple times. Say I turn off everything and then I turn on just the screen, and I measure the difference between these two values. The difference is the impact of the screen on my baseline. Then I also do one more thing: I sweep the screen. I change the brightness of the screen from minimum to maximum, because obviously the minimum brightness is going to have a different power draw than the maximum brightness, right? I hope this makes sense. Are you still with me? Okay, so this was just an example, but what we're trying to do is quantify the impact of each device on the baseline. Now, I would love to give a metaphor to help explain this better. Imagine that you have a water tank, and in this water tank there's one single input and there are 10,000 tiny outputs.
The problem you're trying to solve is: what is the rate for each of the output pipes? You cannot measure it directly. These 10,000 outputs go on and off on their own at any time, and you don't have levers to control them. What you're trying to figure out is the drain rate for each of the output pipes; that is essentially the problem. So what you do is you turn off all the outputs, then you turn on one single output, and you look at the difference in the tank level before and after. That is essentially what we call isolation, or in academic terms it's sometimes known as an ablation study: we isolate the device and measure its impact. Next, we repeat this process for all the pipes in the system and try to get a reasonable estimate of the impact of each pipe. Great. So that was the first step, the device-specific measurements. The second step is the kernel process accounting step; this provides the inputs to the regression model. We have the parameters from the first step, and now we need the inputs. How do we determine these inputs? Sorry, did I hear a question? Okay, great. Right, how do we determine the inputs? We isolate the impact of each process: we identify how much CPU time the process used, how much network activity there was, the screen wakeups, file handles and memory usage, and we put all of these numbers together into the model. This is what gives us a predicted energy consumption value for that process. Okay, so what are the challenges? This seems very simple. Okay, you've done this work, but what did you not tell us? Here comes the part that I did not tell you. First, this is an estimated value, not the reality, and it is really hard to find out the reality. There's a very famous line in the machine learning community.
It goes: all models are wrong, but some are useful. So my goal here is to build a useful model that I hope is less wrong. I would love to make it perfect, but unfortunately we cannot make it perfect; I would love to make a useful model first. Second, there's a bit of a catch-22 situation here, if you observe it. What is the catch-22? I am running a measurement process; there's a process doing measurement on my system, and that is also going to create load. So there's going to be a skew in the values that I get because of my measurement, and the more accurate I want it to be, the more skew it is going to create. We want to understand what is the right amount of accuracy that is still useful while also minimizing the bias. This is very challenging, because it is different for every system, and it's a problem that I'm honestly struggling to solve; I would love to get your input if you have any. Great, next challenge. There are millions of devices out there, and these millions of devices have billions of ICs inside them. Very often we don't even have the data sheets for these ICs to correlate the values that we see. The estimates we get can range across two to three orders of magnitude: one device can say, oh, I use one microjoule, and the second can say, oh, I use 10 milliwatts, and those numbers don't make sense together; they really blow you away. So how do we maintain our sanity in the face of the variance we see here? One more challenge: assume you say, let the users supply this data, let me collect it into a centralized farm and then try to make sense of it. Should the users share this data? Would users share their device usage data with you and allow you to put it on a centralized server? Who will own that data? Because there's enormous value in it. This is something I would love to get your input on.
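The two-step scheme described above, ablation-derived device parameters followed by per-process attribution, can be sketched with synthetic numbers. This is not the speaker's implementation; the wattages, device set, and the crude per-process attribution rule are all illustrative assumptions, where the real system would obtain them from measurements and kernel accounting.

```python
# Toy sketch of the regression-style attribution described in the talk.
# Step 1: ablation. Measure "everything off" baseline, then one device
# on at a time; the delta is that device's power parameter. All numbers
# are synthetic.
baseline_w = 2.0
with_screen_w = 5.5   # baseline + screen only
with_wifi_w = 3.2     # baseline + wifi only
params = {
    "screen_w": with_screen_w - baseline_w,  # ~3.5 W when fully on
    "wifi_w": with_wifi_w - baseline_w,      # ~1.2 W when active
}

# Step 2: per-process inputs from (hypothetical) kernel accounting --
# CPU seconds used and the fraction of time each device served this
# process -- fed into the linear model to predict energy.
def process_energy_j(cpu_seconds, screen_frac, wifi_frac, duration_s):
    cpu_j = baseline_w * cpu_seconds                     # crude CPU share
    screen_j = params["screen_w"] * screen_frac * duration_s
    wifi_j = params["wifi_w"] * wifi_frac * duration_s
    return cpu_j + screen_j + wifi_j

# 10 s window: 1 s of CPU, screen on half the time, wifi 10% of the time
print(process_energy_j(1.0, 0.5, 0.1, 10))  # ~20.7 J
```

The catch-22 above applies directly here: running this accounting itself adds CPU seconds that the model will dutifully attribute to somebody.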
One more challenge is validation. We got an estimated value; how do we make sure it is as close as possible to the ground truth? In an ideal world, I would have infinite money, and I would go to every computer in this world, put a probe next to its CPU and say, oh, this says 17.5 watts and my tool says 17.5 watts, great job. I cannot do that, because I don't have that much time. So we want to minimize the difference between the ground truth and what we see in the tool. There's a significant challenge in making sure that what we see is the reality. Remember, there's accuracy, there's precision and there's correctness, and this trifecta together makes this a very difficult tool to get right. But still, I believe it's going to be great; I'm very happy to work on it. Great. So once we have the energy consumption, how do we link it to carbon emissions? We just saw that we can calculate energy consumption as power times latency. The carbon footprint can be calculated by multiplying this number by the carbon intensity of the energy: where did the energy come from that powered the device you were running? This composition depends on multiple factors: it can include the geography, the time of availability, the cost of generating that energy. Fortunately, there are good tools and libraries out there which can simplify this problem for you. Energy composition is, let's say, something that I believe people will solve faster than I can solve the measurement problem; that is why I would love to focus on the latter. Great, all done. Let's get back to the good stuff. How is this going to look? How is this going to make your life better? If you're an end user, I would love to ship you an application like this, an application which tells you how much energy your Inkscape usage consumed, and how much energy your screen was dissipating.
So as an end user, you can remember to turn off Inkscape when you're not using it. Or you can figure out, oh, I need to deliver a presentation to so many people in five minutes, I'd better save my battery, otherwise I'll be in deep trouble. So it's for those use cases where you want to maximize your battery life as an end user. As a programmer, we want to expose an API that enables programmers to take action. We want to indicate the devices and the code regions which consume the maximum amount of power in the code, and enable the programmers to change it, to modify it, to fix it. So actionability is the primary concern for programmers. In an ideal world, I would love to have direct suggestions in the IDE that tell the programmer, oh, this code is going to burn this much carbon, you'd better change it. And for the system designers, what we want to do is enable them to iterate on their designs faster. We want to enable system designers to discover designs which are really low on energy, really high on performance, really high on carbon efficiency. There's typically a design space that designers explore, and we want to enable them to explore that design space faster. That would be the end goal for this tool. Great. So what is the takeaway from this talk? There are two things that I would love for you to take away; if you forget everything else, okay, just remember these two things and I'd be very happy. First, we cannot improve what we cannot measure. We must measure correctly, okay, to improve things. And second, we need to break out of the CPU mindset: non-CPU system components can dominate your power. Please remember that. Please remember these two things. And the next time I see you, please come say hi and I'll buy you lunch. Okay. Great. Thank you very much for listening to me. It's great to be here. It's great to talk to you. Please be in touch.
Please reach out and, oh boy, we're out of time, but I'm very, very happy to get your questions. Come talk to me. There's still like two minutes for questions. So if there are any questions, please. Go for it. There's one in the back. Yes. So, hello, and thank you for this presentation. I hope you're not going to hate me for this question, because I'm primarily an infrastructure guy. One thing I was always concerned about is redundancy, like scaling things twice so that if one dies, the other takes over. Is this part of your thinking and scope? Or does the question make sense? I'm really sorry, I don't fully understand what you mean by redundancy. I mean, sure, I understand, but redundancy is trying to solve the problem of fault tolerance. Okay. It's not trying to solve the problem of efficiency. I'm trying to solve the problem of efficiency. So redundancy is an orthogonal concern to mine. Does that make sense? Yeah. Thank you. Thank you for the question, I really appreciate questions. Yes. Did you try to measure the overhead of monitoring the energy consumption? Yes, that's a great question. No, we did not. On one hand, I'm afraid it's going to be huge. On the other hand, I don't know. It's like an infinite recursion, you know: the tool is what measures the impact, but how can I measure the impact of the tool itself? I don't know. I hope it's small; that's what I would love to believe. Yes, that's what I want to believe. Yes, please. Thank you very much. It was great to be here. Thank you.
Advanced Linux Power Management Evaluation using Perf
So, hello. Let's start. I think I have 15 minutes in all, so I will hurry up here. In the previous slides and presentations, we saw the overall picture somehow, and in the last talk, we dug into one system. And in this talk, we also want to dig a little bit more into the details of how to analyze the power consumption. Sure, sure, sorry. What we saw in the last presentation was the power consumption of one system, a little bit similar to a power supply view: this task consumes this and that. But the question you often have, after seeing this data, is how you can optimize your load, for your server, for your embedded product. What are the causes why the application runs too often, the system wakes up too often, cannot go into deep C-states and P-states and such things, right? In the end, it's the hardware that consumes the power, and you can save power if you put things into deep sleep states or reduce the frequency. And this is really important to save energy. What we did in the past was writing scripts to optimize a workload and find the causes why an application runs too often, right? It runs often and cannot go into deep sleep states. This is important for power optimization. And what I present in the next couple of slides is a tool that helps you to optimize your workload and makes this visible. So what we are talking about is a perf script, an extension to perf. It's not yet mainline. I will send this script to Arnaldo on the mailing list and hopefully it gets merged quickly. But once it's merged, it's really easy to use. It's just an apt-get install and everything works out of the box. And also for Yocto and Buildroot, it's really easy to use these things afterwards. It's also important that it can be used on embedded systems and everywhere. How does it work?
It's just a record call where you record your workload after the workload separator, as always. And here I record for 60 seconds, one minute, the workload on all the CPUs. So you record everything for one minute, fine. And then you start with the report, the power analyzer, and it has different modes. Because I have just 10 minutes, I show one mode here, but there are different modes for different optimizations, analyses, and things. So what are the modes? There are several modes that can be activated and used. And you just activate the mode you want to focus on and dig into the details, right? This is how things work. What's also important: every mode uses different trace points in the kernel. So usually you record only the trace points you require for the particular analyzer. Because if you record everything, every trace point, you get a huge amount of data. So normally you limit the data. How does it work? There's the perf script. As always, you record data, as we saw, for one minute here. It can record all the trace points, but on the other hand, you can also record only the data that is required for your analysis; which trace points are required is documented. Then you have the data, and then you start the script, the report, and it outputs all the analyses. Here I start it with the timer mode, so: what are the timer events? Because there's a lot of data coming out of this, you can often use this output directly and see, here's something that's not working well, too much timer interaction, for example. But what you can also do is some post-processing, to create graphs or filter things afterwards, because it's a lot of data. And here, just as a showcase, this is one image that's created. You see the time and you see a workload. It's a logarithmic scale here: how much of the time timers are firing.
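The record/report flow just described might look roughly like this. Since the script is not yet mainline, the analyzer name `power-analyzer` and its `--mode`/`--file-out` arguments are placeholders of mine; only the standard `perf record` flags and the kernel timer tracepoints are real:

```shell
# Record one minute of activity on all CPUs (-a); "sleep 60" is the
# workload after the separator. The tracepoint list would come from
# the script's documentation (timer mode needs the timer tracepoints).
perf record -a -e timer:hrtimer_start -e timer:hrtimer_expire_entry -- sleep 60

# Run the analyzer over the recorded perf.data in timer mode.
# "power-analyzer" is a placeholder name for the not-yet-merged script.
perf script report power-analyzer --mode timer --file-out timers.csv
```

The `--file-out` step matches the talk's point that for post-processing the data goes to a CSV file rather than standard out.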
Timers are one cause that triggers a transition from a deep C-state to the active C0 state. Timers are not that good. Often, if you begin analyzing things on your desktop, you see, I think this is kitty, the terminal I use here. It has wake-ups all the time. Why are there wake-ups here? And then you often see some buggy applications; they are constantly triggering your system, and this prevents it from going into a deep C-state. These are the causes that prevent this. So it's really important. And here you see a workload I started, and you see all the timers that are correlated with starting that workload. Here you see a lot of kernel timers, and then you can start optimizing things. This was just a focus on the timer events, but there are a lot of other events as well. These are other subsequent analyses, also just for the timer events. You see here, for a tickless system: normally, if there is no load, the kernel can really go into a deep sleep state, and then it shuts down the timer tick altogether. But does it really stop the timer tick? You can see it here in these images, and you can analyze and optimize things. What are the kernel timers that trigger your system? If you look at the graphs a little bit, the resolution is not that good, but you see that there are timer ticks all the time, and the network interrupts and timers are firing here, and you can optimize this, if you see this and you know what's happening. What we see here in this graph is the timers that are firing for each particular task, so you can optimize your tasks as well. How many timers are there? I often see in production environments that timers fire all the time, not correlated with anything. There are also system calls for the timer granularity, with which the kernel can optimize things.
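One such knob is the per-thread timer slack, settable via prctl(2) on Linux; "timer slack" is my naming for the facility here, a standard kernel feature, though not necessarily the exact call the speaker had in mind. A minimal Python sketch using ctypes:

```python
import ctypes

# Constants from <linux/prctl.h> (Linux-only).
PR_SET_TIMERSLACK = 29
PR_GET_TIMERSLACK = 30

libc = ctypes.CDLL(None, use_errno=True)

def set_timer_slack_ns(ns: int) -> None:
    """Tell the kernel it may delay this thread's timers by up to
    `ns` nanoseconds, so wakeups can be coalesced and the CPU can
    stay longer in deep sleep states."""
    if libc.prctl(PR_SET_TIMERSLACK, ctypes.c_ulong(ns), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_TIMERSLACK) failed")

def get_timer_slack_ns() -> int:
    """Read back the current timer slack of this thread."""
    return libc.prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0)

# Allow 5 ms of slack: timers need not fire exactly on time,
# letting the kernel align them with other pending wakeups.
set_timer_slack_ns(5_000_000)
print(get_timer_slack_ns())
```

The trade-off is exactly the one the talk describes: a timer that is allowed to fire "a bit late" can be grouped with others, instead of waking an otherwise idle CPU on its own.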
For example, with the introduction of the hrtimers, the high-resolution timers, in the kernel, you can align timers so that they are not spread out but triggered exactly at a particular moment in time; it's a simple system knob. You can also say, oh no, it's not so important that this timer triggers exactly at this time, so that the kernel aligns timers to a particular time and allows a deeper sleep state again. This knowledge can be combined with what you see here, for example: where are the timers? Right, CPU 0 is somehow special, the timers are there. Can you move, for example, tasks to CPU 1 so that the other CPU cores can go into a deeper sleep state? All this is important to do an optimization. There are some general options. Some are not always required; these can be turned on with a particular flag. There's a CPU option: often you want the analysis on a particular CPU, so you can limit the data. And there's a file-out option: if you want to do post-processing, as we saw in the images, the data is not put on standard out but written to a file, and you can use it there. The data is also written in a sanitized way, so that you can just use pandas to read the CSV data, and the post-processing is really easy. There are multiple modules provided. This was just a sneak peek at the timer module, but there are a lot of other modules as well; you can use them later on. Due to the time limit, I just highlighted this timer module. But one last sneak peek here, for example, is the governor. The governor is the component within the kernel that does the prediction and commanding of the C-states, the deep sleep states. You can select a different governor; it's normally the menu governor, but there are other governors as well. And here you see how often each C-state is commanded. And what is also analyzed is: was this good or not?
Because the kernel is doing guesswork, right? It thinks: the next timer will trigger in 10 milliseconds, there's a workload then, so it puts the processor into a particular C-state. But was this the right decision, or was the sleep too shallow? This is also important. And here you can debug the governor. A student of mine also discovered a bug in the AMD support: for one particular C1 state, it switched to the wrong state all the time. I think the fix will be released in the next couple of weeks. So it's really important to see: does the governor do the right job here? This is visible with another analysis, and there are multiple other post-processing steps. And yeah, that's all. I hope this will be integrated into mainline in the next couple of weeks. But if you want, you can use this kernel tree and this particular branch. It's just a perf script, really easy to use out of tree as well. The post-processing scripts cannot be shipped with the kernel; that's not how the kernel works. These are Python scripts and they will always be available here. And in the end, well documented, hopefully. So yeah, that's all. Questions? Yeah, perfect. Questions. I always get this question. Processor coverage: is it just x86? What coverage have you got? I mean, now look, I've got an M1 Apple thing. Would I be able to run it on there if I run Linux on that hardware? Yes, this script will work on ARM and on x86 for Intel and AMD. There are differences in the P-state tracking, because since the introduction of HWP with Skylake, the P-state selection happens in hardware, so some of it will not be visible there, but it will be visible on ARM CPUs. Some analyses will work, some will not, but it's just Linux and all the major architectures. The more software-side analyses, like the analysis of scheduling events, will always run, but the more hardware-specific analyses may not work. But yeah.
Just a follow-up to the previous question: will it work for Graviton, all these kinds of cloud proprietary processors? Yeah, it will generally run there. As long as it's Linux on ARM, it will just be the same. So no difference there. Another question? If not, later on we can install the script on your PC and test it. Hi, Aaron. Just a follow-up on the previous question: there's actually an extra library, libopencsd, which gives you a whole lot of extra stuff on most ARM cores; not necessarily Apple's and Amazon's ARM cores, but any that actually come from standard designs. Sure. One goal here was that it runs everywhere, right? It must be general. And we deliberately did not go into the eBPF world. There are advantages to doing things in the kernel, aggregation in the kernel, but this sometimes has problems on specific ARM SoCs and embedded products. So the design goal was really that it runs everywhere, is easy to use and generally available. Working with eBPF, doing things in the kernel and filtering unwanted data out there, has some advantages, right? But then you need a toolchain on the embedded product, so it's not that great. Everything I told you was the idea behind the design: no extra libraries, keep it a bit minimal so that it works everywhere. If you want to do more, and often you do want more when you analyze a particular task, like its scheduling behavior, you need more custom scripting and libraries as well. But that is not here. I think there's already a lot of data easily available. But if you want to do more, you need more scripting and things like that. Sure, it's a compromise.
Maybe a question from me: can you give us a few insights about the community? How many developers, how many people contribute? Currently I'm the main developer. But in the end it's just Python scripts, so it's not really rocket science. And there are students also working on this, helping out and looking at the details. But yeah, it's not that much magic. It's just putting things together and making them easily usable. The trace points, and Steven Rostedt and all the infrastructure that the kernel provides, are the main drivers that make this possible, right? It's just a script on top. Thank you so much.
How can Open-Source help the Wind Power industry?
Hi everyone. Can everybody hear me well from the back? Okay. Good. So, to introduce myself: I work for ZF Wind Power. It's not a company doing software; what we make are turbines. Okay, we produce components that go into producing wind power, actually. But I'm part of the digital team. Why do we have digitalization? We'll talk about that later on. But to start, let me tell a bit about my story with wind. I didn't start out working with turbines. My love for wind dates back much earlier, when I used to live in beautiful Marseille. Marseille, a nice town, for those who have been there. Less than six hours from Brussels, and it has a great resource: wind. At the time, I used to sail. You come to the Vieux-Port, the city centre of Marseille, have your pastis, and you will see beautiful boats. You go out, the sea is nice. What has this to do with the energy transition? Well, historically, Marseille's main industry was fossil fuels, oil. We have a place called Fos-sur-Mer. A great place to sail; it has great wind, one of the best winds in Europe. And just in front of the oil factories now, what do we have? The latest technology of turbines, floating turbines, meaning devices that you can put out at sea, in very deep water; a very recent technology. Those are, if not the most modern, among the most modern in France. The power is 8.4 megawatts each, and there are now three of them. They can already power a small city like Martigues, which is just close by, for those familiar with the area. So from the love of wind, from sailing, now I can see that this leads to the energy transition. Right? Is it only Marseille? No. What happens in Marseille does not stay in Marseille. Valid also for energy. These are graphs that I created myself from data from Kaggle, so open data. What do you see there? Well, you can see that around the world we have some big production of renewables in general. And I guess there will be somebody here who has been to, or is from, South America.
South America is a place with a strong input from hydropower. But to stay here in Europe: wind is a big deal, and it's increasing. If we look at countries like Denmark, we are already at almost half of the national energy production coming from wind. It's a combination of good wind, because up north it is really good, and of the politics of willing it. I told you about Marseille: there is good wind, but France is not even close to that level. That means wind is definitely one energy resource that we will use more and more. Very important. And it produces at a big scale: we already have 25 megawatts in Marseille now. It's huge, with only three turbines. All good? All great? Well, we have some problems in general with big installations. Things can go very wrong. Very wrong. And this is not just a matter of changing a small component. Let's say that a turbine, even on land, not as big as the ones near Marseille, gets faulty. It has to be stopped. What happens? Notification and processing: it has to be reported, we need to tell someone. Two days. Then there is the inspection: getting the team to inspect the fault. Two weeks. And if I have the replacement component locally, it will be six weeks to replace it, but maybe much longer; the component may even be on the other side of the world. We don't know. And then the repairs, a couple more weeks. So for a turbine of 3.5 megawatts, not the latest technology like the ones I showed near Marseille, the whole intervention, if you are lucky, takes, let's say, 10 weeks. That is lots of money: 125k at least. A lot. How do we tackle this problem? By forecasting and by optimization of spare parts, so getting the spare parts in-house in advance, and by getting started as quickly as possible: a faster return to operation. And this is done by treating data. What do we do? Ideally, we monitor and predict. So there is an alert. It has to go to the cloud. I have to classify the failure already in the cloud; I have to know what it is about.
I have to prescribe a solution. I have to find the spare parts. And I have to forecast when I need to apply my solution. What, where, when. That means: alert, data are collected and analyzed, and compared to historical data. The graph that you see there looks like an exponential curve. I won't get into the Weibull modeling; this is not the moment for mathematics, but it is a model that predicts well when cumulative failures will occur for a certain type of failure. As I said, I won't get too technical; this is not the moment, but we can discuss it later on. How do we do that? Well, wind turbine data and production data come in; the production data is from us, from the people who manufacture. We get it to the cloud and we do prescriptive maintenance. What does prescriptive mean? Well, let's see quickly. Reactive: I fix the failure. Okay, I have a puncture on my bike, I change the tube. Preventive: I do it regularly, like changing the oil in the car; that's what is typically done also for bigger installations. Predictive: that's what I talked about just now. Prescriptive is AI. Okay, familiar with that: AI tells you this is going to happen, please do this. Data analysis with open-source software allows more and more sophisticated maintenance. We have just been talking so far about AI, powered by Python and so on; we have seen very good demonstrations earlier on. And what is our digitalization tech stack? To get more specific: I already introduced Python. Then pandas to treat data frames, to treat data, at least at a small scale. Lifelines is a package that implements Weibull models and is open source; I had my own version where I added some modules. Docker, Git, of course: we work as a team. And it goes to Azure DevOps. Notice, I talked about the cloud, I talked about DevOps. DevOps is not just a technology, it's a way of working. That's very important. The technology allows us to work in an agile manner. And that's how we get results.
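The Weibull curve mentioned above can be sketched in a few lines. The shape and scale values here are made-up illustrations, not fleet data; in practice a package like lifelines fits them from the recorded failure history:

```python
import math

def weibull_cdf(t: float, shape: float, scale: float) -> float:
    """Fraction of units expected to have failed by time t.

    shape < 1 models infant mortality, shape > 1 wear-out failures;
    scale is the characteristic life of the component.
    """
    return 1.0 - math.exp(-((t / scale) ** shape))

# Illustrative only: a 200-turbine park with a wear-out failure mode
# (shape=2) and an assumed characteristic life of 10 years.
fleet = 200
for years in (2, 5, 10):
    expected = fleet * weibull_cdf(years, shape=2.0, scale=10.0)
    print(f"by year {years}: ~{expected:.0f} cumulative failures expected")
```

A curve like this is what lets the team size the spare-parts stock in advance instead of reacting to each failure.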
So what do we want as a result? Reduced downtime: we want to shorten it. Instead of 10 weeks, let's get less. Reduced cost: we need to have a proper stock level, and for that we need predictions. Reduced unplanned maintenance: we don't want the scenario where we have to call the technician and the technician has to come from the other side of the world. And to avoid, of course, consequential damage, by addressing recurring failures. Okay, if I see that a certain bearing is always failing, let's address that. So, how do I know that I'm not just talking hot air? Okay, this all sounds great: you use AI, you use open source, but does it work? Well, what did we do? As producers, as manufacturers, pardon, we went to one of the customers and we proposed a pilot project: we apply our techniques, let's see the results. Okay. How did it go? Well, 50% less alert processing, all these alerts. Unplanned field inspections: 60% less. We strongly reduced the lead time to repair, because we could forecast and we had the right parts in stock. And in general, the annual energy production went up by as much as 0.5% over the whole park, with all the turbines. This means lots of money. Of course, for corporate reasons, I cannot tell too much, but you can figure it out. To conclude, okay? We can go more technical later if you like. The take-away messages are: a fragmented value chain affects the wind energy efficiency very badly. Don't have a value chain which is all dispersed. But data insights, and very good communication of the data, have great benefits. We reduced the alert processing effort. We have prescriptive maintenance, which allows us to decrease the time to repair, and we increased the overall efficiency. The annual energy production is one of the main KPIs, not only for wind power but in general for any energy source.
All this could be achieved with open-source software. Okay. I showed more or less the full stack. Finally, it was the DevOps practice, not only the software, that allowed the success of the pilot project. And I guess here we have a lot of people who are familiar with DevOps. Now, I guess I still have a couple of minutes for questions. Anyone? Yes. Hi there. I used to work for Siemens wind power and they had a predictive maintenance team. I'm just wondering, have you found any other companies, as you've gone open source, kind of using your tools? Well, I'm not dealing directly with the customers. In general, with the customers, we just propose our solutions and we exchange data. For example, if Siemens Gamesa has failures, we communicate about the failures with them and we can suggest a stock amount for a certain component, for certain turbines. But it's not that we are like a software company going out to sell that. You see, it's more like the normal customer relationship when you sell parts. But how can we make, let's say, predictions? How can we interact? How can we serve the customers better, even as a company? Good analysis of data, and for that we use open source.
Energy optimisation: smart home meets smart district
Good afternoon. My name is Rik Barillot. I've been a core member of OpenRemote for a bit more than... louder? Okay, sure. A core member of OpenRemote for a bit more than 12 years now. I'm not the person who was supposed to give this talk, so I'll do my best to work through it. Don't hesitate to come to me afterwards, and I can point you to some of my colleagues who worked on those projects. A bit away? Yeah. Okay. And speak louder? Okay. So, OpenRemote, it's a 100% open-source IoT platform, so it does whatever you expect from an IoT platform: talk to the devices, have some logic, and user interfaces. We'll come back to that a bit later. So, open source, fully free, available on GitHub, and a community throughout the world that's pretty active. But also some projects that we work on with companies; that's mainly what the core team does professionally, working on those projects. There are projects in home security or smart cities, typical IoT projects, and more exotic things like smart clothing and architecture, and of course a lot of projects in the energy domain: energy management, but also some linked to other aspects of energy. And we'll go into a bit more detail on the Nottingham city project later. So looking at OpenRemote, what is it? It's mainly a middleware developed in Java. It has a database that is used both for the configuration of the system and for the state of the system: the current values of your sensors, but also all the historical data. It has quite a few connections using standard protocols, so you can connect to gateways or to data feeds, we'll see that later, or to some proprietary hardware. It has a set of user interfaces. You have the standard, more management-oriented user interfaces where you can configure the system, see the values or trigger some actuators. You get Insight, which is a dashboarding kind of application.
But we also have a set of web components, freely available, that you can use to build your own custom application for a given project. And so you have an application that you can access through a browser, or you can embed it into a mobile app, what we call the consoles. And you can also connect to other systems like Grafana or Power BI if you want extra features. Then you have, of course, a mechanism for the logic. We support different types of rules engines: simple ones through the UI, in an if-this-then-that style, or more advanced features like Groovy scripting, if you want to go really deep. There is a set of default services, building blocks that you can use, for instance, to push notifications to the mobile phones, to place devices on a map, or to implement optimization services, which is what we'll talk about in a minute. And this is, of course, built with security in mind, so there is a strong identification, authentication and authorization layer in the system. So, coming to energy optimization, we'll talk about two things. First what we call the smart home, but it can very well be a smart office or even an office complex; basically, it's the concept of an island behind a meter, with kind of a sole proprietor of the island. And then when you move to the smart district, it's a composition of many islands behind one transformer. The problems are a bit different, but the system is the same. So if you look at the system: you have your renewable energy, so solar and wind. You have the grid, both import and export. You have a battery with charge and discharge, and you have your loads, your consumers, which can also sometimes feed energy back into the system; some electric vehicles can do that. So the goal for the smart home is to optimize either based on the cost, so you want to pay the least amount, or on the environmental footprint, so you want to be as green as possible.
The data that we have to do that: for the renewable energy, we estimate the production based on the peak characteristics of the installation, so how much your solar panels can produce, and on weather data, so we can derive an estimate from that. For the grid, we have dynamic tariffs: people can, for instance, have contracts where they pay a different tariff by the hour, or even by the quarter of an hour, and we have the data to know those costs; but there is also a carbon cost associated with the type of energy that is produced. The battery can charge and discharge, but there is also a cost, the levelized cost of storage: for instance, if your battery costs 1,000 euro and it can do 1,000 charge-discharge cycles, every charge-discharge cycle costs 1 euro, so you need to take that into account when optimizing. And for the loads, we have the past consumption, and we use a weighted exponential average over it to predict the future consumption. So what we are trying to optimize, as I said, is minimizing the cost or the carbon exhaust based on all this data. And so the system will control what we call the flexible loads. Depending on this data, it can decide when to charge or discharge the battery, it can decide when to charge or potentially discharge the electric vehicles, or it can decide to control heavy loads, like heat pumps, where you have a bit of freedom in when you power them up or in the temperature set point, things like that. This can be automatic, of course, but it could also be simply manual, by pushing information to the end user through the UI. When you move to the smart district, the collection of islands behind the transformer, you have a slightly different problem, which is the transformer that sits between your district and the grid. It has a peak capacity, and what you want to make sure is that you stay under the capacity of the transformer, both for import and for export.
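Two of the inputs described above are easy to sketch: the levelized cost of storage, and an exponentially weighted average of past consumption as the load prediction. The numbers and the smoothing factor are illustrative assumptions, not values from the projects:

```python
def storage_cost_per_cycle(battery_cost_eur: float, cycle_life: int) -> float:
    """Levelized cost of storage: the talk's example is a 1000-euro
    battery rated for 1000 charge-discharge cycles, i.e. 1 euro/cycle."""
    return battery_cost_eur / cycle_life

def ewma_forecast(history_kw: list[float], alpha: float = 0.3) -> float:
    """Exponentially weighted average of past consumption, used as the
    prediction for the next interval; alpha weights recent samples
    more heavily (0.3 is an arbitrary illustrative choice)."""
    estimate = history_kw[0]
    for sample in history_kw[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate

print(storage_cost_per_cycle(1000, 1000))     # 1.0 euro per cycle
print(ewma_forecast([2.0, 2.4, 3.1, 2.8]))    # forecast leans toward recent load
```

The per-cycle storage cost is what lets the optimizer weigh "use the battery now" against "buy from the grid now", on equal footing with the dynamic tariff.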
So when there is a really high production of renewables, you don't want to overload the grid. The data that we have is basically the same for the battery, the renewables and the loads. In addition, we have the real-time net power of the transformer, so we know how much the transformer is currently taking in and out. And we can then adjust the optimization algorithm with a fake kind of tariff: if we know that we need to change the consumption on the transformer, we can fake how much the electricity would cost, so that the optimization algorithm steers one way or another. And so we keep doing the optimization at each individual island, but we push towards the global optimization, so that the transformer, the grid, stays under control. One additional problem comes with the fact that you have many households in a district, which can each have their own technology, so it's quite complex to control them, to automate them at all. One way, and we're exploring that, is interfacing with more home automation systems, like openHAB or Home Assistant, for instance. Another way is to have a manual impact: what we can do is send personal challenges to every household, where the people can earn points, which basically earn them money if they play nice within the whole ecosystem. Another thing we can do: we also have shared flexible loads. For instance, in a district you can have a shared charging station for the electric vehicles, and then we can control it and, for instance, diminish the available power, so that we also keep the grid under control. So that is the general idea; that is what we are aiming for. There are several pilot projects that are starting to implement this. One of them is with the Nottingham City Council. The idea, it's a smart home, but really it's more of a, well, we could say office complex.
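Before moving on: the "fake tariff" steering described a moment ago can be sketched as a simple price adjustment, where the tariff seen by each island's optimizer is inflated as the transformer approaches its import limit. The shape of the adjustment (a quadratic ramp above 80% utilisation) and all numbers are my assumptions for illustration, not the projects' actual algorithm:

```python
def steering_tariff(base_eur_kwh: float, net_power_kw: float, limit_kw: float) -> float:
    """Inflate the tariff as transformer load nears its capacity.

    net_power_kw > 0 means the district is importing; the closer it
    gets to limit_kw, the more expensive we pretend energy is, so the
    per-island optimizers shift flexible loads away from this moment.
    """
    utilisation = max(0.0, net_power_kw) / limit_kw
    if utilisation < 0.8:            # plenty of headroom: real price
        return base_eur_kwh
    # Above 80% utilisation, ramp the price up steeply (quadratic).
    overshoot = (utilisation - 0.8) / 0.2
    return base_eur_kwh * (1.0 + 4.0 * overshoot ** 2)

print(steering_tariff(0.25, 100, 400))   # 25% load: base price, 0.25
print(steering_tariff(0.25, 400, 400))   # at the limit: 5x the price
```

The point of the design is that each island keeps running its own local cost optimization unchanged; only the price signal it sees is manipulated to achieve the district-wide goal.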
The idea is to control the charging of all the electric vehicles that are used by the City Council at Nottingham. And so what it means is you can control a global static battery plus the charging of all the vehicles to save money. You also want to have your vehicle charged at least to some level, because you want to use it in the end. And you also want to prevent surpassing the power limit that you have for the whole Council site. And so what you see on the right is the dashboard interface that we have in OpenRemote, which can show you the different locations of the vehicles. So we can track that anonymously, but we can track the different vehicles and the global power that is currently used by the charging of these vehicles. If we now move to the smart district, this is a project that is currently starting in Amsterdam, where we have a community of about 500 households that are part of this project. One thing is each household can control their consumption: we interface with the meter, and they can see real-time information about the power they're consuming through the mobile app, so they can adapt their own consumption. We have the challenges that I talked about, so they see how the whole district is doing and they are proposed challenges so that they can play nice within the neighborhood and, by doing so, earn money. And we can also, as we said, if there is really an emergency, control the heavy loads that are shared by the district to make sure that we don't go above the limits of the transformer. So it looks a bit like this, and these are design slides, so there are some inconsistencies in the wording, but globally every participant will see their own consumption with a bit of history on how the district is doing. And the green dots around the indication are a global indication of how the district is doing. So it's really gamification there.
Now, you see that at some point the neighborhood might be reaching the limit, so we are reaching the limit at the transformer level, and so we will propose to the person in each household a challenge saying: well, for the next hour you need to keep your consumption below this level. If the person accepts, then for the duration of the challenge they will see their own consumption, see the limit, how they are doing against it, and how many points they will collect. And they also receive tips, saying: well, if you want to keep your consumption under the limit, maybe charge your car a bit later, or set the temperature a bit lower, something like that. When the challenge is done, they see how many points they have collected, and then they can of course see a summary of all the challenges they have completed, how many points they have earned, etc. This is the view from the manager, so we can see the different meters that are all connected to the system. At this stage, as it's a pilot project, they have 50 meters connected; the project just started, and the target is to have 150 by the end of February. With 150 connected meters, this should be enough to already influence the whole behavior of the district, to really have an impact on how the district behaves and on the load on the transformer. And so this is the dashboard where you see a summary: the small diagram I showed with the consumption and the load on the transformer, how we are doing compared to the peak capacity of the transformer, a historical graph, and things like that. So thank you, these were the two projects that are currently running on energy management at this stage; there have been others. You can find the OpenRemote platform in the GitHub repo; there is also the forum, where the community is active, and other information. Thank you very much.
A journey across the environmental materiality of digital services
Hi. So in this talk, we'd like to take you on a journey across the environmental materiality of digital services. So the speakers in front of you: here is David, my name is Benoit. We are contributors to an NGO called Boavizta that we'll present briefly later. We are also colleagues in a small company called Hubblo, working on ICT and environmental impacts. Regarding Boavizta, the NGO we work for, it is its work that we present to you today. It is an NGO based in France that gathers more than 250 members now: private companies, public organizations, universities, researchers, freelancers and so on. And the goal of the organization is to provide public and open methods, data, tools and knowledge about the environmental impacts of ICT and their assessment. And of course we try to provide useful open source, open data and open science material. Thank you Benoit. So today's objective will be to see how we can get from a digital service to its environmental materiality. Environmental materiality is another way of seeing its environmental impact, and it includes not only its carbon emissions but also all of the other pollution and its usage of renewable and non-renewable resources. To do this we need to follow a process which is called environmental accounting. And at Boavizta we have chosen to do it with an open source approach. What is very difficult when you're doing environmental accounting in the context of ICT is that you must take into account the whole value chain of your digital service, including the end user equipment, the network, the data centers: all of the infrastructure that your service is using. But you also need to take into account another dimension, which is the lifecycle phases. So you don't want to only include the impact of the use phase, but also the impact of manufacturing the equipment that your service is running on, transporting that equipment to its place of usage, using it, and also the end of life of the equipment.
Today we won't be able to dig into all of the dimensions, so you'll see on the slide what we're going to focus on, but Boavizta is working on all of the dimensions here. It's still me. So why have we decided to do open source? We're at FOSDEM, so I think everyone here is convinced that we should do all of the data and development with an open source process. But when we talk about environmental accounting, it's especially important to follow an open approach. First, because we believe it's a democratic necessity. Environmental figures are often used to justify political orientations. For instance, the Green New Deal is full of environmental figures, and we believe that citizens should be able to audit and criticize the figures that are being used to make political orientations. Also, environmental figures and environmental accounting are used to label products and services. I think you might have seen some data centers claiming that they are greener than green. But to say this, you need to rely on environmental figures, and often those claims are not based on open approaches and figures, which is for us a problem, since consumers cannot audit and criticize the figures. There is also a more straightforward argument: today, environmental accounting in the context of ICT is very immature, so the data that we use, the data that we report, are of very bad quality. To illustrate this, we've done some work. We normalized the carbon impact of manufacturing one inch of an LED panel, so an LED screen; this is the carbon footprint of manufacturing one inch. And you see, from the five data sources that we have here, we have a factor of 10 between the lowest impact and the highest impact. We could think that HP has a far more environmentally friendly process than Dell, but this is not the case. At least, we cannot say that this is the explanation for this difference.
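The normalization step described above is simple division; the numbers below are made up purely to illustrate a factor-of-10 spread like the one in the talk, they are not the vendors' actual figures:

```python
# Illustrative only: hypothetical totals, not real per-vendor data.
# Normalizing "kgCO2e to manufacture one screen" into "kgCO2e per inch"
# makes heterogeneous data sources comparable and exposes their spread.
reported = {                       # (kgCO2e for the panel, diagonal inches)
    "source_a": (150.0, 24),
    "source_b": (40.0, 27),
    "source_c": (350.0, 21),
}
per_inch = {src: kg / inches for src, (kg, inches) in reported.items()}
spread = max(per_inch.values()) / min(per_inch.values())
print(per_inch)           # kgCO2e per inch, per source
print(spread)             # ratio between the highest and lowest source
```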
This difference exists because all of those providers are not using the same data sources, the same hypotheses, and the same methods. And because all of those are not open, we are not able to explain to you why there are those differences. So open source should be a way out: if all of those figures were based on open source approaches, we could try to normalize those impacts, compare the providers one against another, and explain why different providers have different impacts. So let's first focus on the energy footprint. I guess the energy footprint is the part of the ICT footprint we mostly think about when we work in ICT; it's easier to get a grasp on it. But as David said, when we look at energy in ICT, it's still only one part of the impact. It's really about the usage phase; it doesn't cover the rest, which can be a far greater impact than just the usage phase. That's also true for data centers. In what I will present to you today, most of the information is accurate for data centers. Some of it may apply to end user equipment, but we didn't include specific information on network equipment, for technical reasons and also because it's hard to get data on that part. So first, a little bit of context regarding data centers. I don't know if you've seen the latest figures from the IEA, the International Energy Agency, which is, let's say, a rather conservative organization so far regarding ICT and its impact figures. But their latest figures are quite enlightening, because we can see that in 2022 we were around 400 terawatt-hours of energy consumed by data centers, which is double what they previously said for 2020, which is a bit strange. And also their projection for 2026, so in two years, says that it will double again, so around 800 terawatt-hours. Part of it is because of AI, but not only, you guessed it.
So this is the context. What we can say here, at least, is that we are really in a hyper-growth trend, not the opposite. It's not what we have seen in some media, like "data center energy consumption is flat"; that's not the case. Then, what's the issue here, actually? What do we want to look at? It's not just about the energy consumption, of course. I won't teach anything to anyone in this room when I say that energy consumption means that at some point we consume oil, gas, coal or other energy sources. This will emit greenhouse gas emissions, of course. But we will also consume water in the process, for instance if we take into account the cooling of the data center. And we will consume minerals and metals and other resources. Not all the resources that we can account for are listed on the drawing, but there are 16 environmental criteria that we take into account in the Boavizta tools. So what do you have at your disposal to work on the energy consumption of your own services? We have talked during the day in this room about perf and PowerTOP. There are other options as well. Of course there are physical measurement devices: smart PDUs, iDRAC or iLO administration cards if you have them on your server, wattmeters in general. This is one way. The other way is software evaluation. Those are the options that I've listed at the top; all of them are open source solutions. If you are, let's say, in a bare metal server context, you might choose PowerAPI, perf, PowerTOP or Scaphandre. If you are more in the development phase of software, you could use PowerJoular. If you are in a Kubernetes context, Kepler or Scaphandre may help you. And if you are in a machine learning context, CodeCarbon could be of good help. These are some examples. What's behind the scenes are actually interfaces that have been mentioned previously in the day: nvidia-smi for getting the energy consumption of GPUs, RAPL for Intel or AMD x86 CPUs.
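As a concrete illustration of the RAPL interface just mentioned, here is a minimal sketch reading the Linux powercap counters; the sysfs paths are the standard ones on Intel machines, and the wraparound handling is the part worth noting:

```python
# Minimal sketch of reading Intel RAPL through the Linux powercap sysfs
# interface. energy_uj is a cumulative microjoule counter that wraps
# around at max_energy_range_uj, so deltas must handle the wrap.
RAPL_DIR = "/sys/class/powercap/intel-rapl:0"   # package 0 on most Intel CPUs

def read_uj(path):
    with open(path) as f:
        return int(f.read())

def energy_delta_uj(before_uj, after_uj, max_range_uj):
    """Microjoules consumed between two samples, robust to one counter wrap."""
    if after_uj >= before_uj:
        return after_uj - before_uj
    return max_range_uj - before_uj + after_uj

# Typical use on a Linux machine (needs read permission on the sysfs files):
#   a = read_uj(f"{RAPL_DIR}/energy_uj"); time.sleep(1)
#   b = read_uj(f"{RAPL_DIR}/energy_uj")
#   m = read_uj(f"{RAPL_DIR}/max_energy_range_uj")
#   watts = energy_delta_uj(a, b, m) / 1e6   # microjoules over 1 s -> watts
```

Note that, as the talk says next, this only covers the CPU package perimeter, not the whole machine.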
And the third approach is modeling. So we could classify the previous ones as measurement; this is more about modeling, and some of those tools also use modeling, so they don't necessarily only use measurement through those interfaces. And the Boavizta API is also part of it, because it does model energy consumption and answers about the carbon intensity of the electricity, if I take the words from the previous presentation. But something we have to make clear is that both hardware and software measurement tools have their limits. If you take the wider purple and pink squares, they represent the perimeter that a physical device will be able to measure. So the whole machine, actually, but you won't be able to zoom in on the footprint of one piece of software or a given component. On the other side, if you look at the yellow and green squares, the smaller squares here, this is the perimeter that RAPL is able to measure: a CPU, an integrated GPU if there is one, memory; and discrete GPUs can be measured with nvidia-smi. In some cases you may have a broader perimeter with RAPL, but this is for recent machines only. So we have an issue here, because we are in a trade-off between completeness of the evaluation on one side, and precision and the ability to zoom in on the footprint of one piece of software on the other. So how could we fix that situation? In Boavizta we are launching a project called Energizta, which is basically collaborative science. This is a collaborative database that we open, and we propose to voluntary organizations and individuals to share, through an open source agent, energy data and data about the hardware of the machine that has been measured. This will help us to do statistics and then, at some point, produce better models that will improve software evaluation of power consumption. Thank you Benoit.
So, from the beginning of the presentation we've told you that the use phase and the energy consumption are not the only things you should take into account when you want to account for the materiality of your service. And this is where the life cycle approach comes in. A life cycle approach will try to take into account all of the phases of the life cycle of your service, but also all, well, most of the impact criteria. So not only carbon footprint, but depletion of minerals and usage of water, for instance. We're going to focus here on how you can identify the environmental impact of manufacturing a server, so it will be mostly in this area, but at Boavizta we try to have a comprehensive approach by identifying the impact of all the phases across the whole value chain. So this is a very, very partial and simplified model of how you can get the environmental materiality of a server for a specific service. The first step that we do when we do environmental accounting is that we try to identify the technical infrastructure that hosts the service. This is often the most difficult part, because, for instance, if you take a function as a service that runs on AWS, it's very hard to know what the specific consumption of resources is and what technical material your function is running on. But we need this data to know and understand which specific components are used and what the impact of those components is that we should allocate to the service. So this is sometimes like archaeology, where we need to dig and make some hypotheses about how we get from a service to its technical layer. But once we have the technical layer, we need to go down to the raw materials, because this is where the impact comes from. So we try to map all the processes that need to be completed to assemble and manufacture a server. In a simplified way, we could say that a server is an assembly of plastic for the casing and packaging, and components.
So CPU, RAM, graphics card and so on. And a component has many processes, but the most impactful process is around the die: the die is the part of the component that is engraved, where you have the semiconductors. And for this, you need metals. And for having the die, you need to engrave a silicon wafer, and as you can see, the process of engraving consumes a lot of water. You also need metals, of course, to produce a silicon wafer. Across all of these processes there is the use of energy, which also uses raw materials, which causes pollution and resource depletion. So of course, each time you want to assess a service, we are not going to draw this map and go all the way to the usage of coal, oil and so on. What we do is factorize the processes and make them easier to access through the different tools we are building at Boavizta. One main tool that we have is the Boavizta API, which is an API that can make a translation between the ICT world, with IT people, and the environmental impacts. So you give the API a technical configuration; it can describe a digital service, an equipment, a component. And the API will give you back environmental impacts, not only on global warming, so not only the carbon footprint, but also, for instance, other impacts such as primary energy, which you should know if you know a little bit about energy, and abiotic depletion potential, which is a criterion that assesses the removal of non-renewable resources; this includes minerals and fossil resources. Around the API we built, well, our architecture is in microservices, so the API is a central microservice, but we have other tools, such as Cloud Scanner, which will scan an AWS account and try to assess, with the API, the impact of that AWS account. And we also have a pedagogical front end, which is called Datavizta, which is based on the API; it's just a nice layer on top of the API for people who don't want to manipulate an API.
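A hedged sketch of what a call to the Boavizta API could look like from Python; the endpoint URL and the payload field names below are assumptions from memory and may not match the current schema, so check the repository linked by the speakers before relying on them:

```python
import json
import urllib.request

# Assumed endpoint and payload shape -- verify against the Boavizta API docs.
API_URL = "https://api.boavizta.org/v1/server/"

payload = {
    "configuration": {
        "cpu": {"units": 1, "core_units": 8},
        "ram": [{"units": 2, "capacity": 32}],   # two sticks of 32 GB
    },
    "usage": {"usage_location": "FRA"},          # drives the electricity mix
}

def build_request(url, body):
    """Prepare a JSON POST request (not sent here, to stay offline)."""
    data = json.dumps(body).encode()
    return urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})

req = build_request(API_URL, payload)
# impacts = json.load(urllib.request.urlopen(req))  # network call, not run here
# the response would contain entries such as GWP (carbon), PE (primary
# energy) and ADP (abiotic depletion), per the talk
```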
So for instance, here is a way to assess the impact of a server. You see you can configure a server; for instance, let's say that I have one CPU. Demo effect there, okay. I can also change the location where I use the server; this will change the carbon footprint of the electricity where the server is running. So I invite you to play with this tool and see a little bit what the main cause of the impact is, both from the manufacturing and the use phase. And the manufacturing impact, you can have it by component, so it's also interesting to see which component is the most impactful. There are also other features, which are in the API as well: you can assess the impact of your cloud usage, for instance, or of end user devices, but we haven't introduced those during the talk. Yeah. For the API, you can scan the QR code and this will get you to the repository of the Boavizta API. We wanted to open up this talk. So we've begun by talking about energy, then we took a broader approach with the life cycle assessment approach, and we wanted to finish with an even more systemic approach, which I call the systemic footprint, but it could also be called a consequential approach. Yeah. From the beginning of the presentation we've talked about the direct impacts of a digital service, meaning the impacts of the value chain of the service. But sometimes the most impactful part of a digital service is not its direct footprint, but the indirect environmental externalities that are brought by the fact of deploying your service. You're building your service for some usages, and you need to be careful about why your service is used and how your service is used, because this service might be used to do environmental harm.
So when you want to understand the consequences of launching your service, you need to take another approach, which is a causal approach: trying to map the different causes and consequences that follow the introduction of your software. For instance, if you take a cloud provider: clouds are known to be often more mutualized and more optimized in terms of energy usage and carbon footprint. But since the cloud is very easily accessible, we are consuming way more compute resources than we did before. This is what we call the rebound effect. And this is something that we cannot get from a basic life cycle analysis; we need a more systemic approach to understand all of those social transformations that are brought by ICT. And I think we're done. Thank you for your attention. We have some minutes left for questions. Yes, it was very interesting. But the problem is that everybody must know this kind of thing, in relation to climate, the environment and so on. And there are no studies from Greenpeace about this kind of thing, about energy providers. In Belgium, this kind of thing is very difficult, because, well, I know about Amazon Web Services, and this kind of thing is very important: their data centers, how they take up energy, their harmful effect on the river or something like that. And all this kind of thing for the construction of a computer and so on. I would like to have a Greenpeace barometer of this kind of thing everywhere, because it's very important for our future; also when they dissipate energy in a river and so on. So your remark is about awareness, I think. I think there is no report from Greenpeace, but there is a report from WWF at least.
And I think the main purpose of Boavizta and the tools that we're building is not efficiency, but more making people aware of those problems and taking action, because, and I think I can talk for both of us, we think that having more IT people engaged is one way to fight against the impact of IT. Hello, thank you very much for this. When you were presenting the server impact thing, I have a technical question. There was a discussion about joules and primary energy, as opposed to something that we might use like kilowatt-hours, which is quite common. Could you maybe talk a little bit about why you chose that rather than a figure that we see used in lots of other places? Because that is something that I found a little bit difficult to understand when I first looked at it. So, primary energy versus secondary energy: if you could explain some of that, and explain the decision to express it in joules instead of watt-hours, for example. Yeah. You want to answer? Why do we express primary energy in joules? What I can say, but I don't know if it's an accurate answer: in practical terms, joules are often used for very precise measurement purposes. Most of the time, when we talk about big figures, we are more about watt-hours, kilowatt-hours, megawatt-hours and so on. Watt is power, so it's not expressed over a timeframe; that has been said in a previous talk. I don't know if that clarifies, or... Yeah. Oh, okay, actually, I understand the confusion. Primary energy is an impact criterion. Secondary energy is a flow, so it's not considered a final impact. If you see here, we can model the secondary energy, the power usage here, in watts, and we use it to compute the usage impacts for the different impact criteria. Primary energy is: how do you deplete the earth of primary energy? Does that answer it? Maybe we have time for one more. Maybe you can do both.
So the question is: because some countries now don't want any more of the rubbish servers from our countries, did the data centers change their policy in terms of management, for example for the storage systems? At Google, they used to break the hardware into small pieces, not even recycling them at all. And have there been changes recently in spare parts management, because of the fact that countries don't want the recycling done in offshore countries any more? Actually, that's a very complicated topic.
Power profiling my entire house with the Firefox Profiler
Thanks for coming so late. I'm Florian Quèze, I work for Mozilla as a performance engineer. You might have been here last year when I was talking about the work I do. As a performance engineer, my work is to understand how much power Firefox uses and what we can do to reduce it. So I was explaining last year how we developed power profiling tooling; that was the cover slide. For example, I was explaining that we have power profiling tools that let us understand how much power is used by things as small as just blinking the cursor in the address bar. This is what I was presenting last year, and if you want to hear more on this topic, I will be doing a similar presentation, updated and extended, tomorrow in the main track. So today I will be sharing a different story. It will be more of a story, actually, because it's late and I want this presentation to be easy to follow, maybe a bit entertaining if I can. So first, a story about why I worked on power profiling the entire house, then technical details, and then lots of examples, because those are the most interesting. So, the story. It starts in February, and then in April we had a new member in our family that I was very happy to welcome and who completely changed our life, of course. Two days before she was born, I installed this on the wall. It's solar panels; it's not obvious from the picture. One of the reasons why I installed them is that I wanted most of the energy she uses to be renewable. I had tried before to have solar panels on the roof of our house, and it turned out to be extremely difficult, which means we failed to get them. The reasons were mostly that there were chimneys on the south side of the house that were making massive shadows on the roof, and lots of other issues with the roof. Basically, all the companies who came never gave us a quote, so we couldn't get panels on the roof. So I installed this, and I was wondering: can this power the bottle warmer that we will use for the milk we give to the baby? I work from home.
I work on energy efficiency all the time: will this power my home office? So I had questions, and I wanted answers. How could I answer those questions? I installed the power meter that you see here inside the electric switchboard of the house. It's communicating over Wi-Fi, and it's measuring three different things. The link with the grid, so seeing if we are importing or exporting electricity. It's measuring specifically the solar panels I had put on the wall. And it's also measuring my home office, so that I could answer the questions. Of course, I very quickly came up with more questions. I was also wondering about the washing machine, the freezer, and a few other things in the house. So this is what the thing quickly looked like: a bunch of things in here. I made the thing in the first place, so I could make a mess of it if I wanted. So now we are also measuring the link to upstairs, because there's a second panel upstairs; the freezer, the boiler, the washing machine, those kinds of things. And also, I needed to answer the questions, so we put a smart plug on the bottle warmer to be able to figure out what was going on there. So now let's go into technical details. What am I doing with all this? How can I get relevant information? First, I need to collect and store the data. I have a constraint: nothing in the cloud, because it's very personal, sensitive data. All the power meters are connected through Wi-Fi, but with parental controls, they have no internet access. They all send data through MQTT, one piece of data every second. And there's an Ubuntu virtual machine somewhere in the house that hosts an MQTT server and, with trivial scripts, logs everything to disk. So that part is pretty simple. Then, second part, I need to visualize the data, because if I just have massive log files, I can do nothing with them. And this is where the Firefox Profiler part comes in, a tool I was very familiar with because of the power profiling work I did the previous year.
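The "trivial scripts" part can be sketched like this; the topic names, log directory and broker hostname are made up, and the MQTT client itself (for example paho-mqtt) is only shown in comments:

```python
import json
import time

# Sketch of a one-message-per-second MQTT logger: every sample becomes one
# timestamped JSON line in a daily log file. The MQTT client library would
# call log_sample() from its on_message hook.

def format_sample(topic, payload, ts):
    """One JSON line per sample: easy to grep, easy to parse back."""
    return json.dumps({"t": round(ts, 3), "topic": topic, "w": float(payload)})

def log_sample(topic, payload, ts=None, log_dir="/var/log/power"):
    ts = time.time() if ts is None else ts
    day = time.strftime("%Y-%m-%d", time.localtime(ts))
    with open(f"{log_dir}/{day}.jsonl", "a") as f:       # one file per day
        f.write(format_sample(topic, payload, ts) + "\n")

# With paho-mqtt the wiring would look roughly like:
#   client.on_message = lambda c, u, msg: log_sample(msg.topic, msg.payload)
#   client.connect("ubuntu-vm.lan", 1883)
#   client.loop_forever()
```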
I have on the Ubuntu virtual machine a trivial script that converts the data from the files on disk to a JSON file the profiler can understand. And the profiles contain mainly two things: power counters and markers. So this is what it looks like. If you're not familiar with the profiler UI, you might not be, I will explain very briefly. There's a time axis here. The top part here is what we call the timeline; everything is against time. The various things I said I'm metering, you can see them here; you see the shape of the chart for each of them. And the markers are here, and they can give us more specific details about specific things that the script thought were interesting. And you can see here, so Beem is the brand of the wall panels, you can see that it typically produces more in the middle of the day, and that when it's cloudy it's less interesting. Many other things; we will go into more detail later about what we see there. So one thing I wanted to mention here was the date, which is the most important, sorry, the date is the most important thing here: we were three weeks in after we got the baby. And this is what I spent most of my days doing, and actually most of the nights too. And here's how this worked. Usually when people get a baby, they say they have no time left. I actually had the exact opposite: I ended up suddenly having plenty of spare time at night, because she was waking up so often that we couldn't sleep, so we were taking turns, and half of the night I was up. She would wake up, want to have some milk, and then sleep a few minutes later. So I had plenty of hacking sessions that were somewhere between 10 minutes and three hours, unpredictable. But I had multiple weeks of having those sessions at night, which is why the code is maybe a bit messy, because I had to do it in small chunks. But it worked pretty well. Otherwise I would have had no time to do any of this hobby project on the side.
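The conversion script's job can be sketched like this; to be clear, this is not the real Firefox Profiler JSON schema (which has more required fields), just the shape of the transformation from per-second samples to counters plus markers:

```python
import json

# Illustrative only: turn per-second (timestamp, watts) samples into
# per-meter time/value arrays plus a list of markers. The real profiler
# format is documented in the Firefox Profiler repository.

def to_profile(samples_by_meter, markers, t0):
    profile = {"meta": {"startTime": t0, "interval": 1000},
               "counters": [], "markers": markers}
    for meter, samples in samples_by_meter.items():
        profile["counters"].append({
            "name": meter,
            "time": [round((t - t0) * 1000) for t, _ in samples],  # ms
            "watts": [w for _, w in samples],
        })
    return json.dumps(profile)

raw = {"solar": [(100.0, 250.0), (101.0, 260.0)],
       "office": [(100.0, 80.0), (101.0, 82.0)]}
out = json.loads(to_profile(raw, [{"name": "washing started", "time": 0}], 100.0))
```

Serving the resulting file over plain HTTP on the LAN is then enough, since the profiler UI can load a profile from a URL.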
Also, the generous parental leave at Mozilla helped a lot, because it meant I had lots of those weeks where I could stay up at night and do those kinds of things. And then, more seriously, generating a JSON file that the profiler can understand was really simple. Maybe because I work with the profiler a lot, but still, I think most people could get it done and get something that works relatively quickly. And also, I don't have to host any web UI or anything, because I can just generate URLs like this, with the URL of where I generate the JSON file, and that's all I have to handle; I don't have to take care of anything in the UI. Then there's the stuff that didn't work as well. The profiler was made to profile Firefox: typically we were having profiling sessions over a few seconds. I accidentally had profiles that were an entire day. So stuff didn't work so well in terms of units, for example. So I did put up some pull requests to add minutes, and then hours, and then, a few weeks later, days also. Changing the units: if you remember the screenshot I showed of profiling the cursor blinking in the address bar, we were talking about milliwatt-hours, microwatt-hours. Here I wanted to see kilowatt-hours, because numbers with many zeros were not so fun. Performance also: showing a profile that contains data for an entire day was not that bad, but it took maybe five seconds to display; I fixed it. And another thing that was a lot more important when profiling the house, and that is almost irrelevant when profiling Firefox, is knowing when something happened. In Firefox, typically we want to know how long something took. Here I mostly wanted to know at which time of the day something happened, when we were starting to consume more power. So I also had to tweak that a little bit. It's also nicer for the Firefox use case, but it was a lot more important for profiling the house. Colors: that was just nicer.
Everything was gray in terms of power tracks in Firefox, because there were only a few tracks. Now let's go into examples. Doing laundry: washing machine, dryer. So, washing: it consumes a lot of power twice, and this is most likely when heating the water. I also wondered why it's doing it twice here; I think I saw it doing it only once a few times, so it depends on the program. Actually, I would like to profile the various programs. And if we zoom into this part that looks interesting, but that we don't see well because of a big thing here, we see there are lots of patterns that are probably good enough to figure out what the machine was actually doing. And then the dryer: it turns out it uses less power than the washing, even though it takes longer. This is probably because we took the most efficient dryer we could find, with a heat pump. I also profiled my mother's dryer, and it uses seven times more power than mine. A typical day at the office, home office. And this is why I don't want this data to be in the cloud, and I don't want my manager to have access to this data: you can say exactly at which second I returned to my desk, throughout the entire day. And you can see that there are typical days like this, with small breaks in the middle; you can see the shape here is different. And then there are days like this one. The main difference here: when you see that it's high first and then decreases, it means my laptop battery was not full, so I probably worked from somewhere else than my office. So here, here and here, I clearly worked somewhere else than my office. And the last one is a Sunday. On Sunday, the only thing that remains powered on is the modem, which is also useful for Wi-Fi and the rest of the house. But maybe before working, I should have started with breakfast. So, this is a microwave oven from the 90s, inherited from my grandmother.
And two things we typically do in the morning are unfreezing bread and heating milk. And I was surprised by the patterns there. The surprise is, I was thinking that in the unfreezing mode it would use significantly less power, and that's actually correct. But the problem is, it's heating at maximum power for a few seconds, then nothing for a little while: every 30 seconds, it's heating for seven seconds. Which means that if I'm hoping to use solar panels, and it's in the morning and they are not at their peak production, I'm basically buying all that power from the grid, even though the average power is only 300 watts. And that's the kind of stuff we see when power profiling with a high sampling rate, but that I would not see if I was looking at the data every hour. And heating milk is what you would expect: almost a rectangle. So now, time for a quiz to ensure you are still awake. In your opinion, what uses the most power here? Is it the massive chest freezer we've got that's full of milk? Is it the internet modem? Who thinks the freezer? Raise your hand. Who thinks the modem? So let's profile it to figure it out. So, of course, very different shapes. The modem is using the same amount of power almost the entire day, with very tiny variations. And the freezer: there's a spike at the beginning for a few seconds, then it's stable for a few minutes, then stops entirely, and then starts again. The modem: 27 watts all day long. It also runs the virtual machine that does all of our power profiling. So the answer is: you are all right. They used exactly the same amount of power during the entire day. So back to the initial question about warming the milk for the baby. So there's this milk pump, and then there's the bottle warmer. How much does each of them consume? You can just see the numbers; I don't think I'm going to read them out loud.
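The microwave observation above is simple duty-cycle arithmetic: seven seconds of full power every 30 seconds gives a low average even though the grid has to supply the full burst. A small check (the ~1286 W peak is an assumption chosen to match the ~300 W average mentioned in the talk):

```python
def average_power(peak_watts: float, on_seconds: float, period_seconds: float) -> float:
    """Average power of an on/off appliance over one duty cycle."""
    return peak_watts * on_seconds / period_seconds

# Defrost mode: full power for 7 s out of every 30 s.
peak = 1286  # assumed magnetron draw in watts
avg = average_power(peak, on_seconds=7, period_seconds=30)
print(round(avg))  # roughly 300 W on average, but the grid supplies ~1300 W bursts
```

This is exactly why a high sampling rate matters: an hourly average would show the harmless 300 W and hide the bursts that overwhelm morning solar production.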
Something that we quickly realized when looking at those profiles, and that was interesting, is that we can see the timing, same as figuring out when I'm working or when I'm not working. I'm not sure if you have had a baby recently and had this experience, but you have lots of constraints about how long you can keep things. So milk that has just been pumped and kept at room temperature you can use for four hours. If it has gone in the fridge and you are heating it, you can use it for two hours. So to be able to know if the bottle of milk in front of you is usable, when suddenly the baby wakes up and you don't know when it was prepared last time because you were not in charge that time, usually it's a mess. And we can make use of this data, and we did. That's actually what we used the power metering data the most for: figuring out if the bottle of milk in front of us is usable. And the reason why we could figure this out is only because we could see on the chart that it's actually very easy to detect the pattern. So, time for a summer break. We visited my parents, and they recently had those nice solar panels installed on the roof of their kitchen, and it came with a gateway that's sending the data to the manufacturer of the gateway, who's collecting a lot of data. I'm not too happy about that, but it was not my decision. So it's sending one data point every 15 minutes, which is good enough to figure out how much electricity was imported or exported on that day, and to use this to figure out what you're actually doing with your electricity. And I noticed, during one night of taking care of the baby, that we can actually get one data point every second if we query a local HTTP API. So I did: I put a Raspberry Pi in there. Of course we can get profiles. So now let's see what they look like. That's what I saw at my parents' house, and one thing quickly caught my attention. It's a three-phase system, because of a large heat pump; I will go into it later.
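Polling the gateway's local HTTP API at one sample per second, as described, can be sketched like this. The endpoint path and JSON shape are assumptions modeled on Enphase-style local APIs, not something taken from the talk:

```python
import json
import time
import urllib.request

GATEWAY_URL = "http://envoy.local/production.json"  # hypothetical local endpoint

def current_watts(payload: dict) -> float:
    """Extract the instantaneous production reading from a gateway payload.

    The 'production'/'wNow' layout is an assumption about the JSON shape.
    """
    return float(payload["production"][0]["wNow"])

def poll(seconds: int) -> list[float]:
    """Collect one sample per second, like the Raspberry Pi in the talk."""
    samples = []
    for _ in range(seconds):
        with urllib.request.urlopen(GATEWAY_URL, timeout=2) as resp:
            samples.append(current_watts(json.load(resp)))
        time.sleep(1)
    return samples

# Offline check of the parsing helper, with a fabricated payload:
sample_payload = {"production": [{"wNow": 1234.5}]}
print(current_watts(sample_payload))
```

From there the samples just need to be written into the profiler's JSON format, as discussed earlier in the talk.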
This thing looks strange. There's high power use here, and it's throughout the day. And the only thing that could be using that much power is this thing, the water heater. And it's supposed to be using power during off-peak hours, because the price of electricity is not the same in France at night and during the day. After investigating a little bit, we realized that there was this switch here that was in the wrong position, forcing the thing to be on all the time. So it was powering on whenever someone was using water. We changed the switch, and now it's heating only around midnight, and then a little bit around 7 a.m., and then it stops for the rest of the day. And that probably saved quite a bit of money. I said there's a large heat pump, so now we are no longer in the summer. I forgot to say something: the heat pump here also has a large accumulator. And when we look at the power use pattern, we see the heat pump that's pumping and using a lot of power on all three phases six times a day. And then there's the circulator here that's running throughout the day. So we actually can understand how things work, and we can also see how the power from the solar panels was used. Back at home, some magic happened. I said we couldn't have solar panels on our roof, but we had a baby, which means that we returned home, and after returning home there was a midwife who came to visit to check everything was right. And on the car that she used to visit us there were ads for a company putting solar panels on roofs, owned by her husband, who's very proud of figuring out solutions for all the desperate cases where nothing seems possible, and who came and gave us a quote that was very reasonable. And a couple of months later... the baby solved all the problems that we were not able to solve for two years. So now we have real solar panels on the roof, but that's enough about this part of the story. Fast forward to December, and it's time for another baby picture. She's grown quite a bit. She's really into trees.
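The misconfigured water heater above is easy to spot programmatically once per-hour energy is available: sum consumption inside and outside the off-peak window and flag a device drawing mostly at peak rates. A sketch, where the 23:00-07:00 off-peak window is an assumption (French tariffs vary by contract):

```python
def peak_share(hourly_watt_hours: list[float], off_peak_hours: set[int]) -> float:
    """Fraction of a device's daily energy consumed outside off-peak hours.

    hourly_watt_hours has 24 entries, indexed by hour of day.
    """
    total = sum(hourly_watt_hours)
    peak = sum(wh for hour, wh in enumerate(hourly_watt_hours)
               if hour not in off_peak_hours)
    return peak / total if total else 0.0

# Assumed off-peak window: 23:00-07:00.
OFF_PEAK = {23, 0, 1, 2, 3, 4, 5, 6}

# Misconfigured heater: drawing the same power at all hours of the day.
always_on = [100.0] * 24
print(round(peak_share(always_on, OFF_PEAK), 2))  # 0.67: two thirds of the energy at peak rates
```

A healthy heater would score near zero here; anything close to the 16/24 ratio above means the switch is in the wrong position.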
Whenever she's crying and we don't know why, we show her a tree and she's super happy. So we had to get her a nice Christmas tree for her first Christmas. And it's time for another quiz. Of what you see in this picture, what's using the most power? So obviously there's the Christmas tree here; the Christmas tree turns itself on at sunset and turns off at midnight. Then, you might not have seen it, but we have the solar panels here, and they produce power during the day, and use power during the night, for some reason. So what's using the most power, in your opinion? Who thinks the Christmas tree? Who thinks the solar panels? Okay, let's profile it. So the Christmas tree uses 10 watts for a few hours here, and the solar panels about five during the end of the day and the beginning of the next day. And if we look at the numbers: Christmas tree, 64; solar panels at night, 67. That was a surprise to me, but yeah, you couldn't be surprised twice by my quiz, I guess. But they did produce a lot more power than that, so it's still worth having them. And I think we still have a minute or two, so I have a few more things I can share. I have more power meters that are more fun, and the interesting thing about this one is that it can give me data at something like a 50 hertz rate, which is the frequency of the oscillating AC power. And I forgot this profile at home on a computer that's not connected to the internet, but the profile was fun, because we can see what happens whenever the rotation direction changes: there's a break in power use for a few milliseconds, and then it uses more power when the motor restarts. So all those details we can see and expose with fast sampling and power profiling, and it's pretty nice to see. And then USB power meters: those are interesting if you want to look at the energy used by any random USB thing, or anything that charges through USB.
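The Christmas-tree quiz is just energy = power × time: 10 W between sunset and midnight versus roughly 5 W drawn all night by the panels' electronics. A small check of the numbers; the 6.4 h and 13.4 h durations are assumptions chosen to reproduce the 64 and 67 figures from the talk:

```python
def watt_hours(watts: float, hours: float) -> float:
    """Energy in watt-hours for a constant draw over a duration."""
    return watts * hours

tree = watt_hours(10, 6.4)         # lights on from sunset to midnight
panels_idle = watt_hours(5, 13.4)  # panel electronics drawing ~5 W all night
print(round(tree), round(panels_idle))
```

So the always-on trickle narrowly beats the festive lights, even though the panels produce far more than they idle away.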
And there are quite a few in this picture; all of those were reverse engineered to make them compatible with the profiler, and that's part of the topic of another talk that I will be giving tomorrow, but this is kind of how I worked with those: reverse engineering a bit, and then putting a known load here, a USB light, where I knew what the result should look like. The code is in here if you want to play with it. And I will explain why this is useful for profiling Firefox on Android, and even Firefox on laptops, tomorrow in the main track. Now let's see the things that were not working so well, or that I still need to look into. All the profiles I shared were looking good; I selected them. Some don't look that good. So this is a profile of the boiler. I said we profile the boiler; it's a gas boiler, so electricity is not most of the energy it uses, but still, during winter it uses a lot of electricity just to circulate water, so that the hot water it's producing goes through the house. And then the Wi-Fi is not so good. It's especially terrible in our house, because there's a lot of concrete with metal in it almost everywhere; despite putting in multiple repeaters, it's still not so great. And some days I still have missing data like this, and profiles that are almost garbage. And it could lead to incorrect conclusions, because the shape here is just clearly wrong. So if we can, a wired network is probably better. It's not really possible to put those wires exactly everywhere, though, like on smart sockets or things like that. I think the best solution, if I have time, would be to change the firmware in those devices for an open source one, and ensure that they store the data until they receive an ack from the server that the data has been received, and include timestamps in the data. So probably a project for the next time I have many nights without sleep. I would really like to clean up this code so that all of you could play with it easily. It's not very complicated, but if we don't all have to duplicate it, that's much better.
So the code for power profiling with USB meters I cleaned up enough, because it was part of my work, and I put it in an easily accessible repository. The code to make nice profiles from Enphase gateways I would like to do soon. And the rest is a bit of a mess, because it's a mix of my code and configuration data in the same files; you know, ten-minute hacking sessions. And I would also like to blog some of our profiles of appliances and devices that I tested, because I think there are quite a few surprises we could have when looking at devices; some don't really behave like we would expect. And as a conclusion, I would say sampling at a high rate is useful to understand how things work, just because we are often curious. I definitely am. It's also useful to find and fix bugs, like the water heater thing at my parents' that was wasting a lot of power and costing money. And if we want to optimize consumption of the power generated by photovoltaic panels, it's better to have an idea of how much we will consume. Especially unfreezing bread, like I was sharing, is probably not a good candidate for using energy from solar panels. And that's all I wanted to share for today. Thanks for your attention. Could you match the power used by your workstation with the solar panels in the end? Oh, I forgot to say, but I could totally use the power from the solar panels for my home office, because it was clearly enough and I'm mostly working during the day. And I could actually decide that when we have a lot of power from the solar panels, maybe it's time to compile Firefox, which will use a lot more power. But actually, the one thing that uses the most power, as we have seen in my profiles from the home office, is whenever I decided to use the computer without being plugged in and then plugged it back in, because then it charges, and that's where the power use is the biggest. The other thing that contributes a lot to the power use of my office is screens.
I have two external screens, and surprisingly the 27-inch screen and the 20-inch screen have almost the same power use. So if I used only one, I could turn off the second one, and that would also save significant power. Profiling your stuff is often called NILM, non-intrusive load monitoring, so if you go and look that up, there are databases you can contribute to. For Enphase, be careful if you're running on version three and you're using production.json: it all goes away, it's all behind a paywall, horrible, don't upgrade. And things like microwaves, yes, are just on/off, so those are hard to do, so you should run them when it's sunny. And washing machines: normally a washing machine is heating the water at the beginning and then that's it, you know, there's just mechanical effort, which you could see on yours. Dishwashers usually heat at least twice, because you get the main wash and then a hot rinse. So a washing machine with two is weird. So I'm not sure there was a question in this or if it was just comments, but about the versioning of the Enphase gateway: the Enphase gateway we've got at home is not collecting data about our power use, so I put my own power meter behind it, and the reported data about how much power is used by the Enphase system at night is dramatically different between my parents' profile and mine. Because my parents' profile is the data reported by the Enphase gateway, and it's counting only the power used by the micro-inverters that are on the panels, and it's around one watt; mine is also counting the power used by the gateway itself, and then we are around five. So time's up, thank you so much. And you can see the presentation tomorrow if you want more details about Firefox power profiling. Thank you so much.
Closing Energy: Reimagining this Ecosystem through Open Source devroom
Just a few words. Yeah, we'd like to take just a couple of moments to close off the devroom. Thank you all for being here. This was our second year having the energy devroom. We started, well, we started with half a day last year, and we extended it this year. Yeah, you were attending the virtual room last year. And we kind of had a feeling that there was going to be more attendance if we had a full day, and I think this demonstrated it pretty well: from the beginning in the morning until basically now, only a few sessions were maybe not completely full. So thank you very much for sticking around, and I hope to see you next year. And maybe if you want to volunteer, you know, please send us an email or hit us up somewhere else.
Property based testing in Elixir
That's not helpful at all. Okay. Now it turned from yellow-greenish to green-greenish. Okay. Okay, cool. So let's write a unit test for a very simple use case, in which we want to add two numbers together. It would look something like this. When I write tests, I usually try to come up with at least three cases: a positive one, which tests the happy path; one that actually tests the opposite; and then I try to find or think of edge cases in which my software could actually fail. So this is such an example, in which we assert that two plus two is four, that two plus two is not equal to five, and we also try to find some edge cases, like: if one combines types or does some other funky stuff, my software still works. So if you look at that example, you can understand why I think writing tests can be pretty boring. So that's my first conclusion: testing can be boring. Then let's look at another aspect of writing unit tests: what if our software project grows? If we have n features, then we have some linear amount of tests accompanying them. But what if we then start to combine features? So function A and function B: we have to test combinations, pairs of those functions, as well. Then the amount of tests grows quadratically. And if we go further and combine even more features, at a certain point that growth makes it really hard to scale, to go further. So testing, I think, can be hard, at least if you want to do it properly. If you really want to have confidence in your code, you want to have as many cases as possible covered in those tests. If you approach it that way, then testing can be hard. So how can we fix this? Well, some people came up with property-based testing. A summary of it: instead of us humans writing examples, let's define properties of our code and let the computer come up with the cases. That came from the folks at the company called Quviq.
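The three-case style described above can be sketched like this. The talk's examples are in Elixir with ExUnit; this is just the shape of the idea, in Python, with an assumed trivial `add` function:

```python
def add(a, b):
    """Trivial function under test."""
    return a + b

# Happy path: the expected result holds.
assert add(2, 2) == 4
# Opposite: an obviously wrong result is rejected.
assert add(2, 2) != 5
# Edge cases a human thought of: mixing types that still support "+".
assert add(2.0, 2) == 4.0
assert add("2", "2") == "22"  # string concatenation, not arithmetic
print("all example-based checks passed")
```

Every one of these cases was hand-picked by a human, which is precisely the limitation property-based testing addresses next.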
They came up with this idea around 2000, and they built a project, QuickCheck, and the company around those ideas. They've also added some more features to it since. But the general idea of property-based testing is that we define properties instead of examples for our tests. So let's have a look and compare how we could do that. Let's say that we write a test for string reversal: we take some string, and we have a function that reverses the order of its characters. What would a unit test for such a case look like? Something like this. So, a raise of hands: if you write tests like this, who feels confident that these tests are actually covering all cases of our function? No hands raised; nobody feels confident. One, maybe. Yeah, everybody is feeling anxious, right? You're not fully convinced about these tests. You could probably write them in a different way. But if you would translate these things into properties — so let's take a pause and think: if you would try to express that behavior, that functionality, in properties, how would you do it? Strings that contain numbers, special characters and so on? You would come up with examples of special characters, numbers, those kinds of things. So examples, right? Basically, examples of edge cases, like weird input. But that's not how you would define your software as a property. Those are again examples, clear use cases, but they're not properties of our code. One property is that the length of the input string is the same as the length of the string that you get out. Yeah, that's a good one. So if we reverse the string, the length of the string should stay the same. That's a property, right? And still readable. Another one would be: if we reverse the string twice, then we should get back the original one. And this is how you would write it down in a property-based test: we define a property "reversing a string twice returns the original".
And on this second line, we actually tell the library: from all the possible string inputs, come up with any string; if we reverse that string twice, it should come back as the original, right? And if we run this, the library will generate about 100 cases for us, and in doing so try to prove that this property holds for our code. So, other examples: if we reverse a list, then the first item becomes the last one and the last item becomes the first one. If we have a palindrome and we reverse it, it stays the same; palindromes are strings which, when reversed, return the same string. And, like you said, the number of items: this applies to any kind of list or string that we're reversing. If we reverse it, the number of items stays consistent; it's not like some things disappear magically. And the funny thing is, if we try to write a property again — I don't know if anybody noticed, but in the previous example I specified that I want to generate examples of strings which only contain ASCII characters. But if we do the funky-characters part, so we say: well, generate any string from the UTF-8 set, what will our library actually tell us when we run that? It finds an edge case. There are Unicode characters that apply to the previous character, so when we reverse them, you don't get back the original anymore. And these are the kind of edge cases which we as humans probably couldn't come up with. Well, you do know that they exist, but if I asked you now, within five or ten minutes, to actually write this example, you wouldn't be able to do that. And it normally runs about 100 cases, but even after eight cases, it found this example. So that's great: it found an edge case.
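The two properties above (length is preserved; reversing twice restores the original) can be demonstrated with a hand-rolled generator using only the standard library; real tools like StreamData or Hypothesis do the generation, reporting and shrinking for you. Note that Python's `s[::-1]` reverses code points, so the combining-character surprise from the talk is specific to grapheme-aware reversal like Elixir's; these properties hold here:

```python
import random
import string

def random_ascii_string(rng: random.Random, max_len: int = 20) -> str:
    """Generate a random printable ASCII string, like a simple generator."""
    length = rng.randrange(max_len + 1)
    return "".join(rng.choice(string.printable) for _ in range(length))

def check_property(prop, cases: int = 100, seed: int = 0) -> None:
    """Run a property against many generated inputs, like a tiny QuickCheck."""
    rng = random.Random(seed)
    for _ in range(cases):
        s = random_ascii_string(rng)
        assert prop(s), f"property failed for {s!r}"

# Property 1: reversal preserves length.
check_property(lambda s: len(s[::-1]) == len(s))
# Property 2: reversing twice returns the original.
check_property(lambda s: s[::-1][::-1] == s)
print("both properties held for 100 generated strings each")
```

The seed makes the run reproducible, which real property-based testing libraries also do so that a found counterexample can be replayed.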
And the other thing, not shown in the example: if you write a property and the tool finds a case for which the test fails, for which the property fails, property-based testing tools are also able to shrink the case. It does something like a binary search: if I have a list of numbers and my test fails, then it tries half of the items from that list, and if it still fails, it goes on and on until it finds the minimal input under which our property doesn't hold anymore. So let's talk about some use cases. Where has this kind of tooling been used? Volvo, at a certain point, wanted third-party parts that could be replaced by other companies, so they came up with specifications for how these components should interact with one another. They wrote a specification about 3,000 pages long. They had about six vendors come in to test their specification; combined, they had a million lines of code written. And when they used property-based testing to actually test these six vendors' implementations of the specification, they found about 200 issues. About 100 of them were actually in the specification itself, and about 100 were in the combination of those parts. Because a car consists of several parts: it could take component A from vendor A and some other component from another vendor, and they had tested the components in isolation but never together. So the combination of these components actually yielded some errors as well. Klarna is a financial system, and at a certain point they had a problem which occurred only once every several weeks, and they had kind of a hint, because it came up when generating files that were over 1 gigabyte big. They spent six weeks full time investigating this issue, and they couldn't find the source. They could stumble upon it, they could in some cases trigger it, but it was practically impossible to find out how and where it came from.
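Shrinking, as described above, repeatedly tries smaller versions of a failing input until no smaller one still fails. A crude list shrinker, much simpler than what real tools do but showing the halving idea:

```python
def shrink_list(failing, still_fails):
    """Greedily remove chunks from a failing input while it keeps failing.

    `still_fails(candidate)` returns True if the property still breaks.
    Chunk size starts at half the list and halves each round.
    """
    current = list(failing)
    chunk = len(current) // 2
    while chunk > 0:
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if still_fails(candidate):
                current = candidate  # keep the smaller failing input
            else:
                i += chunk           # this chunk is needed; skip past it
        chunk //= 2
    return current

# Toy property: "the list contains no number greater than 100" fails here.
failing_input = [3, 7, 250, 12, 99, 250]
minimal = shrink_list(failing_input, lambda xs: any(x > 100 for x in xs))
print(minimal)  # the minimal counterexample: a single element > 100
```

The diagnosis benefit mentioned in the talk comes exactly from this: instead of a six-element list you get one offending value to stare at.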
And it took them less than three days in total to come up with a model and write the properties, and less than a day of running the properties until they actually stumbled upon the race condition in which that error occurred. So those are two kind of big examples. What are the other occasions in which we could use property-based testing? One obvious one is if we have symmetrical functions: if you serialize and deserialize something, those are opposite functions, and you could easily property-based test them. It's also good if you have functionality that needs some kind of mathematical proof, and for comparing systems. In one case, I had to rewrite a system in another language, and then it's nice to have the old system and the new system and test them against one another. And I haven't really mentioned it during this talk, but the tool that Quviq has built also has special support for testing concurrency. So if you have a system like the Klarna financial system, you want to test what happens if five people do transactions simultaneously. So, conclusion. Property-based testing can generate all kinds of test cases for us, very often also edge cases that we as humans don't think about, because it tries the spectrum of all the inputs you specified and finds very weird items. Because of the shrinking, it also helps narrow down and diagnose what the actual culprit of the error is. It helps reduce complexity. And like I said, because you have to think about properties instead of examples, it actually makes you think differently about your tests; it makes you think more philosophically. And I think that in itself is already an advantage of learning property-based testing. So I think we're out of time. A small thank you to Slidesgo: if you think this is a nice presentation, I pulled the template from their website, and I have to contribute back and mention them.
I also want to thank you all for attending this early and for listening, and the organizers, of course, for not forgetting us. So if you think: well, this was a nice introduction, it sparked my interest, how can I continue learning this? There's a good book on a website called propertesting.com. If you're not using Elixir or any of the other languages it covers, there are also libraries for other languages: Python has Hypothesis, for example. Look it up under either property-based testing or generative testing; some communities call it differently. And if you think: well, how should I think when writing properties? Then John Hughes, one of the founders of Quviq, has a good talk in which he discusses how you come up with these properties, how you think in this way. So I don't know if we have time for questions.
gen_statem Unveiled: A Theoretical Exploration of State Machines
especially state machines and how they are handled in Erlang, and also from a theoretical point of view. So, it's up to you. Thank you. All right. Yes, as he said, I'm relatively young, but I'm an old-school guy, so I code in Vim and use Erlang. So, this went too fast already. I work at Erlang Solutions. We do Erlang stuff: concurrency, scalability, the useful things that most of you are hopefully familiar with, and we also contribute a lot to open source. This talk is going to be about state machines, as you heard. First, a question of protocols. What are protocols? I wanted to make a survey and ask you and so on, but we have limited time, so I'm going to answer the question already: a system of rules. A few examples — okay, I need to point here for this to work. A protocol defines a system of rules for the syntax and semantics of the program that you want to write. Some examples, the usual ones: TCP for network communication; it's connection oriented, stream oriented, messages are ordered and they are acknowledged. Another common example: TLS, for privacy, integrity and authenticity. Encryption, very important; I hope that everybody has HTTPS enabled in their browsers by default. Some other examples are file formats or markup languages: parsers for them can also be implemented as state machines. The two classic examples, XML and JSON. XML is particularly interesting to me because I work on an XMPP messaging server, written in Erlang, of course. If you saw our talk at CodeBeam — for those that are following CodeBeam — Pablo and I talked about the state machine re-implementation in MongooseIM. This is a bit of a continuation of that.
Some more complex protocols can be implemented as state machines, like HTTP and, as I mentioned, XMPP, which is my specialty. It's extensible — that's the X, and my favorite part of the whole thing — an instant messaging protocol that also has presences (the green bubble: whether your friend is connected or not) and does contact list maintenance in the core protocol, with 500 extensions and build-your-own. This is the state machine diagram for the protocol. Much like a flowchart on steroids; I really like that analogy. With state machines, it's the usual thing, how you think about state machines: you draw the states with some arrows, and the arrows have labels about how you transition to the next state. Finite state machines give you a way to visualize a system that can be very complex. Why state machines? State machines can be seen as a model: we want to model the behavior of a protocol that can be very complex, like TLS or HTTP, which most of you will be familiar with, or XMPP, my specialty. Let's talk a bit about state machines in particular. A few formalities: I studied mathematics at university, so I'm excited by these weird symbols, but some people can find them off-putting, so I will try to make it pleasant. Some terminology: we define an alphabet — mathematicians use Greek symbols — which is the set of input symbols: zeros and ones, or ASCII characters, UTF-8, or complex symbols treated as a single element. And you can establish equivalences. One of the weakest classes is the regular grammars; it's how you do regexes. A regex — this thing that is written once and never read, but very powerful — is theoretically equivalent to a finite state machine. Again, this is jumping ahead too fast. Something a little bit more powerful is the pushdown automaton. I'm not going to focus on this one too much; the key difference is that it's the same thing as before, plus a stack, and the stack behaves as you would expect.
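The regex-equivalence claim above can be made concrete in a few lines: a finite state machine is just a transition table plus a start state and accepting states. A toy deterministic automaton equivalent to the regex `a*b` (any number of a's followed by exactly one b); the state names are arbitrary:

```python
def make_dfa(transitions, start, accepting):
    """Return a function that tells whether the DFA accepts a string.

    Missing transitions go to an implicit dead state that can never accept.
    """
    def accepts(s: str) -> bool:
        state = start
        for ch in s:
            state = transitions.get((state, ch), "dead")
        return state in accepting
    return accepts

# DFA equivalent to the regex a*b.
accepts = make_dfa(
    transitions={("q0", "a"): "q0", ("q0", "b"): "q1"},
    start="q0",
    accepting={"q1"},
)

print(accepts("aaab"), accepts("b"), accepts("ba"))
```

This is the whole formal object: the drawing with circles and labeled arrows from the slides is exactly this dictionary.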
The transition function that used to take the state and the input symbol now also takes the top of the stack, and the output of the function includes whether you pop something from the stack or push something onto the stack. A PDA is said to consume a string you give it if it arrives at one of the final states with an empty stack. There are equivalent definitions — not all definitions require the empty stack, but I chose that one. They are equivalent to context-free grammars: parsers, but not compilers. Why not a compiler? The thing about being context-free is that it doesn't remember symbols that were defined before. For a compiler — for example, a C compiler that needs to remember the definition when you say int i and then use i later below — the parser doesn't remember that; you need symbol tables. The parser only builds the syntax tree. And the fancy one, the computer, theoretically: Turing machines. Which is again the same thing, but now the stack is replaced by a tape that is infinite. It is equivalent whether it's finite on one side and infinite on the other, all of those are equivalent, and whether it has two tapes is also equivalent; we will get to that. The function takes the tape and the action: go one to the left and write something, or go one to the right and write something. Very similar. A Turing machine is said to consume a string when it halts, that is, when the next step is undefined. You have all heard of the halting problem: there is no way to know in general whether a Turing machine will halt. That is important. They are equivalent to unrestricted grammars: compilers. In the Chomsky hierarchy there are four levels; the three things that I described are types 3, 2 and 0. There is something at level 1, context-sensitive grammars, that is not directly useful for the moment, so I skip that. So how do they compare? This goes very fast sometimes. So that's the power that they have: a Turing machine can do all the others, and a PDA can do the one below it.
They contain the power of each other. Two FSMs running together still have the same theoretical power, the same way a PDA with a finite buffer, or a PDA plus a finite state machine, is still only as powerful as one PDA. Turing machines — whether multi-tape, or with a tape bounded on one side — are all equivalent again. A Turing machine doesn't get more powerful by giving it 100 tapes; it maybe gets more performant, theoretically, but the problems it can solve are all the same. And a PDA with two stacks is really a Turing machine, because then you can just go in both directions: when you give a PDA two stacks, you build a Turing machine. So, conceptually: finite state machines can keep track of one thing, the state. Pushdown automata can keep track of two things, the state and the top of the stack. And a Turing machine can keep track of infinitely many things. When I was going through the mathematics and came to this conclusion, I found it funny for a completely unrelated reason. The Indo-European languages — I mean human languages — used to have the concept of the dual, as something different from singular and plural. The function that these machines compute depends on one thing, two things, or an infinite number of them. And I found it very funny how human languages used to have such a thing as the dual, as a different grammatical category than one and many: when you build the declensions, they had a different form. Why do I know this strange thing about languages? Because I live in Poland, and Slavic languages have some remnants of that dual concept. There is this famous joke that in Polish you have like 100 ways to decline the number two; you have more ways to decline the number two than the number three, because of that old dual. So two is special. I live in Poland, but I'm not Polish. It's challenging. So, do FSMs produce output?
Let's move slowly toward what is useful here. We can define finite state transducers: the same thing as before, but supplemented with an output alphabet. The function takes the state and the input symbol and decides the next state and a symbol for the output. Accepting a string is defined the same way, and they are also equivalent to regular grammars. When it comes to the problems they can solve, again, they're all equivalent: you get fancier tools, but the properties are all the same. There are many variants, but let's focus on two ways of defining transducers, Mealy machines and Moore machines: whether the output symbol — I have a laser, yes — depends on the input and the previous state, or only on the previous state. There is a way to define a Moore machine from a Mealy machine, but not the other way around, so Mealy machines are a bit more powerful. Now something a bit more useful: how do they compose? They are still equivalent to FSMs, but transducers can be composed. We are getting into a bit of engineering — almost there. This is a thing, laser. Yes, oh god. Come on, sometimes. So, given three sets of states and three alphabets, one machine goes from one state and one alphabet to the next state and the second alphabet, and the second machine uses the output of the first as its input, so you can define the composition as a single state machine that takes the first alphabet and the first set of states and gives you the third alphabet and set of states. Composition, cool. Why? Because you can implement all these things as state machines where the output of one is the input of the next. Take my stack, XMPP: you can implement TCP as a state machine. Have you heard of the new Erlang socket module? It is implemented on top of gen_statem — go look at the source code. So I have the output of one gen_statem flowing into the input of the next gen_statem.
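The talk's real examples are in Erlang, but the Mealy-machine idea (next state and output symbol computed from state and input) and their composition can be sketched in a few lines of Python; every name here is illustrative, not from the talk:

```python
def run_mealy(step, state, inputs):
    """Run a Mealy machine: step(state, sym) -> (next_state, out_sym).
    Returns the produced output string."""
    out = []
    for sym in inputs:
        state, o = step(state, sym)
        out.append(o)
    return "".join(out)

# Machine 1: upper-cases every other symbol (the state is a flag).
def alternate_case(state, sym):
    return (not state), (sym.upper() if state else sym.lower())

# Machine 2: replaces vowels with '*' (a stateless Mealy machine).
def censor_vowels(state, sym):
    return state, ("*" if sym.lower() in "aeiou" else sym)

def compose(step1, step2):
    """Composition: machine 1's output symbol becomes machine 2's input.
    The composed state is simply the pair of both states."""
    def step(state, sym):
        s1, s2 = state
        s1, mid = step1(s1, sym)
        s2, out = step2(s2, mid)
        return (s1, s2), out
    return step
```

Running the composed machine pipes each symbol through both transducers in one pass — the same shape as piping TCP output into TLS into an XML parser.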
TLS is also implemented as a gen_statem, throwing output to my thing, the XML parser, which throws its output to the XMPP protocol. So we are composing things. One last theoretical thing. The union of FSMs — uniting all the states and strings — is also an FSM. Intersection — keeping the states and input symbols they have in common — gives you a very small FSM, but still an FSM. Reversing: still an FSM. The empty machine — no states and no input — is also an FSM, one that does nothing when you take its union or concatenation with another FSM. And a homomorphism — a function that transforms alphabets and states into other alphabets and states — preserves the structure of an FSM. So FSMs form a semiring, which is an algebraic structure. Why is it useful to have such algebras? To prove things that you cannot prove with Turing machines, because Turing machines do not form an algebra. So now let's do some engineering: gen_statem. As I said before, it's a Mealy machine: it gets the state and an input symbol and produces the next state and an output symbol — you follow, I hope. We can consider that the inputs are the messages in the mailbox and the output symbols are side effects, for example sending messages to another mailbox. gen_statem — I'm a big fan. I love it, but I know people sometimes don't use it, maybe because it seems confusing or complicated. So I'm going to try to explain one thing that is very useful here: the extended mailbox. There was a discussion when the OTP team put up the pull request for gen_statem — a big discussion with over a thousand messages that was probably forgotten, but when I discovered gen_statem and liked it, I went to the source and read that super long thread, and there are useful things said there. A way to visualize a gen_statem: imagine that it has one queue — something more than the process mailbox — with three pointers.
The head points at the oldest event, the tail points at the youngest, and current is where I am now. You can move where current is with some of the actions that gen_statem gives you, for example postponing an event. Postponing an event means that current moves to the next event, but the event is not forgotten. There is a different action that will put current back at the head. If you don't postpone an event, consuming it removes it from the queue. When the state changes, current goes back to the head. The next_event action inserts things where current is, not at the tail, and timeouts insert things at the tail. So the gen_statem engine allows you to extend the inputs that your formal state machine is going to get. How does it work? Imagine we are here: we have event one and we decide to postpone it. What happens? It's still in the mailbox; we are now going to deal with event two. With event two we decide to do some stuff and then go to the next state. So event two has been processed, and because we changed the state, current goes back to the head. Now we handle event one again, and this time we decide not to change the state, but we generate a new input, as if this process had received a message. This event A, which is ad hoc — we just created it — is inserted where current is, so it's the next event we are going to handle. We can decide to postpone it too. Now we handle event three; with event three we do some stuff but don't generate events — imagine there is real code doing work here. So event three has been dealt with. Now you go to event four, and you decide to postpone event four but also insert an event B: event four goes behind, you insert event B — you get the idea. So the engine gives you a way to extend the process queue. What am I doing with time? Oh, one more important power — I'm not going to have time for everything. One more useful power of state machines: managing accidental complexity.
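The walkthrough above can be simulated with a toy queue. This is my own Python sketch of the semantics just described — postponed events are replayed after a state change, and inserted events land where current is — not OTP code:

```python
from collections import deque

class EventQueue:
    """Toy model of the gen_statem event queue described above.

    `pending` holds events not yet consumed in the current state;
    `postponed` holds events to be retried after the next state change.
    """
    def __init__(self, events):
        self.pending = deque(events)
        self.postponed = deque()

    def take(self):
        # Consume the event at `current` (the front of what's pending).
        return self.pending.popleft() if self.pending else None

    def postpone(self, event):
        # Keep the event; it becomes visible again on a state change.
        self.postponed.append(event)

    def insert(self, event):
        # Like next_event: inserted where current is, not at the tail.
        self.pending.appendleft(event)

    def state_change(self):
        # Current goes back to the head: postponed events come first again.
        self.postponed.extend(self.pending)
        self.pending, self.postponed = self.postponed, deque()
```

A run mirroring the slides: postpone event one, consume event two, change state (event one reappears first), then insert an ad hoc event in front.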
There is a talk I want to recommend — quite an old one, maybe 10 or 15 years ago — by Ulf Wiger, where he was complaining about some limitations of gen_fsm, and even of the gen_server that we all use. A very useful talk, and I have one tiny answer to it with the new gen_statem, which didn't exist back then. A typical state machine: on, off. You can imagine you're switching a light, but your switch talks over a cable protocol to the light. So when the user says on — this is a gen_server — and the state is off, you send a request to turn on, you wait for the answer, it's on; and vice versa. Relatively intuitive code. Now imagine that the request through the cable protocol is not synchronous, and that the switch cannot block — it needs to do other stuff. So you send an asynchronous request to the light — hey, turn yourself on — and continue doing other things, but then the user sends more offs and ons. What do you decide to do here? It's not part of the protocol. The events are now asynchronous and out of order; there is no global ordering. So you need to choose what to do. We can use a state machine, the old way, where the name of the function is the name of the state, and you can postpone things: if you are already running a request, you postpone the event, and if the user presses on a hundred times, by the time the light says on, you have changed the state and you're going to handle all those postponed events — it's already on, so you just do nothing. But the code is terribly symmetric; it feels repetitive. So, problems: there is no ordering when things are asynchronous. Tying yourself to the ordering of events leads to accidental complexity — this is Ulf Wiger's point: when the order changes, the whole implementation changes. And it grows relative to the number of states. This is super simple — it's a light that goes on and off.
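To make the on/off example concrete, here is a hedged Python sketch (not the talk's Erlang code) of a switch that cannot block: while a request to the light is in flight, user presses are postponed and replayed once the reply arrives, so the outcome does not depend on message ordering:

```python
def run_switch(events):
    """Tiny model of an asynchronous light switch.

    States: 'on', 'off', or ('waiting', target) while a request to
    the light is in flight. User presses arriving in the meantime
    are postponed and replayed when the reply comes back.
    """
    state, postponed, log = "off", [], []
    queue = list(events)
    while queue:
        ev = queue.pop(0)
        if isinstance(state, tuple):            # waiting for the light
            if ev == ("reply", state[1]):
                state = state[1]                # the reply arrived
                queue = postponed + queue       # replay postponed presses
                postponed = []
            else:
                postponed.append(ev)            # cannot block: postpone
        elif ev[0] == "press":
            target = "on" if state == "off" else "off"
            log.append(("request", target))     # asynchronous request
            state = ("waiting", target)
    return state, log
```

A press during the waiting state is simply deferred; when the light confirms, the deferred press is handled in the new state, exactly the postponement pattern described above.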
But imagine complicated protocols — for example a middle layer between a very talkative protocol and a quiet one — and code reuse. So I really like the handle_event_function way of handling things. It's a single callback function that gets the event, the state and the data. By the way, the terminology is confusing, because we are used to the state of the process from gen_server; in gen_statem, the state is the state machine state, and the other thing — where you save, I don't know, the socket, for example — is called data. Just confusing terminology. With this, you can pattern match on whether you're in the same state, and the previous code that was terribly repetitive becomes a single function head. This is, I believe, a way to answer the problem that Ulf raised — and now I'm exactly on time. One more slide. It answers that problem in a way that lets you reuse code and decide the order of events, because you can postpone things and you can also insert things. Quickly, why I use this in XMPP: we had this implementation, and there is one thing that I really like here, the composing. As I said before, you have the TCP state machine that feeds TLS, that feeds XML, that feeds messaging. If we want to implement this in a single process — this is a simplification of my data — I have a parser and the crypto library, and when I get the TCP payload, this is how we do it in Mongoose. Not TCP itself — for TCP we just use gen_tcp, so it's a separate process — but crypto and the XML parser we implement on the spot. There is C code that parses part of the XML; for example, it gives you a new parser with a buffer and the XML structure, which you can then use to generate the events my protocol cares about, the XML payloads. That's one use case that we have. That's me — you can find me by that picture in all the places.
Those are some of the projects I work on, and I was going to say questions, but we are one minute late. Thank you.
Guess Less with Erlang Doctor
Okay, I... yeah, it switched off by itself, I didn't touch it. So, when you debug your code — when you're trying to find out why you have a strange error or something like that — you can use Erlang tracing. It's very powerful, as we said before, and you can use tools like dbg or recon, which use Erlang tracing underneath. The first step is to choose which functions you want to trace, because you don't trace everything — you can trace whatever you want, but you cannot trace everything at once. So you say: I want this function, or this bunch of functions, to be traced. Then, when you call those functions, your traces get printed out: you get the information that this function was called, these were the arguments, the return values, things like that. You can send it to the console, to a file, or over the network — and that's what I did for many years. I said many years: for 15 years with Erlang, I think, I was setting up a special node that collected traces from all the other nodes. Afterwards, you either read the traces you collected, or you search them, grep them, parse them, do other operations on them if you want. But these are mostly just text logs, let's say. And the problem is that very often you have to repeat the whole process, because you traced one function but found out that maybe the problem is in another function, maybe in a completely different module, and so on. So you repeat and repeat, and that can be a problem. This doesn't scale well.
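The mechanism described is Erlang-specific, but the shape of it — pick functions, then watch their call and return events — can be illustrated with Python's standard tracing hook. This is only an analogy of my own, not how dbg works:

```python
import sys

def trace_calls(func_names, thunk):
    """Collect call/return events for the named functions while running
    thunk() — a rough analogy to 'choose functions, then trace them'."""
    events = []

    def tracer(frame, event, arg):
        name = frame.f_code.co_name
        if name not in func_names:
            return None                      # not selected: don't trace
        if event == "call":
            events.append(("call", name, dict(frame.f_locals)))
            return tracer                    # keep tracing to see the return
        if event == "return":
            events.append(("return", name, arg))
        return tracer

    sys.settrace(tracer)
    try:
        thunk()
    finally:
        sys.settrace(None)                   # always switch tracing off again
    return events

def fact(n):
    return 1 if n == 0 else n * fact(n - 1)
```

Calling trace_calls({"fact"}, lambda: fact(2)) yields the call events with their arguments and the return events, much like a trace printout.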
What I mean by "doesn't scale" is: if you try to trace a lot of functions — well, I found that, at least for me, once I get something like 100 to 1,000 traces, it becomes difficult for a human to read that amount of information. Okay, but you can search, for example. That also has a limit. This is just a rough estimate, but for me, at around 10,000 to 100,000 traces it becomes difficult, because the system itself can slow down: IO slows down, and it's actually quite surprising, but sending traces to a file or over the network is quite expensive, and it can slow the system down a lot. It's a heavy operation — sometimes I had traces still accumulating three minutes after I had finished tracing, the messages still in the queue, still being processed. So this doesn't scale that well. Okay, so let's sum up. Choosing the functions to trace is kind of guesswork. Not always, of course — sometimes we know precisely — but most often I don't: I know roughly what I'm looking for, but not exactly, and that's the problem, because here I need to know the exact function in order to choose it for tracing. So: possibly many iterations of the process. For me, this is like ad hoc logging. It's very much like logging, but I don't need a log statement in my code — I just choose, dynamically, right now, which functions I want logged. And what if the traced behaviour is a test that fails every 20 runs, for example — do I need to repeat this 20 times? That's the problem, right? The answer to some of those issues is Erlang Doctor, at least for me and for the people who've used it. So what's the difference? You set up tracing for an entire application. Not always — sometimes that's not possible and you have to trace individual modules — but usually you can start with one entire application.
You capture traces, store them in an ETS table, and clear them afterwards. And you can repeat this querying instead of repeating the whole process, because you've collected enough to query and find out about different functions: was this function called, maybe another one, and so on. Of course, you rarely have to repeat it — for me, it's only when I traced the wrong application, for example because the problem was not in my code but in a library I used, and then I need to trace another Erlang application. But that doesn't happen often. This scales much better. What are the limits? On my laptop, for example, querying becomes slow at about 10 million traces collected, which is quite a lot — that's like tracing a system under heavy load. And of course it depends on the size of the individual traces, because you can have big arguments being passed, things like that. System memory becomes the limit at about 50 million traces — sometimes it's 10 million, sometimes 100 million, it depends — but basically when you have a few million traces, it's probably too much. So there is a limit, of course. To sum up: very few iterations of the whole process, usually one. For me, this is ad hoc instrumentation instead of ad hoc logging, because you're gathering structured information in an ETS table — I'll show you the details in a moment. And there are many use cases: debugging, and system exploration — I often use it just to learn about a system: I run the system, do the usual stuff while tracing the whole application, and then query what the system actually did. And you can do some lightweight profiling without setting up a profiler for a particular function. So let's go to the Erlang Doctor itself.
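As an aside, the capture-once, query-many workflow just described can be mimicked in plain Python, with a list standing in for the ETS table. This is purely illustrative; it is not erlang_doctor's API:

```python
class TraceTable:
    """In-memory trace store, loosely mimicking 'collect once, query many'."""
    def __init__(self):
        self.rows = []            # each row: (index, kind, fn, payload)

    def record(self, kind, fn, payload):
        # Auto-incremented index, like the tool's trace indexes.
        self.rows.append((len(self.rows) + 1, kind, fn, payload))

    def select(self, pred):
        # Queries never consume the stored traces, so you can re-query.
        return [r for r in self.rows if pred(r)]

tbl = TraceTable()

def traced(fn):
    """Decorator that records a call row and a return row for fn."""
    def wrapper(*args):
        tbl.record("call", fn.__name__, args)
        result = fn(*args)
        tbl.record("return", fn.__name__, result)
        return result
    return wrapper

@traced
def double(x):
    return 2 * x
```

After running the system once, you can keep asking new questions of the same table — which calls happened, which returned a given value — without re-tracing.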
How to get it? From GitHub, for Erlang or for Elixir — for Elixir it's called ExDoctor, which sounds like a former doctor, which is a bit funny. There are also Hex packages for both of them. So how do you run it? Three options. The first one, which I sometimes use when firefighting, is for when you don't have it in your shell but you want it right now, in a system that's misbehaving or something: for both tools there are snippets that download it from the Internet, compile it and run it, which works in this particular case. It's probably the best option if you just want it right now, and all you need is Internet access, which is usually the case. The second option, which I always use in development, is to set it up in your .erlang or .iex.exs file, so that it's always available whenever you start any Erlang or Elixir shell, in your project or wherever. And the third option is packaging: you can include it in your application, in your software, if you think it's that useful. Okay, let's move on. Let's start. The examples are in Erlang, but they are also available for Elixir in the docs; you can find them. The first thing to do is to start it. It's a GenServer, so it just starts a GenServer. There are a few other examples of how you can start it: you can choose a different ETS table — you can have multiple ones and switch between them if you want — and you can limit the size of the table. That's very useful in a production environment: if you need to do some tracing, you set the limit to 1,000 or so, the table will never grow bigger, and you will never consume all the memory. And there is also a start_link. Okay, so let's set up tracing. I'm tracing an example module — it's a test suite, but it contains functions that we can trace, so it's good. So I'm just starting the tracer.
I can also trace a specific function — providing module, function and arity — or a whole application, or multiple applications. And a bit more: you can trace messages, you can trace specific processes, and so on; there are a few more options. Capturing the traces: okay, let's call a function from the traced module. I'm calling a sleepy factorial of 3 — a function that calculates a factorial and sleeps for 1 millisecond between each step, so there will be some time difference. Very simple. Okay, now we can stop tracing. It's a good habit, because then you don't accumulate traces when you don't want them anymore. Now, what can you do with the traces we've accumulated? Let's read the record definition. By the way, I'm using records because they are very performant — even maps gave me five times worse performance for some operations, so I'm using records. So, let's get all the traces. I got all of them, and I don't want to talk about everything, so let's talk about the arguments. These are the arguments, in brackets, and these are the return values: for calls and for returns. I will introduce the other fields as we go. Now, trace selection. You can do a select — a fancy way of doing an ETS select with ets:fun2ms. Let's get all function calls, and for each call, its argument. So I'm getting a list of arguments, and of course this is a recursive way of calculating a factorial, so it's 3, 2, 1, 0. There is also select/2, which takes any term and looks for that term. Here it found it, for example, as an argument; here it found it as a return value. But there is more: the term can be hidden inside any list, map or tuple — it will look recursively inside your data structures to find whatever you're looking for.
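That recursive search — finding a term anywhere inside nested data — is easy to picture. Here is a small Python equivalent of the idea; this is my own sketch, not the tool's API:

```python
def contains_term(haystack, needle):
    """True if `needle` occurs anywhere inside nested lists, tuples,
    sets or dicts — mirroring a deep search through trace arguments
    and return values."""
    if haystack == needle:
        return True
    if isinstance(haystack, (list, tuple, set)):
        return any(contains_term(item, needle) for item in haystack)
    if isinstance(haystack, dict):
        return any(contains_term(k, needle) or contains_term(v, needle)
                   for k, v in haystack.items())
    return False
```

With this, searching a trace store for an error term reduces to filtering rows whose arguments or return value contain the term, however deeply it is buried.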
For example, you can look for an error message, even if it's just called "unknown error" — which happened to me once: I put "unknown error" in there and instantly found the function that caused it. There is also filter. It's similar to select, but here you can pass any function. It's a bit slower, but it simply has more features. You can, for example, assign a result to a variable and then search in that list again, narrowing down your search: you got two traces, now you search within those two traces, but only for calls, and you get only one. So that's another way to query. The tracebacks are very important for me, because I want to know the source where a particular function call originated. Here I'm looking for any return value of one. The sleepy factorial of one matches — it returned one — so this is its traceback: the call itself is first, sleepy factorial of one, and the rest is the traceback. The sleepy factorial of zero also returned one, but it's skipped because of some skipping logic. Those are details, but it helps you limit the output you get. Actually, you can disable that and get all the tracebacks with the output all option; then there is no skipping of tracebacks that are included in other tracebacks. You can limit the number of matched traces, reverse the call order, and search in a different table or in a list, for example. And you can get only the first traceback if you want — a very useful shortcut, let's say. You can also get the traceback for a particular trace record, or just for an index of a trace, because there are auto-incremented indexes. And similar to tracebacks, you have ranges, and ranges look inside.
A traceback tells you the source, while a range gives you all the traces from a function call until its return — everything in between, from one process. So here, for example, we are looking for any traces of function calls with one as an argument, and we get a range of traces from the call until the return. Range options: limit call depth is quite interesting and very useful, because with a depth of one you get just the call and the return. Searching in a list of traces is also possible, as is getting only the first range if there are many, and getting the range for a specific trace. Quite a lot of options — I've been adding and adding over a few years of developing these tools, and they're all quite useful. Utilities: two simple utilities I wanted to talk about. One is to just look up a trace — nothing fancy, an ETS lookup does it, right? But then you can execute the trace, which is quite useful for me: if it was a function call, I can just execute it again right now. For example, say I fix a bug: instead of writing some long expression, I can just execute the trace and see if the result is the same or different. Or I can trace again, right? I can start the tracer and trace again. Okay, now a bit of profiling. I find this lightweight profiling very useful, because it doesn't put as much stress on the system as fprof, for example, the Erlang profiler, and it's instantly available — I don't have to prepare for it in any way. So, call_stat: it's statistics aggregated by a key function. Here I'm aggregating everything under the atom total, so I'm getting four calls with this accumulated time and this own time — they are equal because I'm accumulating everything. But if I aggregate by function argument, you can see that there was one call for each of the arguments.
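The call_stat idea — choose a key function, then aggregate counts and times under each key — looks roughly like this in Python. The data shape here is hypothetical, not erlang_doctor's actual record:

```python
def call_stat(traces, key):
    """Aggregate (count, total_time) per key(trace).

    `traces` is a list of (fn, args, duration) triples; `key` maps each
    trace to an aggregation bucket, e.g. lambda t: "total" to lump
    everything together, or lambda t: t[1][0] to group by first argument.
    """
    stats = {}
    for tr in traces:
        k = key(tr)
        count, total = stats.get(k, (0, 0))
        stats[k] = (count + 1, total + tr[2])
    return stats
```

Aggregating everything under one key gives overall call count and time; aggregating by argument shows one call per argument, as in the factorial example.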
And this call took the longest time — but that's accumulated time; its own time was actually the shortest, right? You can also do filtering here: you can say "when N is smaller than 3", and we just skip one of them. And you can sort them, and print them as a table — some nice utilities to have. The last feature I wanted to talk about is function call tree statistics. I called it that because — let's say we have a function that calculates the Fibonacci sequence in a suboptimal way; you probably all know it's suboptimal, it branches a lot. Let's clean up the traces, trace again, call fib of 4 — which returns 3, the correct value — and stop tracing. So we now have different traces in the table, and let's do it: let's call this function with the default arguments. It says that there is a call tree — by which I mean function calls, returns, everything inside — repeated twice, that's this number 2, and it took 10 microseconds; there is no sleep in this example, so it's 10 microseconds in total. And this is how the call tree looks, so you can see that, indeed, it repeated twice. This can help you find redundant code. Okay, this function also has some options to customize it a bit, but I don't have time to talk about them. And table manipulation: you can get the current table, dump it to a file, and load it on a different Erlang node, and then continue the analysis on that different Erlang node. That's all I wanted to talk about — and that's me on a mountain bike. Thank you.
Mainline Linux on Qualcomm SoCs, are we here now ?
Thanks. Welcome to my talk — my first time at FOSDEM, so please bear with me. I will give a summary of where we are now with Qualcomm SoC support in mainline, because I think it's time to take stock. I'm part of the Linaro Qualcomm Landing Team; I joined a year and a half ago. My main daily work is Qualcomm platform support: I'm a maintainer, with Caleb as co-maintainer, of the U-Boot board support, and I bring new platforms upstream in Linux. I also maintain and develop other pieces of Linux, namely the Amlogic SoCs and the DRM bridge and panel drivers, and I have been working only on upstream Linux and U-Boot for the last few years, so I have a lot of patches upstream in both. So, Linaro. Linaro was founded to enable and make Linux and other software work better on Arm. We're basically helping vendors and product makers build better products on Arm, with plenty of services covering the whole software stack. Open source is at the heart of Linaro; we mainly work on open source software. Qualcomm joined Linaro 10 years ago because they wanted better open source support, which was minimal at the time. They joined to support Linux, but it quickly grew into collaboration in plenty of other places, and so far so good: Linaro is happy and Qualcomm is happy with the situation. Over the last 10 years, Linaro and Qualcomm pushed a lot of really big features for the Arm ecosystem in Linux, namely the power framework and the energy-aware scheduler, which really changed the way Linux schedules across cores. Qualcomm participated in the standards and software structure on Arm servers; the DragonBoards are the reference today to test Android — AOSP, for example. We have CodeLinaro, which is the principal code hosting for Qualcomm and for Linaro engineers. And for the last three years, we have been pushing the flagship mobile platforms.
So this year — in the last three months — I pushed the Snapdragon 8 Gen 3 upstream, and it was 98% supported two months after the announcement, which is pretty cool. So, the agenda: where we came from 10 years ago, where we are now, the supported devices, a demo, and what's remaining. So, 10 years ago: Qualcomm and vendors using Qualcomm SoCs shipped kernels with something like three million lines of change. It was basically a separate kernel inside a kernel. This was a problem, but a hard one to solve: how do you upstream that much change into mainline Linux? That's why Linaro started the landing team, to fix this. And this is a graph I made to show how Qualcomm managed their downstream kernels over the last 10 years. Initially, they used a long-term kernel for a very long time and kept accumulating new SoC support on it over each time frame. Then, for the last four years, the company changed strategy: they stopped adding new code and are simply changing existing code. I think the reason is, first, the Android strategy with GKI, and second, that mainline Linux now has enough support and gained the principal missing architecture pieces over time. This is what was posted eight or nine years ago, and it was true: mainline support was mostly nonexistent, and Qualcomm was the only SoC vendor upstreaming almost nothing. Happily, it changed. Linaro worked on Qualcomm-specific features over the last 10 years. The biggest feature was remoteproc, to handle the DSPs: before, Qualcomm had a complex custom solution to speak to the DSPs — something like two million lines of code just for that — and the biggest work we did was to implement it properly upstream. We now have a fully integrated way to speak to the DSPs, and it works really, really well. The other big feature of Qualcomm SoCs is interconnect, because Qualcomm SoCs are very complex and you can fine-tune any data path in the SoC and change its bandwidth.
You can change the performance of any data path, so it was a huge feature, and it took a very long time to upstream. The Venus video decoder was complex because it needed that support first; the DSP audio support also needed proper DSP handling. The DRM driver is a huge beast, because the graphics display engine is really complex and supports a lot of features. Lately, SoundWire support was upstreamed for Qualcomm and other platforms, and we worked on plenty of tiny but very time-consuming subjects — all of these are needed to actually boot a platform. This is a graph of the upstream contributions. You can see it started quite slowly, but all these features are really complex to upstream, because they are either Qualcomm-specific or so complex that they don't fit in any existing framework. So it took seven or eight years to push the base features needed to boot a high-end device, and in the last four years — because we had the support for all those small but very important features — we were finally able to boot high-end commercial platforms. We had a lot of contributions from Linaro, Qualcomm and also the community, which explains the huge peak in the last two years. This is a graph of the supported boards over time: 10 years ago we had only two boards, the DragonBoards, and now we have around 300 boards, which is huge — and most of them are community boards, not reference or base boards. These are the newly supported boards over time: for each release, the number of boards added. You can see a huge number of new boards added in the last 10 releases, which is great, and the community helps a lot here. As for the supported boards: like Caleb said, the historical DragonBoards were the first really publicly available boards in the SBC form factor, and they really helped start the mainline development. And while they were low-end SoCs, we supported a lot of features.
They support camera and other high-end features, so they helped develop the baseline support that eventually enabled the high-end SoCs. Then, like Caleb said, there are the robotics boards. They are quite expensive; they're the current Qualcomm offering in the IoT world, and the aim is to support them fully upstream. Each board is mid-end or low-end, so it's quite diverse, and it helps with supporting all the new features. Then there are commercial phones which run very, very well. You shouldn't expect all the features for daily usage — you don't have haptics, you don't have camera — but they work fine, and you can boot them and actually use them with Wi-Fi, Bluetooth and storage. You have a few convertible tablets running mainline Linux, like the Lenovo Yoga C630, and then there are the Qualcomm high-end reference devices — the devices we use daily to upstream the high-end platforms. This one is a one-year-old platform, this one is two years old, and this one is actually running this presentation. And there are the specific Qualcomm reference devices with test points, used by Qualcomm engineers to develop Android, and we upstream mainline Linux support with them. As I said, I was upstreaming the Snapdragon 8 Gen 3 support, the latest Qualcomm high-end SoC, which the Samsung phones announced two weeks ago are using. In 6.1-rc1, announced just days before the Samsung phones, we already had display, UFS, PCIe, USB, thermal, cpufreq, suspend/resume and crypto working on mainline Linux — check out Linux master, it works. In the meantime, we developed audio, DisplayPort alt mode, full DSP support — modem, compute and audio — and USB PD and charger; the GPU is the last remaining piece, and I won't talk about it. So the flagship device you could use today is the Lenovo ThinkPad X13s. It's actually the best platform to use Qualcomm devices: it's really powerful and you can use it daily.
My colleagues are actually developing mainline Linux on this platform; my colleague can use it for about eight hours of working time, and you have almost everything supported. This is an example of what is supported: you have the GPU working, storage, keyboard, thermal, USB, suspend-resume, audio, and you can boot over EFI. But obviously there is still work in progress, like with every piece of software. The most important missing piece is the camera: the camera doesn't work. It's complex, due to the sensor providing raw data and Qualcomm not wanting to upstream the stuff, so it's a work in progress; we have something working, it's not public, we are working on it. Plenty of other small features are missing, like the embedded controller or the power measurement. Power optimization is infinite, it will never be perfect, so we're gaining milliamps every release; it's constant work. There are always some small Wi-Fi and Bluetooth issues. Audio needs active speaker protection; this is a big modern feature — all the new modern audio needs active speaker protection because it's no longer included in the codecs. And some things are still missing, like the fingerprint reader or video decode acceleration, but we aim to support all this in a short timeframe. So today, if you want to test mainline Linux on the X13s, you can use Fedora, Armbian, Ubuntu or Debian without changing anything; it will install directly and boot, and you can use it daily. So this is a great, great advancement. So, demo time. I mean, no need for a demo, because I'm running it: I'm running the presentation on a Qualcomm device. So yeah, for example, this is the 8550. You can play a video, for example; it works fine. You can switch — I'm still in full screen, so you can see everything is fine, the video is still running — so the GPU works very well. Oops, demo effect. Okay. So, to show it's really usable: you have Wi-Fi, Bluetooth and the GPU working, and this platform is one year old, but I got the hardware like two weeks ago.
So it was great. And the support for the board is actually on the list; it's made by the Qualcomm ARM maintainer, so it should be part of 6.9. So globally, what's remaining to properly support the Qualcomm SoCs? Power optimization: it's a long-term, nearly infinite work, because the hardware is complex and we still gain a bit every time. Performance: like I said, each data path can be optimized, and it's also a long, long journey to support power and performance tuning. There are still some advanced graphics features missing, mainly for non-phones and non-laptops, like HDR, multi-plane and so on. The video decoding accelerator is a work in progress; we're working with Qualcomm on it. Camera support is a big feature. On audio, we still need to support DisplayPort audio — audio over HDMI or DisplayPort — plus speaker protection, the sensor hub for the phones, haptic feedback and the vibrator. And there are the new platforms, because each year we have between two and three new platforms to support, either in computers, phones or IoT; it keeps us quite busy. So we need the help of the community, because we need testing and we need to support more devices. Thanks to the community, we have the largest arm64 changes in the last years; every single release we are a top changelog entry, because it's really actively changing. We are really supporting mainstream devices: phones, laptops, modems, accessories, convertibles. And we are working on new boot flows; Qualcomm is porting new devices; it will simplify installing new distributions. And if you want to know the status of each SoC, you can go to this address on our linux-msm GitHub page; it will give you a nice overview of the support. The last line there is the Snapdragon 8 Gen 3, so all the yellow cells will be green in four weeks from now. It's really kind of cool: we simply describe each feature in a file and it automatically generates a website. So it's really cool.
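The generated status matrix he describes presumably works from simple per-SoC description files; here is a hypothetical sketch of what such a file could look like (the file format, feature names and state values are my assumptions for illustration, not taken from the talk or the actual linux-msm pages):

```
# Hypothetical per-SoC feature description; a static site generator could
# turn a directory of these into the colored status matrix mentioned above.
soc: sm8650
features:
  display: mainline      # rendered green
  usb: mainline          # rendered green
  gpu: in-progress       # rendered yellow
  camera: missing        # rendered red
```

The appeal of this design is that contributors only edit small declarative files, and the website regenerates itself.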
So thank you for listening. I was happy to present the state of the Qualcomm SoC support and demo it live, and it works fine, so no demo effect. Thank you very much. Very nice. Does anyone have any questions here? Yeah, hi. When can we expect Qualcomm to start upstreaming support for the Linux that runs on the modems? On the modems? I have no idea, I'm sorry. Another question? Thank you. The question is, first: is Linaro or Qualcomm considering doing any upstreaming for legacy platforms? For what? For legacy platforms, for earlier chipsets. We do it daily. Okay, so this is also happening? Yeah, we continue adding features for older platforms daily, the community helps a lot, and we are testing it. And in fact, Qualcomm is pretty consistent in the firmware interfaces, APIs and registers, so we support all devices quite consistently. And the other thing you mentioned, specifically on camera: on Android you have a lot of out-of-tree drivers. Is there a plan for Qualcomm to actually get everything supported directly in upstream Linux? I hope so. And a question here, one second. Hello, very nice talk. Any plans for the Spectra ISP? So yeah, it's the same question: I don't know, it's not in our hands. Another one? Okay, I'll pass the mic. You talked about many distros already working. If we had, for example, a rootfs from another distro, what is the bootloader situation? Is it the same as on mobile phones and their SoCs, or can we just expect to boot from UEFI or similar? So, for the laptops, there is a functional UEFI shipped with the laptop, so there's no need for U-Boot. I think it's not perfect, but it works fine; you can directly install Fedora via UEFI on the laptop when you open it. So it works. Thank you. You mentioned something about video decoding; how exactly will that work? Will there be a VA-API driver, or will it use something else? Today there is already a Venus V4L2 driver for the old platforms.
And we are working to support the new platforms using V4L2. Qualcomm wants to push support for the platform, so we need to find a way to merge it and make it prettier. But yeah, V4L2. Okay, thanks. Another question, anyone? Yeah. Hey, thanks for the talk. I had a question about the availability of certain documents required to write a lot of the drivers. Is Qualcomm making those documents available to the public? No — as is usual in this industry, they don't want to document the hardware publicly. So for regular people who want to help, it would be like reverse engineering? Yeah, code. I mean, I've implemented all of that support using code only, almost no documentation. So it's hard. We need documentation for the more complex features, but for most of the features we use code, even us. Because in the documentation you have registers, but you don't have the state machine, you don't have the behavior of the hardware. Okay, we could fit in another question if there is any. Otherwise, yeah. Okay. Yeah, I'd actually like to continue the question that was just raised. So how does it work then? You sign an NDA with Qualcomm, get the docs, can write the code, but you're not allowed to document it? Yeah, or speak about it. That's how it works. Gotcha. Please give another round of applause for our speaker. And it was really all running from this device here, the board, no laptop. Yeah.
VoLTE for FOSS
One of you should attach the mic. Okay, do you want this one? I'll give you this mic. How can I use it? Put this in your pocket and attach this one to your... Yeah, that should work. Let's try it once. It shows here. I have a green light. Alright, next up we have a talk called VoLTE for FOSS — voice over LTE. I'm pretty sure we all want to see that on Linux Mobile. And this talk was originally supposed to be given by Marius from UBports, but Marius is not here today, so please give it up for Nikita and Ivan instead. So, hello everyone. I'm Nikita from the UBports community, also working for Jolla. People mostly know me from Telegram as NotKit. And since Marius is not here, we cannot be a full replacement, but we will try to share what we have learned so far while trying to make VoLTE and IMS work on Ubuntu Touch and other mobile distros. Currently it's still a mess, but I hope we can get more people involved, and if you want, you can stay here afterwards to discuss how we can implement VoLTE on more Linux distros and what we can do together. Can you turn up the volume, or put the mic up or something? Okay, is it much better now? Yeah. Great. Okay, good. Just go on, yeah. So, I expect people in this room are familiar with voice over LTE and what it is for, but briefly: it's a communication standard for voice calls over LTE networks. And there are similar standards for voice over Wi-Fi and for voice over 5G networks, the latter being called VoNR.
The main reason we have to worry about this is that GSM and 3G networks are now becoming a scarce resource, and if we want to make calls from our mobile Linux distros, we need to implement VoLTE at some point. And if you add voice over Wi-Fi, it allows other cool things: when you're roaming, you can connect to your mobile operator's endpoint and make calls to your home country at local prices, not roaming prices. So, let's start with how it currently works on Android. There's a picture from a telephony-on-Linux website, but the point here is that at the bottom there is the modem firmware. On top of it there is a modem interface library, a vendor library used by the RIL, which stands for Radio Interface Layer on Android. On top of that, it provides a HIDL server which implements the HIDL radio interface; on recent Android versions it became AIDL instead, but that's not what we care about at the moment. The frameworks part is what implements the communication with the HIDL radio interface, and there are vendor-specific IMS parts which plug into the Android IMS interfaces, but the vendor implementation is closed source and unfortunately device-specific, or chip-specific. When we go to Ubuntu Touch, we keep those four bottom layers, but we don't have the frameworks anymore. And here the problem comes: the IMS parts are provided by the frameworks on Android, while instead of the frameworks we have oFono, which talks to the radio interface, and oFono is talked to by Telepathy-oFono or other layers, depending on the distro. On Sailfish OS Telepathy-Ring is used, but these are just implementation details. So, if we don't have the IMS part of the frameworks, what can we do? From here we have multiple options. First, we can reimplement the Android frameworks part of IMS so it can still talk to the vendor interface, and that's how it's currently done on Sailfish OS for the Xperia 10 II and 10 III.
It's also been tested to work on other Qualcomm devices, but unfortunately the plugin by Jolla is currently closed source, as it relies on Qualcomm-specific headers, and I think Jolla is afraid of legal trouble if it's publicly released. On Ubuntu Touch we've been trying to use the same plugin, but the problem is that the Qualcomm IMS part is a black box, and sometimes it works and sometimes it doesn't, for no apparent reason. It's quite hard to understand what's happening, because basically all the oFono part is sending is just asking the modem, "hey, can you connect to the IMS for me", and the modem just answers "yes, I can", and that's it. You don't know when it's connecting, why it's connecting, how it's connecting — it's a complete black box. So, as you see in the picture, we can try to write an IMS plugin and plug it between the radio interface and oFono or some other telephony layer. It works, but it's device-specific and a bit of a pain. I've been trying to write a similar plugin for MediaTek devices now, and the idea of it is very simple: you tell the modem, "okay, please enable IMS, here is the IMS APN, connect to it", and copy the config from some path, but whether it works or not is a bit of luck, depending on your carrier. The positive of this approach is that you don't need 4G network knowledge, so you don't need voice-over-LTE knowledge, but it's a black box, and if it works, you don't know why. The second option we have is very similar to the first one, but it's maybe interesting for mainline people, and that's why I'm mentioning it: we can ignore the HIDL Android parts completely and just write a library or a driver that talks to the modem firmware directly. That's how it currently works on the PinePhone, actually, because on the PinePhone the Qualcomm-based modem is a separate USB modem, and you can tell the modem via QMI to enable IMS and voice over LTE.
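In practice, on the PinePhone this direct approach is often done with a vendor AT command rather than raw QMI; here is a hedged sketch (the `/dev/ttyUSB2` path and the `AT+QCFG="ims"` command are Quectel EG25-G specifics and my own assumption, not something stated in the talk):

```
# Hypothetical sketch for the PinePhone's Quectel EG25-G modem; requires
# root and the modem's AT port (the device path varies per system).
printf 'AT+QCFG="ims",1\r' > /dev/ttyUSB2   # 1 = enable IMS/VoLTE, 0 = disable
printf 'AT+CFUN=1,1\r' > /dev/ttyUSB2       # restart the modem so it takes effect
```

This illustrates the "black box on a lower level" point: the command only asks the firmware to enable IMS; all of the actual SIP/IMS signalling still happens inside the modem.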
So it's the same black-box approach, but at a slightly lower level, and I don't think we will use it on Halium distros, because it will cause a mess if Android and direct modem interface communication happen at the same time. But it's possible, yeah. This approach requires at least a little understanding of the network stack, and you also need to know your modem firmware protocol. On Qualcomm it's QMI, as mainline people probably know, and on MediaTek it's, interestingly, AT commands — although of course modified AT commands. And then the most annoying approach, but maybe also the most interesting: we can tell the modem to set up a data connection to the IMS APN and interface with the mobile operator's services at the network transport layer, and it becomes a real mess of standards and protocols. Basically that's the end goal, but we wanted to show you how the voice-over-LTE stack looks, and that's the picture. It's not only voice over LTE; here is the full 4G stack, and where voice over LTE sits in the network — just a second. So this is the TCP/IP network stack, this is the transport protocol used for the 4G network and the voice-over-LTE network, and then we're back to the stack we showed you previously. So our end goal is to implement this in software, so it would be open source for every distro to use, but as you can imagine, it's quite a challenge. It can also allow for some interesting things: there is a project to perform the SIM card authentication and set up an encrypted IKEv2 IPsec tunnel to the mobile operator's endpoint from your laptop. It makes debugging a bit easier if you can just use the phone for authentication and then set up the voice-over-Wi-Fi connection from your host PC. And there are multiple projects that try to implement open-source telephony. The most prominent one is currently Doubango for the IMS services, but sadly it has been unmaintained for the last five years or so.
It's in a working shape, but you'll probably need to iron out a lot of rough edges later on. However, courtesy of Mohammed, who also wanted to make it here to Brussels but was suddenly refused a Belgian visa, we have a screenshot of Doubango connecting to the mobile operator's endpoint via an IPsec tunnel for voice over Wi-Fi. It tried to receive a call, though it couldn't, because the audio part wasn't really working, but at least it could receive an SMS. The symbols you see here are because SMS uses UTF-16 text encoding and the console is UTF-8; it did receive something, and that's where we currently are. To summarize the state of things: we have voice over LTE working, device-specific, using the Android radio interface on Sailfish OS and Ubuntu Touch via the Sailfish OS plugin, but only for specific Qualcomm devices. We have something cooking for MediaTek, and we tried the third option of implementing the IMS services in software. Both of them are possible, but we are not there yet. Since Marius is not here, he cannot speak about all the operator weirdnesses he encountered along the road, but we are open for discussion, and if there are other mobile projects that want to get voice over LTE working, it would be nice to see how we could collaborate. Do we have any questions? Maybe in the chat room? There's a question. Can you pass the mic? Okay, so you are Ubuntu Touch, did I get that right? Who is developing oFono? I was wondering who is pushing these kinds of efforts forward. That's an interesting question. We have multiple forks of oFono. The upstream one... I don't know who the developers of the upstream one are. It was sponsored by Intel at some point, for MeeGo, but now it has some community maintenance.
But the oFono version we are using is developed by Jolla for Sailfish OS, and it's sadly heavily forked from upstream oFono, but there have been efforts by Adam Pigg to bring the latest oFono changes back into the Sailfish OS fork, so it's closer to upstream oFono, and it can be used for the PinePhone port of Sailfish OS. And the oFono binder plugin I've been talking about has been developed by Slava Monich, also inside Jolla. Is there any cooperation with Jolla, or are you just taking their stuff and developing it forward? I have a Sailfish phone myself, so I'm interested in this from the user perspective. It never worked for me, by the way. Obviously, the stuff in the fork is open source, and when it's possible we try to make upstream MRs, but the code bases have quite diverged, so currently we are taking from their oFono and building on it. There's a question in the chat. Somebody asked: on the Librem 5 I learned that it can be very carrier-specific whether VoLTE works or not, and that carriers whitelist or blacklist specific modems. Is there anything we can do in this regard, like spoofing modems? So, there are multiple carrier-level specific things. First, each modem has vendor-specific configs provided by its vendor. For example, on Qualcomm you have the vendor firmware_mnt partition, which has an image/modem subfolder, and for many carriers there is a carrier-specific modem firmware configuration, and it's very much a signed black box; we cannot do anything about it. Of course, we can try to load a configuration from a different carrier and whatnot, but as Alan from Sony would say: do not do this, you will break the carrier's network. I think the detection of the modem at the carrier level is mainly done by a few parts. First is the IMEI of your phone, which you cannot spoof in most cases, and there is also the user agent. The user-agent part, used when connecting to the IMS service at the network stack level, can of course be spoofed. Okay, thanks. Any more questions from the room?
Yes, at the very back. One sec. So, a bit related to this: are you encountering any pushback from carriers, because you could potentially be messing with their stack? I guess we are too small for carriers to care about us, unless we break something too badly — so not at the moment, at least. Do we have another question in the room? Yeah. Hi, just by chance I saw on the schedule for later today there's a talk about providing VoLTE using OpenSIPS. Have you heard of OpenSIPS, and is that interesting for us? I haven't heard of them, but it would be nice to check the talk and see if it can be run on Linux. Just to expand on the previous question a little bit: in order to not have problems with the carriers, I'm also trying to set up just a 4G network with a software-defined radio — a private one — so we can test whatever we like without breaking stuff. Okay, maybe from the chat again: are there any plans to upstream oFono changes to the kernel.org oFono version? I don't know, I cannot speak for the oFono developers on the other side. Okay. Is there another question? Okay, then I guess we close it. Give another round of applause, please.
Universal Serial Bug - a tale of spontaneous modem resets
Okay, thank you all for coming. The next talk is Universal Serial Bug, a tale of spontaneous modem resets, from Sebastian. Give a big round of applause, please. So hi, I'm Sebastian Krzyszkowiak, also known as dos, and I have many hobbies. I make games, for example — maybe you have played some of them — but among the other hobbies there is also mobile GNU/Linux, which started many years ago when I got an Openmoko Neo FreeRunner, and eventually I was contracted by Purism to work on the Librem 5 phone, which is this chunky boy here. Within this device there is a USB-connected cellular modem on an M.2 card, the Broadmobi BM818, and this is the main character of our talk today, because we had a problem with it. The problem manifested itself in this way: sometimes, occasionally, seemingly at random, the modem would just disappear from the bus. It would be just as if it was unplugged from the socket, and it would come back right away. Even though it did come back, it was still pretty disruptive, because the network interface would go down and the audio routing would be torn down if you were in a call, so this wasn't really great. The modem wasn't crashing, it wasn't rebooting, because it maintained its state, at least some of its state; it was just as if you would pull the plug and plug it back in very quickly, while external power stayed connected to the modem. And there were also other instances where the modem wouldn't come back, or where the whole USB bus would actually die together with it; however, we won't be talking about those. Even though they were arguably worse, they turned out to be related but separate issues that weren't as technically interesting as those resets turned out to be.
So this talk will be some kind of debugging case study, and I would just like to talk about how we identified the issue, how we debugged it, and how we worked around it in the end. At the start, I would like to note that this is not some groundbreaking research, this is not a new discovery, because it turns out this was known for ages already; but I think it's still not common knowledge, and it turns out that it can still bite, so I thought this would be an interesting thing to talk about and to share. In order to understand what's going on, I'll quickly go through the topology of the devices on the Librem 5. We have two M.2 slots inside: one of them holds the cellular modem, and the second one the Wi-Fi and Bluetooth card. And there are two USB controllers: one of them goes to the USB-C port and supports role switching, and the other one is connected to the internal USB hub and therefore works as host only. The internal hub is a USB2642, which has three downstream ports. One of them is internal, as this hub has a microSD reader built in. Another one — the one we will be interested in today — is the modem's port, which goes to its M.2 slot. And there's also a third port that goes to the Wi-Fi M.2 slot; however, none of the cards that we use on this phone actually use USB there, they use different interfaces, so this third port effectively remains unused.
So, Universal Serial Bus. I'm just going to assume that everyone here knows what USB is — we have all used it — so I won't read Wikipedia definitions. I will, however, go through some of the properties of USB, either to remind you or to make you aware of how this works on the wire. The first thing: devices can be suspended. This is a power management thing; you can put a USB device to sleep. Theoretically all of them can be put to sleep; not all of them react well to that. The specification says they should, but yeah, reality is different. And there are two ways in which you can suspend a device: you can either selectively suspend a single port, or put the whole bus into so-called global suspend. Another thing is that no device on the bus speaks unsolicited. Every communication is actually initiated by the host: it's the host that keeps polling each of the devices for whether it has something to say or not, and the device only responds to what the host is asking it for. There is one exception: when a device is suspended, it can actually signal that it wants to be woken up, but that's the only thing it can signal on its own. One interesting thing, which I think not everyone is aware of, is that all USB hubs are technically smart: a hub is, on its own, a proper USB device that you can talk to, that you can send commands to, and that can respond and send some status. The features that you can control this way vary, so not every hub will, for instance, provide power switching control; however, this is exactly how suspend is implemented: you send a message to the hub, and the hub then suspends the port.
On the wire, USB works with two wires that form a differential pair, and on two wires you can have four states; however, one of them is illegal in the specification — USB doesn't use it — so we are down to three states. They are called J and K, the two states where one of the wires is high and the other is low, and SE0, which is when both of the wires are low. There are some differences between the various speed modes of USB 1 and USB 2; we won't be going into newer versions, as they are different, and the modem here uses USB 2. However, the differences between USB 1 and 2 are small: the states are similar, they use different voltages, but logically it's basically the same thing. So, let's go back to the bug. At some point we noticed that those modem resets were somewhat dependent on movement or signal strength. An easy way to trigger them was, for instance, to ride a train: you could often see the cellular connection icon just disappearing for a moment, or if you were downloading some file, the download could drop out, and that was pretty annoying. Also, in some places it basically never happened, like at my desk where I worked on it, and it quite often happened in my bedroom, for example, where I would wake up to a bunch of resets having happened overnight. So, in order to look at those issues, we have to check some logs. I have shown them earlier, but that's not enough, and Linux has this pretty useful feature called dynamic debug: pretty much all the kernel drivers are sprinkled with debug print messages; however, they are by default compiled out for performance reasons. But you don't need to recompile the kernel to put them back in: they can be dynamically patched in, and this is how you can do it. Using this interface, this invocation tells the kernel to re-enable all the print statements from the C files in the drivers/usb/core directory of the kernel source tree.
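The invocation described is, to the best of my knowledge, of this shape; it needs root, a mounted debugfs, and a kernel built with CONFIG_DYNAMIC_DEBUG:

```
# Re-enable the compiled-out debug prints for every C file under
# drivers/usb/core at runtime (+p enables, -p disables again).
echo 'file drivers/usb/core/* +p' > /sys/kernel/debug/dynamic_debug/control
# ...reproduce the bug and read dmesg, then turn the prints back off:
echo 'file drivers/usb/core/* -p' > /sys/kernel/debug/dynamic_debug/control
```

The same control file accepts filters by function, module, or source line, so you can narrow the output once you know which code path is interesting.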
So, we did that, and it told us a bit more. This is an example of such a reset happening, and it turns out it happens when the device wants to wake itself up from USB suspend. Here we can see the status given by the hub for its ports. Port one is the microSD reader, and we can see the status 0507, which means that it is connected and enumerated properly and that it is suspended, and change zero means that nothing changed. Port two is the modem, and here we can see that it's different: the status ending in 01 means that it is connected, but it didn't actually go through the whole process of connecting, so something happened there; it is also not suspended. And change 5 tells us that both its suspend status and its connection status changed — so it's just as if it was pulled from the socket and quickly put back in at this point. To compare, this is an example of when things go right. After the port has been resumed, we can see that the status is 0503, which is different from port one, because port one is still suspended and port two has already woken up, so there's a 3 at the end, and change 4 tells us that only the suspend status has changed. So this is how it looks when it works fine. That told us something, but not much. There is another feature called usbmon, which can be used to sniff the traffic on the USB bus, and it can be used with common tools like Wireshark; however, it still didn't tell us anything new — it just looked as if the device was disconnected and put back in, so not very useful at this level. We have to take a few steps back. The first Librem 5 phones shipped to the first customers in December 2019, and the issue about those resets was filed, actually by myself, in August 2020, so there was plenty of time to notice this issue, and it hadn't been noticed earlier, so it's safe to assume that it wasn't there initially and just came up later.
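As an aside, the status words above can be decoded mechanically. Here is a small helper of my own (not from the talk), using the wPortStatus bit definitions from the USB 2.0 hub class; in the change word, the same bit positions mean "this status changed":

```python
# Decode the hex status/change words from the kernel's hub debug output,
# e.g. "port 2, status 0101, change 0005". Bit values follow the USB 2.0
# wPortStatus definitions (section 11.24.2.7 of the specification).
STATUS_BITS = {
    0x0001: "connected",
    0x0002: "enabled",
    0x0004: "suspended",
    0x0008: "over-current",
    0x0010: "reset",
    0x0100: "powered",
    0x0200: "low-speed",
    0x0400: "high-speed",
}

def decode(word: int) -> list[str]:
    """Return the names of the bits set in a wPortStatus/wPortChange word."""
    return [name for bit, name in STATUS_BITS.items() if word & bit]

# The healthy suspended port: powered, high-speed, connected, enabled, suspended.
print(decode(0x0507))
# The misbehaving modem's change word: both connection and suspend changed.
print(decode(0x0005))
```

Running `decode` on the three values from the logs (0x0507, 0x0101 with change 0x0005, 0x0503 with change 0x0004) reproduces exactly the reading given in the talk.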
Looking at the state those first phones shipped in: USB power management was already enabled in the software running on them; however, it turned out that the driver for the SD card reader kept the USB hub active all the time. It was basically polling it for media change status, and that's why it never suspended — the whole USB hub was kept active. That was fixed in August 2020, and there is a somewhat lengthy thread on the Linux kernel mailing list that you can follow if you're interested. And there was also another thing: at some point I noticed that ModemManager polls the modem for signal strength every 30 seconds, and I wanted to change that, because it's not very nice on the battery, and make it instead listen to the messages coming from the modem whenever the signal strength changes. I got that working for the first time in the context of the Librem 5 in August 2020, and later I noticed that with this change the resets popped up more often. Without this change they were still there once the hub started suspending, but not as often as with this patch. So now we know that this is related to power management, and it turns out that disabling the suspend makes the issue go away, so yay!
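For reference, "disabling the suspend" here boils down to the USB runtime power management knobs in sysfs; a sketch (the `1-1.2` device path is a placeholder for wherever the modem enumerates on your system, and the commands need root):

```
# Keep the device awake, which makes the resets go away (at a power cost):
echo on > /sys/bus/usb/devices/1-1.2/power/control
# Re-enable runtime PM, and make the device suspend as aggressively as
# possible (useful for making the race easier to reproduce):
echo auto > /sys/bus/usb/devices/1-1.2/power/control
echo 0 > /sys/bus/usb/devices/1-1.2/power/autosuspend_delay_ms
```

These are the same toggles the rest of the talk plays with when trying to turn the intermittent reset into a reliably reproducible one.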
However, doing so costs almost half a watt, so not so yay, and basically this was the main reason behind the poor reputation of battery life on those devices when they first shipped. So power management is essential and it must be kept on; we just have to find a way to solve it without disabling suspend. And there was one vital observation, I think Elia said that she observed it first: the issue only ever happens if the hub has just been suspended, never if the hub has already been sleeping for some time and then the modem wants to wake up. It's always: the hub goes into suspend and right away the modem wants to wake itself up, and things go wrong. So this started to smell like some kind of race condition. And what do we do with race conditions? We start playing with some timeouts, if not in hopes of fixing it, then maybe to make it happen all the time, just to learn something about what's going on. Martin Kepplinger was earlier working on that other issue that made the modem not come back; he had some progress on that, but he didn't really make progress on this one. When I took it over, I based my work on his, tried to figure out what's going on in the kernel USB code, and started changing some timeouts. Eventually I figured out that this wasn't going to help, because at the earliest possible point where we could query the hub for its status, it was already telling us that something wrong had happened, so that didn't really help. What really helped was finding out that you could reproduce it by pinging the phone: if you pinged it over the cellular network interface and set the packet interval just right, I think it was a bit above two seconds, you could actually make the modem reset this way, so this helped to investigate it. And at some point I also started playing with a USB M.2 adapter, to pull the modem from the phone and put it into other kinds of USB sockets on other devices; the idea was to identify
whether it was the hub, the SoC, or the modem itself that caused the trouble. And I found out that, with the listed kernel modules for the modem and the sleep timeouts all set to zero, I could make it go into some kind of reset loop; it would basically reset every second or two and keep resetting. At some point I noticed that when it was plugged into some USB hubs, and I grabbed pretty much all the hubs I had in my house, some pretty ancient ones as well, with some of them it never reset, I couldn't make it reset, while with others it was pretty easy. And whenever it was connected to the host directly, with no hub in between, it always worked, it never reset. That even applied to the port on the Librem 5 itself: when it was plugged into the USB-C port, the resets were never there. So it was time to start reading some specs, to find out what's going on, or what should be going on. It turns out that a USB device enters the suspend state after three milliseconds of no activity seen on the bus, and this can happen in two ways. You can send a message to the hub to enable the port suspend feature, and then the hub stops sending frames to that port, so the device doesn't see activity and suspends itself. Or you can stop any communication on the bus, which is called global suspend, and then all the devices on that bus see no activity and go into suspend. When the device detects that the data lines have been in the idle state for at least three milliseconds, and the high-speed idle state is SE0, it must revert to the full-speed configuration, which is J, so D+ high if I remember correctly. Then it must sample the state of the line, so it checks what the hub or host has asserted, and if it's full-speed J, then it continues with the suspend process. This is required because SE0 is also a reset signal: if at this point the line stayed at SE0, it would mean that this is the default state the bus has been put in, and the device must reset. But if it's J, then
it means that a suspend has been requested, so the device then asserts J itself and stays in J. So now we know how a suspend works, and how about the resume? The host can resume the port at any time; it must send the resume signaling, which is K, for at least 20 milliseconds, and after resuming the bus the host must resume communication within three milliseconds, because otherwise the device would go into suspend again. And what if it's the device that wants to wake itself up? It cannot wake itself up until it has been in suspend for at least five milliseconds, and then it can, and it must hold the resume signaling, which is still K, for at least one millisecond, but for no more than 15 milliseconds. The controlling hub, which is the hub that actually handles the resume of the device that has been suspended, as there might be more hubs in the tree, must rebroadcast that upstream within one millisecond and ensure that it is signaled for at least 20 milliseconds, so it kind of takes over that signaling. So now it was time to get dirty. Fortunately I didn't have to do that myself; Eric Kuzmenko, who is the hardware guy at Purism, did it for me, soldered some wires and put a differential probe on them in order to sniff what's going on electrically on the wires, so this could then be seen on an oscilloscope and recorded. And this is an example of what's going on.
We can see here at the beginning some kind of high-speed communication, as it's a lower voltage than full speed. At this point we can see that the modem went into suspend, this is the J state, for some time, and then here we can see the K state, which means that it was either resumed by the host or it wanted to wake itself up, and it cycled this way for some time, and eventually something went wrong here. So, to zoom in: what happened here is that there was some kind of high-speed communication, it stopped for three milliseconds, at which point the modem went into suspend, and there was a J signal for another three milliseconds. Then it went into the K state, so we can assume that the modem wanted to wake itself up, and it lasted about 20 milliseconds, but then the bus went into SE0 and communication did not resume, it stayed at zero, at which point, after another three milliseconds, the modem just suspended itself again. So this is somewhat informative, but still not enough. My hypothesis at this point concerned the specification's requirement of a grace period of five milliseconds before sending a remote wake-up request, but I wasn't quite sure whether the wording wasn't ambiguous, because it says that the device needs to stay continuously in the idle state for five milliseconds. But if we check here, we have two idle states: there is the high-speed idle state for three milliseconds, and the full-speed idle state for another three milliseconds, so where is the point where it starts counting? However, besides the English description, there is also a more formal state machine description in the specification, and after deciphering that, it turns out that both of these idle states actually count as one continuous idle state, so this probably wasn't it.
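The timing rules walked through above can be collected into a small sketch. This is only a model of the numbers as quoted from the spec in the talk; the constant and function names are mine:

```python
# Sketch of the USB 2.0 suspend/resume timing rules just described, as plain
# checks (all values in milliseconds, taken from the talk's reading of the spec).
SUSPEND_IDLE_MS = 3          # bus idle time after which a device suspends
REMOTE_WAKEUP_GUARD_MS = 5   # device must be suspended this long before waking itself
DEVICE_K_MIN_MS = 1          # device-driven resume (K) must last at least this...
DEVICE_K_MAX_MS = 15         # ...and at most this
HUB_K_MIN_MS = 20            # hub re-broadcasts K upstream for at least this long

def remote_wakeup_allowed(suspended_for_ms: float) -> bool:
    """May a suspended device start driving resume (K) signaling yet?"""
    return suspended_for_ms >= REMOTE_WAKEUP_GUARD_MS

def device_resume_valid(k_duration_ms: float) -> bool:
    """Is a device-driven K pulse within the window the spec allows?"""
    return DEVICE_K_MIN_MS <= k_duration_ms <= DEVICE_K_MAX_MS
```

In the failing trace above, the modem's roughly 20 ms K pulse is the hub's takeover of the device's 1-15 ms request; the failure is what comes after it, not the pulse itself.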
So we go back to getting dirty, and this time, instead of just sniffing what's going on between the modem and the hub, we also sniffed what's going on between the hub and the phone's processor at the same time, which required quite interesting contraptions to be made, but it worked, and we got some data. This is an example of things going wrong. We can see some USB micro-frames here, so the host polling the devices, then some actual communication, and then nothing for three milliseconds on the modem port. At the bottom we can see the part between the hub and the SoC, and there the micro-frames continue. The modem goes into suspend, and after, I think here it was a bit over two seconds, it wants to wake itself up, so it asserts K, and the hub takes over; then 20 milliseconds later it stops. But what happens at the bottom? The micro-frames continue while the modem is suspended, and when it starts to wake itself up, the communication still happens, until this point, then it stops. This is the point where the hub has been suspended by the host, and then after three milliseconds the hub went into the suspend process by itself. And what happens here is that at this exact point the hub started to wake itself up; however, at this point it should also start forwarding frames from the host to the modem, but the hub itself was still waking up, so there was no data to transmit, and it all fell apart at this point. I started looking closely into the specification and following the state machine, and I couldn't really figure out what the hub was exactly supposed to do in this case, when the upstream-facing port went into the suspend state while a downstream-facing port was already in the resuming state, and I wasn't sure whether it was my misunderstanding or what it was. At this point in time the host has no way to know that the downstream-facing port is already attempting to wake itself up; if here we
would query the status of the port, it would say that it's still suspended; there was no indication, and that's actually how it works in the spec, so that information only becomes available when the port has already finished resuming. So now I knew what was going on, and I knew what to put into the search engine, and I found this email from many years ago from Alan Stern, who is a guru of the USB and power management subsystems in Linux, and he stated that the USB 2 spec does not take that possibility into account. So Alan basically validated my suspicion, years before I made it, so at this point I could safely assume that my suspicion was true. And what's worse, that mail ended with: I don't know what we should do, suggestions, anybody? There were some replies, but it didn't really go anywhere. However, that mail pointed to an erratum, and the erratum said that there is a very unlikely possibility of a race condition, and that this issue can be avoided if system software suspends all downstream-facing hub ports before suspending a hub. I had completely forgotten to check the errata; this was the first time I had seen it, and I was so happy that this was the first time I had seen it, because, what the hell, this recommendation of suspending the ports before suspending the hub is exactly what makes this issue happen. And Alan Stern said so himself in his mail, that this erratum is completely bonkers, so I'm glad I didn't see it earlier, because I would have been so confused. So, the workaround, what I did to actually prevent it from happening: I have added a port quirk in the USB subsystem in the kernel. When it is enabled, the port is never actually suspended selectively; Linux only pretends to suspend it, but doesn't actually send the command to the hub. This alone would cause trouble, because if we just pretend that the device is suspended, we stop polling it for more information, but the device isn't actually suspended, so it can't wake itself up. So, to prevent that
from happening, we keep such a quirked port active whenever any sibling port is active as well, and when the hub gets resumed, all ports marked with this quirk are resumed as well. This lets us rely on global suspend: we just stop sending any communication, and all the devices suspend at the same time, preventing this race condition from happening. This works well with the topology on the Librem 5, but falls apart on different topologies: if we added another device, for example on this third port, that also wanted to use remote wakeup, it wouldn't work. There's the code. So what can we do now? This hack isn't really suitable for mainlining, it's really a bad hack, so for now it stays in our downstream tree. However, I believe there is a way to do it that could potentially be upstreamed. It wouldn't be the default, I'm pretty sure, because it would be quite inefficient, but I think it should be possible to have it as an option, so that if you have devices that are reset in this way, you could actually have them work reliably and wouldn't have to disable power management completely. To do so, we would have to ensure that no wakeup-capable downstream port is suspended while the hub goes into suspend. And there's also another thing that made me implement it as this hack instead of a proper solution first: while the proper solution is less efficient, this hack actually gives us some efficiency, because we can skip suspending each device one by one; we just suspend them all at once, and it takes less time. This lets us make the modem go to sleep more often, saving more battery. So that's basically it. I'm available for consulting, so I can turn your money into code if you're interested in having something done in the mobile Linux space, and if you have some questions, like my reviewer had here, you can ask them now, thank you. Great. You already have a question here?
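The quirk described above can be sketched as a toy model. This is emphatically not the actual kernel patch, and all names are mine: a quirked port never receives a selective-suspend command, so the hub can only go down via global suspend once every port is idle, and resuming the hub brings every quirked port back up with it.

```python
# Toy model of the "pretend suspend" port quirk (not the real kernel code):
# quirked ports skip selective suspend, so the only way the whole hub sleeps
# is global suspend, which sidesteps the race condition.

class Hub:
    def __init__(self, ports, quirked):
        self.state = {p: "active" for p in ports}   # what Linux believes
        self.quirked = set(quirked)
        self.commands = []                          # commands really sent to the hub

    def suspend_port(self, port):
        self.state[port] = "suspended"              # pretend in software...
        if port not in self.quirked:                # ...but only non-quirked ports
            self.commands.append(("suspend", port)) # get the real hub command

    def can_global_suspend(self):
        # Global suspend = simply stop all bus traffic once every port is "suspended"
        return all(s == "suspended" for s in self.state.values())

    def resume_hub(self):
        for p in self.quirked:                      # quirked ports resume with the hub
            self.state[p] = "active"

hub = Hub(["microsd", "modem"], quirked={"modem"})
hub.suspend_port("microsd")
hub.suspend_port("modem")
assert hub.commands == [("suspend", "microsd")]     # modem never selectively suspended
assert hub.can_global_suspend()                     # safe to stop all traffic now
```

The model also shows why this breaks on other topologies: a second wakeup-capable device on another port would need the same treatment, and then nothing could ever be selectively suspended.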
Oh, you mentioned the influence of ModemManager on this effect; can this be explained with your findings? Yes, this is because when ModemManager is polling every 30 seconds, it's the host that initiates the communication, but if we switch to unsolicited messages from the modem, then it's the modem that actually initiates it, so it wakes itself up more often, as opposed to the host waking it up, where this issue never happens. Hello, thanks for your presentation; how many man-hours went into this bug fix? Oh, I don't really know; it took many false starts, let's say, and red herrings, so this is obviously just a chunk of it, because I had to fit it into the presentation, but yeah, there were many approaches, and we were really in the dark at the beginning. I didn't know anything about how USB works; initially I had to learn it from scratch, so it took some time. Hi, quick question for you, actually two questions. The first one is: is USB the ideal way to connect the modem, or is there a better protocol that we could be using in the future in another design? It depends on what you have available. On the Librem 5 we could theoretically use PCI Express; however, PCI Express, at least on this SoC, would be much more power-hungry than USB, and USB makes it easy to find such devices that you can actually have on a replaceable card that you can put into the phone, pretty much off the shelf, so the options are quite limited in this space. And the second question, actually on that: when it comes to adding a different modem, this isn't a modem issue, obviously, it didn't come down to which modem you were using, but are you guys looking at releasing a Gemalto modem, because that would be pretty cool? I'm not really a person that has any power in this regard, so I can't really say much about it. We have a question from the Matrix channel. When will it be fixed upstream, hopefully? Hopefully soon.
Making this presentation and submitting it here was actually a way to force myself into going through this again, because after getting this hack done, I just wanted to take a break from all this USB stuff, so maybe soon, maybe not, we'll see. I think it should be pretty simple, in fact. We'll see what the maintainers will say, whether they will be happy to take such a quirk-based approach, or maybe they'll have another idea. We'll see. Are there any proper solutions to this problem, like in the USB specification; for example, are there any hubs that don't have that issue? So, the USB 2 specification never fixed it. USB 3 works in a completely different way, and there are also supplemental low-power modes in USB 2 that could be used and that also don't have this problem, but you have to have a device that supports those modes, and we don't. So we can say that it's fixed, because it's all completely different in USB 3 and higher. And for USB 2 devices, it's all up to the hub and how it's implemented. If it's implemented to the word of the spec, there's a high probability that it will have this issue, but the spec gives you some leeway: there's a minimum and a maximum time for things, and some hubs are faster, and then you may not see this issue happening with them. So yeah, at this point, with USB 2 devices, it's probably down to your luck with what components you are using. I'm working on open source USB debugging tools, sniffers, software, so I'd be interested in talking to you about capturing this as a test case, to make sure that we're able to spot this happening on the wire in the future. Okay. Very nice. Yeah, first another from the chat apparently, then a further one for you. Are there other mobile devices known to suffer from this issue? I relate some aspects of the bug to the Pinebook Pro Wi-Fi. Honestly, I have no idea. This was the first time I experienced this issue and had to basically go through what I told you today. So I don't know.
This was known for years. The email was 12 years ago, and Alan Stern said that this came up in testing. So obviously this came up somewhere, but where it was and which devices were affected, I have no idea. So you mentioned the other USB bug you were facing, where the whole bus died. Did you fix that as well? And can you say, like, two sentences about that? Once again? The other bug you mentioned in the beginning, where the whole USB stack died and the modem didn't come back. Did you fix that as well? And can you maybe say two sentences about what the cause was? Basically, that one was pretty boring. It turned out to be a missing quirk in the host driver that was already implemented but wasn't enabled in the device tree. And at some point, actually, NXP enabled that for all i.MX 8 boards. So this is fixed now in mainline. So please give another round of applause. Thank you.
The Linux Phone App Ecosystem
Okay, our next speaker is Peter from linmob.net and linuxphoneapps.org, and he's talking about the Linux phone app ecosystem. Please have a round of applause. Hello everybody. So I hope everybody can hear me, and yeah, this is my first talk, and I'm really glad to be here. It's amazing that this conference runs every year, volunteer-based, and that we have another room this year for all these great mobile Linux talks. Now we'll have one that's maybe less great, but I don't know. So I think I need to hold that. So this is the important thing: you can use those devices, Qualcomm SoCs or the Librem 5 and whatnot, and you want to touch on all of it, but it has no apps, right? So in theory this could be so simple: you just install Waydroid on your distribution, simple, and then you install F-Droid, free software apps, and then maybe you need some proprietary stuff, so you can do that too, and you have all the apps. Well, you know, I've done that in the past and so on, and microG is amazing and whatnot, but there are always issues, especially with virtualized Android, and so, yeah. There are better and worse approaches, but I would rather go with native if possible, so this talk is only about native apps, whatever that means. But not so fast, let's have a brief agenda. Who am I? Some dumb puns, maybe. What's not in this talk? I don't have a slide for that, because why? And then apps on Sailfish OS, Ubuntu Touch, and the new contenders, so what I do with linuxphoneapps.org, or rather what others and I do with linuxphoneapps.org, because I don't develop any apps like other people do, and I don't add all the apps myself. Can't do it. And then highlights, gaps and challenges, and Q&A, maybe. So, motivation.
We already heard of the three major projects, briefly mentioned, like Sailfish OS and Ubuntu Touch and all these new Linux distributions that have sprung up, which we'll get into later. And I think this is a small space in terms of market share, and on top of that it's heavily fragmented. So maybe there's something to learn: maybe another platform, project, whatever you call it, product, does something different, and that's great, and maybe others can learn from that. And then I wanted to spend some time with Ubuntu Touch and Sailfish OS again after a while, but yeah, I don't know, life happened, so that part is going to be rather thin. Then I had some assumptions at first: surely stuff like email, that's easy; open document protocols, well, maybe quite complicated, but it's there; Matrix, it's there; XMPP, just do it; and then stuff that has free APIs, also, you know, people will do it; and then everything that has an API, even if paid, should also be doable. So yeah, let's get into it. So, Sailfish OS. When, oh, I forgot the introduction part, shit. So yeah, this is my website, it's linmob.net, that stands for Linear Mobster's Network, no, actually not. So this is the logo, you may know it from YouTube, and this is the current homepage, weekly updates, a lot of work. And now, how it started, because I think that part matters a bit. So it started in 2007, and even back then we had plenty of Linux mobile projects, community and others, coming over from the handheld age to the smartphone age: handhelds.org, linuxtogo.org, I don't know if anybody was in those IRC rooms at the time; if you were, great. That's real stamina, what would you call that? So I somehow stayed around; well, I left briefly, because in 2011 we had, like, two major things killed by CEOs, what happened with Nokia and what happened at HP: new CEO, and then, boom, mobile Linux, looked promising, died; also Openmoko.
Now, to get to this talk: I was doing a blog and got totally into this again in 2020 with a PinePhone, and, oh my god, what can you do, this thing only lasts five hours, but hey, I want to use it, so is there a list of apps for this? I forked one, and eventually turned it into this, because the previous implementation would no longer work for those new Linux phones, and it's still pretty bad; I think there's an issue tracker on Framagit, and we'll get to that later, maybe. So improvements are welcome, I'd say, but there's a lot that has been learned, and I think it can be helpful. So let's skip ahead. So, Sailfish OS. Like I just said, Elop killed the N9 and Harmattan at Nokia, and from the ashes rose Jolla, and they introduced the Jolla Phone in 2013, and it was quite modern: Btrfs, well, yeah, who cares, file systems; a Wayland system in 2013, Wayland, really. And then things kept going, with trouble; they surely don't make any more of their own devices, but you can buy a license and bring your own Sony device. And they've got something that's quite interesting for those that need those proprietary bits to close the gap, that's Android app support, not a topic of this talk. So what do they have? There are multiple interfaces to get software. There's the Jolla Store; it requires a Jolla account, has no paid apps, has no web interface, so I did not count those apps; maybe there's an API or something, we didn't look into it, but yeah, it looks quite nice. And that's not the only source of software that's well organized; there's also OpenRepos.net. Now, that one is really old; if you go onto OpenRepos.net, you will see that it lists one app for the Librem 5, or for Phosh, but it also has many apps for the N900, which I think many people still have fond memories of, and the N9, and there's even some development still for the N9, so people are still using that thing today. It has the Storeman frontend for Sailfish OS, also no paid apps; like I said, it lists apps for other projects too, and it has approximately 1800 apps and counting
listed for Sailfish, but I don't think that, with the transition from 32-bit ARM to 64-bit and the long history of four major releases of Sailfish OS, you will be able to use all of them. Now, this is what it looks like, a little bit less entertaining than the Jolla Store, but also, I think, quite fun. And then there's the newest contender, of course, because more options are better, and that's Chum. It has, since recently, a web frontend; it also has no paid apps; it has 170 apps listed for Sailfish, and it includes, and this is for me a total highlight, because it's this cross-project collaboration I'm talking about, some Kirigami apps, by packaging a modern version of Qt, because Sailfish uses Silica for its widgets and is stuck on Qt 5.6, forgot to tell you that earlier, I mean, who wants to talk about those sides that aren't so nice and shiny. But people made it work, and you can run, like, Kasts, or the Angelfish web browser, which is nice, because sometimes you may want a Chromium web browser, because the real web browser on Sailfish OS is Gecko-based, which is also really unique, and there will be a talk about that later on. So yeah, highlights. I did a little impromptu poll on Mastodon, I wanted to do something better, but these are the highlights of Sailfish OS. So if you're using Sailfish OS and you haven't installed those apps, I mean, what are you doing, just take out your Sailfish phone and install them, and maybe enter your security code. And then you can do this nice multitask gesturing thing. I will not go into demoing apps on Sailfish OS; I did that for YouTube and failed miserably, people were making fun of me, not doing that again. So yeah, there's a lot. Sailfish Connect, by the way, integrates with KDE Connect, if that's not obvious. And then we even had Contrac, so if you were like me, having a relative that was in deep danger, that was something to appreciate at the time; I mean, now, no more contact tracing, why would we. So yeah, that was Sailfish; now let's go
on to Ubuntu Touch. It's about as old, if not older, envisioned in 2011; there's a nice quote on the slide. So it was in 2011 that it was announced that Ubuntu would support smartphones, tablets, smart TVs, smart screens, smart watches, head units, whatnot, everything. Maybe peak Ubuntu, I don't know. And then, I left out the crowdfunding part, look that one up for yourselves, they had the first commercial device in February 2015, so like nine years ago by now, man, time flies. And they used Mir, which these days is a Wayland compositor, but back then wasn't; Upstart, because, yeah; and Unity 8, their own convergent thing. Unity 8 is amazing; it's now, you know, Lomiri, thankfully renamed, because Canonical eventually dropped all that great effort because it didn't have market success, so another death by CEO, if you will. But it was picked up by the community, and it could be picked up by the community because it was completely open source, so maybe that's one of those lessons: only trust projects that are completely open source, because then it doesn't matter if they go under. And yeah, the UBports folks are doing great work; the latest release was just a few days ago. And the store situation is also pretty simple: there's the OpenStore. It has a web interface, so you can browse it without even having to touch a device and get an idea of what would be available; it even has ratings, so you can look into whether something is actually working. And it has more apps for the older base than for the new one, so really, those numbers, whether it's about 210 or 215 or 217 by now, who cares about the exact number, that really should improve. The OpenStore has one neat feature; I wanted to put a screenshot of that into my slides, but who has the time: when you install an app from the OpenStore, sometimes, if that's specified, it nags you for a donation to the developer, and, if I remember correctly, it may do that again later on. And I know, nagging, nobody likes to be nagged, me
neither, and nobody wants to feel bad because they don't have the time to fill out the details and do the stuff you need to do for that donation, because it's also complicated, because payments. But I think it's a nice idea, because, you know, giving something back, and not just feedback like "does not work for me, fail, I don't know, this is garbage", you know; maybe communicate in a friendly way, that might help, and maybe donate if there's a way, to keep this going, you know, we need to do that. And then, of course, there are other ways to install apps. So there are Libertine containers; on 16.04 that was totally uninteresting for, you know, all those new apps we'll get to later, because, well, in 16.04 not much was mobile-friendly in GTK land, and neither in KDE land, really. And with 20.04 it's a little bit better, but you need to bring your own environment variables, and the newer development only works on some devices. Snaps: you can install snaps on Ubuntu Touch now. Snaps are known to be controversial, but on a system like Ubuntu Touch, which is also, in a way, immutable, air quotes, and was very early with that, that's another thing that's great. I think it's nice to have another option to distribute software more widely; if Flatpak had been added first, and Flatpak's got a sticker on my little tablet here, that's just what I would have preferred, but it's good to have, really nice, and, well, it needs more work to make it scale properly, but it wouldn't be fun otherwise. Highlights, you must know: if you do a poll on Mastodon, apparently people favor Mastodon clients, it's weird. And Webber, a tool for web app creation; generally Ubuntu Touch has a bunch of web apps, which is great; they have a way to do those, and other projects should do that too, because it's maybe a relatively simple way to make a service seem available from an app store, because people don't think of the web browser that they could use. Then Dekko, a great email client; well, it might
use some work to get GPG on board, but I mean, come on, it's an email client; we didn't even have that when it was under the Canonical throne. That was fun: when I first tried Ubuntu Touch, it was like, what the fuck, because the only email client that shipped was a Gmail web app. Again, whatever, past memories. And then uNav for navigation, and then there are more; some of those really should be brought over. Just some highlights, I think you can read those yourself. FluffyChat's use of Flutter is interesting, because they did not ship GTK and Flutter in that click package, as far as I could discern; they made it a web app, so, Flutter can do web apps, and they went that way, also an interesting hack, would like to see more of that. And then there's an app for scooters, you know, that urban mobility stuff, supporting two services, really great; I don't know whether it works, didn't try; be friendly if you try it and it has bugs. A Tesla app; I don't have a Tesla, no idea. Nostr: nobody needs Nostr, but they have a client, and it worked for me, because I did try it with my blog and whatnot. And then, of course, there's obviously a Telegram client, because, as per the assumptions earlier, the Telegram API is free and it works well. And then gaps, briefly, for this one: Matrix apps, maybe; so yeah, I'm not really happy with that situation. It's interesting: the Element adaptation is something like a hack, some CSS hacks on top of Element desktop, a nice approach, but of course something like that is prone to break; you're basically patching a moving target, and how that goes, just ask all the Android custom ROMs. And then XMPP, of course, and desktop Firefox on Ubuntu Touch, that's one from the poll, yeah, that would be great. Now, the new contenders, and that's the area I'm competent about, which is why I spent so much time on other stuff, to not talk about it too much. Up top you see the UIs: Phosh, or also GNOME Shell on mobile, I could have put another logo there; Plasma Mobile; and then, as a joke, because I'm not going to talk about terminal apps, sorry, Sxmo. It's
awesome, I use it on my PinePhone Pro. And then distributions, you know: Danctnix, postmarketOS, Mobian, Fedora, there's a Mobility SIG there. And that fun icon, anybody know what the icon on the right is? Any hands? Yes? No? It's OpenMandriva; they made one image for the PinePhone, but I had to put it here. Where are we right now... there's some rolling stuff, but also Mobile NixOS, nice to have that too, and of course openSUSE, the lizards are here too. So yeah. And then, of course, how did that all get started? It's all history: 2017, maybe, to 2020, Librem 5, PinePhone, and projects based on desktop distributions, like the ones we saw in that list. And many UIs: Plasma Mobile with Kirigami for apps; Phosh, first with libhandy, a widget library to make GTK apps more adaptive, and then, these days, libadwaita for GTK 4, which really got things going. So that's more of a success than I would have hoped for as a spectator on the sidelines, really impressive. And then the downsides: well, no proper app store solution, ish, hence linuxphoneapps.org. You know, I was really hoping that we wouldn't need that by now, because you don't want to maintain a website that lists, I don't know, 500 apps these days, maybe, including games, and you have to check those, and does it still work, oh no, I don't know, who has the time. So these are all the fun UI frameworks that are used in apps listed on linuxphoneapps.org; most of these don't really matter, and I already mentioned the ones that do really matter, except maybe Flutter, because that's going somewhere; we'll get to that later, this is just an overview. So there are plenty of apps with Kirigami, like 140 apps listed, so Plasma Mobile is going rather strong there. The GNOME side goes a little bit stronger up top with libhandy; I mean, I could also count GTK 4 and GTK 3, but some of those don't really fit super well, you know, only with Phosh's scale-to-fit hack and whatnot; if you've been in that arena, you've seen that
rodeo. It works and it's great to have, but it shouldn't be needed. So yeah, libhandy 66, libadwaita 156; it used to be more in the libhandy camp, stuff is moving over, which is, I think, good to see. I don't know why I've got Ubuntu Components up there; yeah, I think that was from before the OpenStore. And then programming languages: well, I think everybody in here is more competent to judge this than me. I can do a little bit of Python and some CSS and HTML and whatnot, and barely do JavaScript. But it comes with the toolkits, right? There are also some things that I did not know before I started this list: I didn't know that there were GTK apps made with C++, I always assumed that was all Qt. But yeah, you learn. So, looking at the interfaces you can use to browse software, here's one that's really nice these days: GNOME Software. See that fun little thing there that says adaptive? Yeah, that's great, that's metadata. If that were everywhere, I could stop working on LinuxPhoneApps; boy, would I love that. But we're not there yet, so yeah, it's tough. GNOME Software can show that, and then there's even a fork that only lists the apps that are indicated to be adaptive. You know, you can always write anything in metadata, nobody checks, so you could claim your app is super adaptive when it's not, but then you will get that feedback, so don't do that. And also don't do it because otherwise I really can't retire that website at any time. And then Discover: well, it doesn't show adaptiveness, but the thing is, if something is Kirigami, most of the time it should work, except a few things that don't, but you don't need everything on your phone, right? Then there are of course also some Qt Widgets apps that also work, only barely, and if you're lucky. And then metadata, it's beautiful. My day job is in publishing, and in publishing we still love XML, and AppStream metadata is also XML. So this is a common specification that has been extended over the years; I think it started, I
don't know, decades ago maybe, but it's definitely more than one decade old at this point. I have some links on the site, in a blog post, on how to specify those form factors; before that there was an interim specification by Purism. And you can put your licensing in there, you can put a description, release notes, you know, go crazy. And the good thing is, except for the release notes: if I execute a script, I can pull all that nice information into LinuxPhoneApps.org; now ain't that great? So yeah, if you are developing an app, please add metadata, maybe. There's MetaInfo Creator, which makes that relatively easy. I know it's an extra chore and it sucks and nobody has the time, but I think it's really useful for people. And if you maybe want to contribute: run through the code forges, find apps that don't have metadata, and make merge or pull requests adding that metadata. Go for it, thank you. But with that, enough about metadata; sorry about my excitement for XML, nobody likes that anymore, I know. Highlights for apps: I don't think I need to iterate through every app, just some highlights. KDE Itinerary: it's really a better travel companion than the app by Deutsche Bahn, for example, which I know very well, unfortunately, because it generally not only tells you about delays, but it also tells you how to get from the platform you arrive on to the platform you change to. And you want to see that, because it's not always the case that platforms with numbers next to each other are actually next to each other, and that matters if things are delayed, once again. And then Angelfish, a nice mobile web browser, also on Sailfish OS like I mentioned. And then Pure Maps; Pure Maps again, we had that before, it could also have been on the Ubuntu Touch list, Pure Maps is everywhere. Oh, I forgot Kasts, sorry: Kasts is also great, it's really feature-rich, does chapter markers; I like podcasts, sorry. And then highlights on the GNOME side: well, Chats and Calls, because, you know, SMS, MMS, calls; who wants to get phone calls,
but yeah, people do, and if your whole stack works, it works, even as a, yeah, messaging client; again, that's from the poll, and also it's really nice. Tangram, that's a little thing for web apps; you can also use it on your desktop. All of these apps are also available on the desktop, so if you don't have a Linux phone, you can still use all of the apps on the past two slides, and they are also great on desktop, because adaptive apps aim to be great anywhere, and I think the ones listed here all succeed at that. And then of course communication; Railway, like, maybe I travel by train too much, I don't know, can you travel by train too much? No idea. And then Spotify; Spotify Premium again, API magic. And then Flatseal, because that helps sometimes. And then other highlights. So these are apps that are not Kirigami, and I've put in two Matrix clients; maybe I use Matrix too much, yeah, I must use Matrix too much. One is using Qt Quick components, Nheko, and the other one is using Flutter. Special ones, apps that run anywhere on mobile Linux: we had Pure Maps, maps, navigation, whatnot, and Amazfish, smartwatches and stuff like that. And then Kaidan, that's an XMPP client, and yeah, it's only on 64-bit Ubuntu Touch, that's why the asterisk is there. But otherwise it looks like building Qt apps that are cross-platform is possible. Other special apps run everywhere, including legacy platforms, so iOS and Android; well, see the next talk, Flutter, maybe, I don't know, I'm really interested, looking forward to that. And then current gaps: so, if you have time and want to start something, here's the list. We already saw that some of these things are solved somewhere; I think Ubuntu Touch also has a cryptocurrency wallet, if you need that, I don't know, maybe you do. And then of course WhatsApp; yeah, tough. And then more current gaps that I found elsewhere: attention-grabbing social media, I think we need Instagram and TikTok to make this mainstream, and we need Facebook for the grandparents, and we need office
clients to edit fucked-up Word documents and stuff, and Excel, well, you need that. There are some approaches; by the way, that's one Qt app there. And then, yeah, so, gaps. This brings me to packaging. Aside from metadata, you know, releasing an app helps. I'm not explicitly stating that I'm looking at KDeltaChat at this very moment, but I am. It's a nice app, it works; Delta Chat is encrypted chat via email protocols, nice. But no release, so it's not packaged anywhere aside from the AUR and NixOS. Yeah, and also, I mean, maybe Flatpak. So in my little impromptu poll, one answer was, and that one really got me: this app seems great, I'm looking very much forward to it landing in Debian stable. And I'm like, oh god, this person is patient, I should learn something from them. Crazy, yeah. So please, if you maintain an app, maybe do that tagging thing and release it at some point. You know, don't release it while it doesn't work, that won't help anyone, but maybe release it once it barely works, because barely works is still works. Then of course: Flutter apps built only for x86_64 Linux, Electron apps built only for x86_64 Linux, what the fuck, Signal, and then generally apps built only for x86_64 Linux. You know, aside from doing this mobile Linux phone thing, I've been running ARM laptops for years, and, I mean, now with fast ARM laptops it's less of a problem, you can compile shit, but oh god, imagine the Pinebook compiling a big Electron app; I mean, you can't really do that, boy, that's like waiting for stuff to land in Debian stable. Yeah. Then, future challenges: things actually get worse. More and more services disappear behind apps, and in apps that, on the Android side, often require Play Services and thus don't easily work in Waydroid. And that's a big deal for public and private services; these are some German examples, who cares. But yeah, we need virtualized Android maybe, we need to reverse engineer other things, or we need to push governments. Well, governments; I mean, we're in Brussels
here, double capital, Belgium and the EU; and NATO, okay, that's not a state, whatever. But yeah, the obvious technical solution is the web. And then of course, what would I like to see? More cross-project collaboration in the app space; I think I've stressed that enough. Easier access to non-distribution sources in distributions; now that's controversial, like enabling Flatpak from the get-go and maybe even the Snap Store, oh god, people will throw things at me. And then donation begging and other app install things, maybe, a future for the software thingies. And then a bug tracker like Mozilla's Platform Tilt; if you don't know it, they list stuff where they are disadvantaged by large companies. That also goes into that political avenue and could help with Linux phone apps or so. Yeah, as for the site, I want to make it a progressive web app, I want to make search and filtering better, but yeah, who has time. So, conclusions: I hope this wasn't too overwhelming or boring. There may be more apps than you'd think. And regarding my initial assumptions, I think, honestly, despite trying to prove otherwise, people are just scratching their own itch, and that is perfectly fine. So thank you. This is where you can reach me and where the links always are. And if you want to contribute: Framagit has issues with sign-up, so just send that page to the mailing list. And that last link is a really cursed, really bad, my-skills-at-web-development-level thing that helps to create entries. Time for questions. Thank you very much, Peter. Any questions from the audience? Ha, successfully bored them to tears; they're all still taking it in. I'd ask a question. Oh, it's actually not a question, it's a statement. This is David. No, I just wanted to thank you for taking all your time preparing the weekly posts. As a user of mobile Linux, not so much a coder, it has been huge to get me into the community, to keep me in the community, to keep me up to speed with everything that's happening. I realize that one
person can't always do it, but I just want to say thank you. Thank you, that helps to keep going. Another question or statement? Yeah, in the back, we'll take a second. So, I too want to have a Linux phone, so can you please tell me: how much time and suffering do I have to give to achieve that goal? Depends on your approach. So I think it's impossible to answer without knowing your specific use cases and the services you want to use, and how much pain you're willing to go through, or whether you're going to be like, well, you know, I'll wait, right, fine. And also it depends on which hardware you choose, but to get to hardware choices we first need to establish which distribution you'd go with, and then go down some huge decision tree; maybe that's a talk for the next FOSDEM. I have a PinePhone, but it's lying on my desk catching dust. Like most of those, yeah; I've got one of those too, maybe two of those, yeah. So, PinePhone: of course, since I've been playing with postmarketOS, well, postmarketOS is amazing, Mobian is also amazing; I think those are safe choices. And then try to solve your issues one thing at a time, but if you have issues with your carrier and reliability and stuff, then yeah, it gets tough; so maybe a different device, maybe a different carrier, it's complicated. Okay, I'll keep on dreaming. Do that. A question from the Matrix: what do you think of Box64? I think we can use this to run some of our x86_64 programs as a workaround until we have arm64 versions of the binaries. I think in some cases this is definitely useful, and I think people love that, for proprietary games mainly. With some Electron apps you can actually use an arm64 Electron runtime and then run that, so it's not always necessary to go that route, but I mean, why not. I personally haven't played with it, because I am too thick to understand the instructions and don't have the time, but yeah, Box64, also great, just emulate shit, it works. All right, another question? Yeah, there's one; okay, please pass on the mic. Hi,
once again, I echo the earlier comment: thank you very much for your weekly LinMob log of everything that's going on in Linux mobile. But my question is, well, I think it was Purism who, about a year ago, talked about a payment mechanism for developers. I think maybe it's still just a theory, but I don't know; do you know anything about that, about how that might change the landscape of Linux mobile apps? Well, I think it would be very good to have something like that, and they are in a place to do that: as a business, they've got an easier route to it than all these non-profits. But I don't have any news, so I very much look forward to something like that; as far as I know, it hasn't happened yet. Thanks. Please give another round of applause.
Flutter - about the nightmare of cross-platform development targeting Linux mobile
Okay, next up we have Brage, who is going to talk about Flutter apps. Please give a round of applause. Hello, yes, I'm going to talk about Flutter, but not about the fancy ecosystem we were just introduced to in the previous talk; I'm going to talk about development, and rather about the nightmare of development targeting Linux mobile. Because from the perspective of app developers, there's still much work to do until we can properly target Linux mobile with cross-platform software. Who am I? My name is Brage. I've been doing Flutter since it was publicly released in 2018, and I actually work in healthcare, so my work has nothing to do with what I'm presenting here, but I find it an interesting topic anyway. I use ARM, by the way; that's why I talk about Linux mobile. Even the talk is held on ARM, maybe people recognize the laptop here. You can reach me via Matrix, since I do Matrix for work; I am from Famedly. Back to the topic: why would we like to use Flutter? We had a fancy overview of the Linux native ecosystem, of GTK progress, of KDE targeting Linux mobile. Why Flutter? Because Flutter is a decent cross-platform ecosystem. Unlike when I develop a GTK app, I do not uniquely target Linux; I target a giant ecosystem consisting of iOS, Android, the desktops, maybe even the web, and I can potentially also target Linux mobile. It has a fancy plug-in system for developers to access native resources, so we are not bound, for example, to web browser APIs, as we know it from many JavaScript-based cross-platform toolkits. We have an amazing UI toolkit, and that's what people love Flutter for. You have animations in the 2024 style, and it's fun to use. It renders perfectly; it renders at 120 frames per second on high-end devices, unless you have some vendors doing weird things, and then it won't work.
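The plug-in system mentioned here can be sketched in a few lines of Dart: the app calls into the host platform over a named MethodChannel. This mirrors the battery-level example from the Flutter documentation; the channel and method names are illustrative, and a real app needs a matching handler registered on the native side.

```dart
import 'package:flutter/services.dart';

// Named channel shared between Dart and the native embedder; the name
// 'samples.flutter.dev/battery' follows the example in the Flutter docs.
const MethodChannel _platform = MethodChannel('samples.flutter.dev/battery');

Future<int> getBatteryLevel() async {
  // Throws a MissingPluginException if no native handler is registered.
  final int? level = await _platform.invokeMethod<int>('getBatteryLevel');
  return level ?? -1;
}
```

The same Dart call works on every platform that provides a handler, which is what makes the plug-in model attractive for cross-platform apps.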
And it's no JS, no XML; design is part of the code, so no external design resources, which makes it quite fancy to refactor and to use for development. Yeah, Flutter. But let's talk about Linux, and especially Linux mobile. We will talk about both in this talk, but the goal, finally, is: what are the issues with Linux mobile? We have a giant ecosystem, as I already said; there are like 10,000 apps in the Google Play Store made with Flutter, a bit fewer in the Apple App Store. We have a giant ecosystem, and all these Flutter apps could target Linux and Linux mobile too. They are optimized for mobile views; they'd actually be handy to use on Linux. We just need to make it happen. And we have big players in it, namely Canonical and Google. I know they are very popular here, but they use Flutter, especially on Linux, and push it. Unfortunately, that's a problem too: they are the ones pushing it, not the community; we will see that later. Yeah, so what are the key points in targeting Linux mobile, and Linux in general? The first is: okay, if I have the application, it should not have runtime issues, it should be usable on the mobile screen, it should have functional interaction with the user. The second, from the developer perspective, is that I should be able to debug the app, I should be able to compile the app for my Linux phone; there we get to a big problem. And the third thing is redistribution. First of all, I need to redistribute Flutter in order to have a package system which can target Linux distributions with dependency trees, with Flutter as a build dependency. The second thing is that I need to package my Flutter app for Linux distributions. It sounds easy, but it can be hell. This is the first thing we are going to talk about, because that's the most complicated part when talking about Flutter. Afterwards, debugging and runtime, and I will give you a brief showcase of Flutter on Linux. Yeah, Flutter redistribution consists of two parts.
We need to build the Flutter toolchain, so everything we need to develop, and we need to package it in a way we can use on Linux distributions, in order to have it as a dependency. Let's look at packaging first, because that's easier to understand at this point. If we follow the instructions on docs.flutter.dev/get-started on how to install Flutter, we simply clone a Git repository. I mean, that sounds amazing: it's just a Git repository, it should be packageable. You download or clone that Git repository, you execute Flutter for the first time, and you see that: we're downloading lots of things. First of all, we are downloading the Dart SDK. We could use that one as a system dependency, but that's difficult. But then we continue downloading. Let's look at where we are downloading to. I mean, it should be a user directory or some similarly decent location which is user-configurable. Yeah, no, no, no. We download all the stuff into the installation directory. Now imagine how that is when packaging stuff for Linux distributions: it's a bad idea if your runtime is hard-coded to download stuff into the installation folder. That's a bit annoying, but that's something you can work around with patches. Yeah, step by step, what is it downloading? You download the Flutter source, blah, blah, blah, you execute Flutter for the first time, and it's downloading the Dart SDK. Dart is the underlying programming language Flutter is using. And afterwards, it's creating the snapshot of the Flutter tool, so it's compiling the Flutter tool, written in Dart, in order to have an executable of the Flutter tool itself. Remember, you cloned source, and it's first compiling stuff. Then we download the engine, the Flutter engine, and dozens of platform dependencies. And they keep changing each and every release; good luck capturing that. So what do we have? We have fonts, we have Android stuff.
If I use the Flutter tool to target Android development, I have different compilers, one per architecture, compiled with native APIs. I have the web SDK to target the web; I need to download Skia and CanvasKit in order to render on the web. All of this is going to be downloaded. Generic Flutter SDK tools, platform compilers for Windows, Linux, macOS, the platform embedders, for example the GTK embedder on Linux. And then I'm mostly done. Let's look at where these downloads come from, in order to capture them and in order to improve that. A Git release? Nah, nah, that would be too easy. Some package registry, like, I mean, that could be a hosted Nexus or something? Better: Chromium CI, the build system of Google for their open source and proprietary components. They build all these components you need at runtime in order to save time while executing, I don't know. So it's built in Chromium CI and then downloaded at runtime, and you need to capture that somehow. You cannot know what's happening in this Chromium CI; no one knows. We just download blobs from proprietary storage, and this is not very open source of you. It's hell to package; it's hell to work with that. But back to the topic: how can we package that? Now that we know where all this stuff is coming from, we could take all this stuff from Chromium CI. I mean, it's the easiest approach: I just want to have Flutter function, I want to develop my apps, so let's just package the stuff we get from Chromium CI. We could pre-cache it at prepare time of the packaging process: download all these dependencies, create the snapshot and so on, and then just have it packaged in the distribution package we ship. Another option would be, and I won't give a definite answer on it, it's just a prospect: you could also patch Flutter to make this user-configurable. I made a merge request for that like two years ago. It was rejected, because the Flutter authors did not see any use case.
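As a rough illustration of what that rejected merge request was about: the Flutter tool (itself written in Dart) could consult an environment variable before falling back to the hard-coded cache location under the installation directory. The variable name FLUTTER_CACHE_DIR is invented here for illustration; the real tool does not support this.

```dart
import 'dart:io' show Platform;

// Hypothetical patch sketch: allow redirecting the artifact cache away from
// <flutter root>/bin/cache, which is what makes distro packaging painful.
String cacheDirectory(String flutterRoot) {
  final String? override = Platform.environment['FLUTTER_CACHE_DIR'];
  return override ?? '$flutterRoot/bin/cache';
}
```

With something like this, a distribution package could point the cache at a writable, user-owned directory instead of the read-only installation path.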
It's obviously a perfect idea to download stuff to the installation directory. Yeah. But even better, we could build the artifacts ourselves, because actually, when I talk about FLOSS and mobile devices, I do not want stuff dropping out of this Chromium CI that I have no clue about. Yeah. Building Flutter, next topic. I don't know, has anyone of you already built Flutter, like the Flutter engine, the Flutter tool? I guess a couple of people here. I guess you had fun. At this point, very special thanks to Lauren: amazing work on patching Flutter to make it buildable with a less vendored toolchain, more or less. Amazing work. So the next few slides are going to present work actually done by Lauren. Yeah, issues with Flutter FLOSS builds. First of all, you have vendored packages: everything you could use as a system dependency is being vendored from some random upstream source by Google. We do not want that. Yeah, it's coming from Chromium CI, by the way. Also, the Flutter sources themselves are written in a way that's not musl-compatible; existing patches adding musl support to the Flutter engine were so far always rejected. The same applies to existing patches making it compile on BSD. Those are not that functional yet, but there were clear statements: there's no interest in adding support for that, there's no use case in it. So the Flutter team is not willing to accept these patches, this work done there, which is super sad in my opinion. Yeah. So, the toolchain to build Flutter itself: it's basically a gclient solution. So you get the fancy repo, depot_tools from Google, and download the solutions, and it's downloading lots of stuff from Chromium CI. This is a screenshot, can you see it here, from the Alpine package build files for building Flutter. You have, I don't know how many there are, it's 15 patches, only to make Flutter compile.
There you have some patches affecting the engine, so for building the engine, and some for runtime, for the Flutter tool, and in both cases it's a giant overhead just to package this simple tool. Yeah, it's sad. Upstream work? Nah, so far not wanted, it's not appreciated. There was upstream work, until all patches were rejected, as has been known for a while. So far all attempts to improve that were rejected, and that's why there's unfortunately lots of downstream work going on. Yeah, mostly rejected. There we are. So, in order to build Flutter using a FLOSS toolchain only, you first need to patch the buildroot in order to have a functional environment to build the Flutter engine itself. First of all, things like: hey, use my local Python, I do not need your vendored Python; use local libraries and stuff. By default, everything is vendored. Afterwards, you need to patch the engine to, for example, work, or functionally work, on musl. This is not required if you target glibc devices, but the postmarketOS and Alpine people in this room, maybe the Void Linux people, might be happy about that. And the patches are pretty similar to the ones targeting BSD, because Flutter has lots of stuff hard-coded to function on Linux only, though it could in many places work on the BSDs too. I'm talking about BSD because I actually love using BSD, and I'm sad Flutter doesn't work there yet. And afterwards, once you've patched the engine, you still need to patch the Flutter tool. We were talking about those artifacts: we do not want to download the Dart SDK, I want to use the Dart language installed by my distribution's package manager rather than some pre-compiled stuff. At the moment, for example, Alpine has the Dart musl port packaged in order to work around that. So there's no canonical way yet, there's no clean way yet, though there is ongoing work on that. And yeah, so that was the brief overview. I mean, I need to hurry.
The talk is way too short to dive deeper into it. The second thing is debugging and cross-compiling. If I have a Linux mobile device, it usually has another CPU architecture compared to my host machine. Though host machines with ARM CPUs are evolving now, most people still use AMD64 devices, and that's why, in most cases, for debugging a Linux mobile app targeting a device like this, it needs to be cross-compiled. And that's the moment where I wish Flutter were Go, because Go is fancy at cross-compiling, and Dart is like, oof, it's crappy. But wait a second: there are these fancy arguments existing for the Flutter tool, like target platform, and like target sysroot, where you can specify a sysroot of, for example, an ARM64 installation. Let's try that. That's the reply you get. I mean, nice that you added these parameters, but that's not exactly what I expect from something that's shipped. So yeah, you see, there is the aim of the upstream team to support it, but it's too slow. There are other solutions making it better already, and now I'm going away from upstream, presenting some possibilities to get Flutter to debug and to cross-compile for your ARM device, for your Raspberry Pi, for your watch and whatsoever. At this point, I can also recommend the embedded Linux talks on Flutter taking place at this FOSDEM; they are diving deeper into the solutions I will present. Yeah, the shark is very confused by this output. Yeah, if I just want to compile, I could also just use QEMU: it's functional for release builds, compile the stuff on my host, QEMU user static, and I have my ARM binary. Okay, it's compiled, I could ship it. But I actually want a debugging session where I can use the fancy Flutter features like hot restart and hot reload, where I just do flutter run against my beryllium, instead of building it locally, pushing it, not debugging it, checking whether it works by manually checking some outputs.
Compiling is not debugging; that's a huge difference. Yeah, for cross-compiling and debugging, there's no canonical way yet to do that. You can compile Flutter cross-platform using QEMU static binaries; thanks, but that's crappy, we actually don't want to do that. You could also just have your standalone ARM64 build server; that's what I do. I have ARM64 CI devices at home with which I build all the Flutter stuff I build, in order to have test builds targeting, for example, Debian on mobile. Or you use custom devices: Flutter supports custom devices, which means you have configuration files telling the Flutter tool at runtime to run on device configurations that are not actually supported. And there you have projects dropping in: you have flutter-embedded-linux, developed by Sony, for Flutter on embedded Linux devices; okay, that's duplicated, but yeah. It's basically a wrapper around the Flutter tool which enables you to run on ARM devices, also remotely, and you have flutter-pi, which also uses the custom devices API in order to target remote Linux devices. But again, there is no built-in way. There are these fancy projects enabling us to do that, but there's no Flutter built-in way, and that's sad. Yeah. As of now, it's easier the other way: I have a full Linux installation on here, and it's easier if I have my Flutter development environment installed on the device, SSH into the device, and debug on there, because that's way more functional than the typical workflow you know from Android, with the second phone where you just plug in the device and debug. That's not the state of debugging here. It's rather easy to develop on the target device itself if you have a decently powered CPU and a desktop Linux distribution there, or can do it via SSH; that's way more convenient. And you should hopefully see an image. No, that's a joke. I have prepared a short showcase for you. It was number seven. Yeah. That's the showcase of Flutter.
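For reference, the custom devices mechanism mentioned above is driven by a JSON configuration file (enabled via `flutter config --enable-custom-devices`). The sketch below targets a hypothetical PinePhone over SSH; the host name, paths, and exact field names are assumptions and should be checked against the custom-devices documentation of your Flutter version.

```json
{
  "custom-devices": [
    {
      "id": "pinephone",
      "label": "PinePhone (SSH)",
      "sdkNameAndVersion": "Mobian arm64",
      "enabled": true,
      "ping": ["ssh", "mobian@pinephone", "true"],
      "postBuild": null,
      "install": ["scp", "-r", "${localPath}", "mobian@pinephone:/tmp/${appName}"],
      "uninstall": ["ssh", "mobian@pinephone", "rm -rf /tmp/${appName}"],
      "runDebug": ["ssh", "mobian@pinephone", "/tmp/${appName}/${appName}"],
      "forwardPort": ["ssh", "-o", "ExitOnForwardFailure=yes",
                      "-L", "127.0.0.1:${hostPort}:127.0.0.1:${devicePort}",
                      "mobian@pinephone"]
    }
  ]
}
```

With an entry like this, `flutter run` can push the build to the device and attach the debugger over the forwarded port, which is roughly what flutter-pi and the Sony embedded Linux tooling automate.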
In a few moments, you will see me opening a Flutter app. I recorded it while traveling here; that's why it's a bit blurry. That's an example of a Flutter app, and you see the rendering is pretty decent. Scrolling is crappy, because it requires upstream patches to handle Linux touch events as touch events by default and not as pointer events. There it's getting crappy, but from the UI side, Flutter is fancy. And, for example, some Flutter apps ship these patches to get scrolling to work; most others do not. Some vendors ship patches: for example, Alpine again has patches to include a scroll behavior treating Linux touch-as-mouse input as drag-to-scroll input. I think it's broken, I know it's broken since the last few releases, but I think that's because the patch must be adjusted. Originally Alpine had a patch; it's no longer functional, but one could adjust that patch to work again. And a short summary: the first point is that touch is considered as mouse; that's why, if you swipe, it selects instead of scrolling. Scaling is sometimes an issue, but that's an issue everywhere in Linux mobile; these devices have full HD or even higher resolutions, so everything is scaled dozens of times. You saw a GTK header bar, which is pretty annoying; I do not want to see your header bar, but that's again a GTK issue, not an issue of Flutter. And multi-window is pretty crappy, because if I start a new instance, you run into issues with any database connection you have open, if you use local databases, and you mess up your applications. You run into those issues on Android too, but on Android it's handled way better, because by default it does not start two instances of your app. And yeah, that's the state of the art. It's crappy, but there is momentum; there is work going on.
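The scroll-behavior patches described here boil down to a small override using Flutter's stock ScrollBehavior API: by default only touch (and stylus) pointers drag-scroll, so when the embedder reports touches as mouse input, swiping selects instead of scrolling. Declaring the mouse as a drag device restores scrolling. A minimal sketch:

```dart
import 'dart:ui' show PointerDeviceKind;
import 'package:flutter/material.dart';

class TouchLikeScrollBehavior extends MaterialScrollBehavior {
  @override
  Set<PointerDeviceKind> get dragDevices => {
        PointerDeviceKind.touch,
        PointerDeviceKind.mouse, // treat touch-reported-as-mouse like a finger
        PointerDeviceKind.stylus,
      };
}

void main() {
  runApp(MaterialApp(
    // Apply the behavior app-wide so every scrollable picks it up.
    scrollBehavior: TouchLikeScrollBehavior(),
    home: ListView(
      children: const [Text('now drag-scrollable with mouse input')],
    ),
  ));
}
```

The trade-off is that real desktop mouse users lose text selection by drag inside scrollables, which is presumably why distributions ship this as a downstream patch rather than Flutter enabling it by default.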
If you use all the patches, all the toolchains around Flutter, if you actually use them to target Linux mobile, you can target Linux mobile in a pretty decent way. And I hope it keeps going. Some work is going on upstream; unfortunately most of the work is going on downstream, which is pretty sad. That's not very open source of Google. But I mean, it's Google. Yeah, so let's get Linux mobile ready as a cross-platform target, and that was my talk. Awesome. Does anyone have questions? Yeah? You talked about upstream not wanting to support musl, but doesn't Android already have a libc other than glibc, and do they even support that? If we look at Flutter, we are talking about completely different targets with Android and Linux, and the Flutter Linux engine does not support anything apart from glibc upstream. Of course it supports Android, that was what it was initially developed for, but there it's a completely different set of engine components. And yes, they compile with the Android libc, forgot the name. Yeah, Bionic. Any more questions? Yeah? Martin. Your demo video showed a Flutter application running pretty smoothly. What device was that? Sorry? What device was your demo video running on? That was a few years old smartphone from Xiaomi; it's a Xiaomi Pocophone F1 running Debian. No, how is it called? Mobian. Ah, okay. So, freedreno. Yeah. Okay, thank you. If you tried that on the PinePhone, for example, you won't have that experience, because the GL driver is broken. That's exactly what I saw in the last video. I often have that in my issue list, believe me. Any more questions? Yeah, there's one. So it seems like quite a pain to get Flutter to build and compile and get all the way to an app running on a Linux phone. Is it worth it? Is there really nothing better to get an app running on a Linux phone? As of now, I consider Flutter as pretty viable for targeting Linux mobile, because you have this giant ecosystem of existing Flutter apps.
You have thousands of them which could theoretically run on Linux mobile but simply do not target it yet. You have tens of thousands of proprietary apps in the Play Store. Okay, we do not want to have them. But we have dozens of FOSS apps on Android as well, and all of them could run on Linux if we made it easier. And all those patches are usually not patches that I as an app developer need to apply to my project. Okay, I need to apply some patches too, but it's usually the vendors or the distributors shipping the distribution package who ship the patched Flutter. I can easily build a Debian package for a Flutter app. But if I want to do it the fancy open source way, if I want to use Flutter as a build dependency shipped from my package manager, then it's difficult. But I have the vision of getting there one day, where I do not need to use my local Flutter installation from flutter.dev slash getting started, but a vendored Flutter, vendored in the repositories of my Linux distribution. And then it's harder, but it's not work done by the app developers. So I think it's worth it, because it's mostly the distributors who need to do this work. Okay, thank you. Questions? Okay, in the back, one second. Thank you. So not related to Flutter, but you said that it's so painful to get something upstream. From an open source perspective, how difficult would it be, or what would be the challenges, for example, to say: okay, as a community, we fork Flutter and we start supporting this fork, because the maintainers don't want these patches in the official one. And we as open source citizens adopt this fork. How difficult would that be, culturally? Well, forking Flutter entirely would be pretty complicated, because Flutter is a rapidly moving ecosystem. There are many changes upstream, and those could always break your fork, with a giant company standing behind it pushing Flutter development.
So you have on the one side this giant company, namely Google, working on Flutter with a giant community, and you would need to maintain your fork of the entire Flutter system on your own. What I consider more realistic is patching the buildroot and single components of the Flutter ecosystem that you could use as drop-in dependencies when shipping Flutter in a Linux distribution, for example. That would be way easier, and that's also where I currently see the Flutter FLOSS Linux mobile ecosystem moving. So this work is more or less being done, but it's at the beginning stage. But I would not consider forking Flutter entirely as a new framework, like, hey, with this one you can target Linux mobile too, because then you would lose all the big players who already have their apps and continue using Flutter. Thanks. Please give another round of applause.
5G in ModemManager
All right, thank you all for coming. So next up we have a very exciting topic, 5G and ModemManager. Have a round of applause for Alexander. So let's talk about ModemManager. Let me know if you don't see me, because I'm not sure if this is going to work very well. A bit about me first; I think I'm going to keep it like this. I have been the ModemManager maintainer and developer for the past 12 years, and I've also been involved in developing and maintaining the two libraries that we use to communicate with modems, which are libqmi for the QMI protocol and libmbim for the MBIM protocol. I'm now working in the Google Chrome OS team, since two years ago. And this talk is going to be about not only how we're going to add 5G support properly in ModemManager, hopefully, but also how we added 4G, which issues we had when we added 4G, and how we are going to overcome the same kind of issues when developing 5G support. So we will look at what went well and what didn't go that well with 4G support. Before I joined the ModemManager project, there was already support for 4G, in the sense that you could connect the modem, it was using 4G, the modem would tell you, hey, I'm using 4G, and then we would expose it, and that's about it. So we were treating 4G just as a different mode: we had 2G, 3G, and now we have 4G, nothing else. When I joined the project, I started to review the 1.0 API suggestions that were on the mailing list, and the major focus at that time was to support multi-mode devices. At that time we had two separate families of modems: we had the 3GPP modems, GSM, UMTS, LTE, and then you had another family, which was 3GPP2, the CDMA and EV-DO modems for 2G and 3G. 3GPP2 had its own standard for 4G, which they ditched, and then they started to use LTE as the standard for 3GPP2 modems as well. So we had these strange 3GPP-and-3GPP2 multi-mode modems that had to be managed kind of in the same way, but they were very different in nature.
Like, 3GPP modems require a SIM card; 3GPP2 modems, most of them, required some kind of activation with the network to bind your user account to the device itself, and it was a manual or automatic activation depending on the carrier. So there were many different things. And managing these new multi-mode devices, we thought, was the most important thing, but it wasn't, because 3GPP2 no longer exists. So can anyone tell me which main feature of 4G we missed? Because we didn't think of it. Sorry? No. Much more important than that. Actually related, sometimes. So what we missed is the idea that when you attach to the network in 4G, you are actually creating a data network interface between the modem and the network, even if the host hasn't seen it yet. So you actually get an IP address, a full data setup, communication between the modem and the network in the user plane, but the host knows nothing about that. And why did we not catch that? Because most operators didn't really care about it. They would allow you to send a blank APN during the attach, and that was fine for them; they would tell you back which APN you are using. That was one approach. The other approach was that the settings used for the data connection were actually going to be the same ones used for attach. So when you connect, you're actually configuring profile number one, which is the one used for attach in Qualcomm modems. There were lots of assumptions happening at the same time. There was also no consolidated approach to define these settings in the modem protocols. The MBIM 1.0 spec did not have a way to specify attach settings. And many of the APIs that we developed at that time were based on looking at what MBIM-capable modems were doing. So there's a use case where this does not work, which is when the settings are different. And so in 1.10, we added the support to explicitly specify attach settings.
This is the case of Verizon, for example, where they have one specific attach APN and one specific data APN. So now we were able to say to the network: okay, we want these specific settings for registration, and the network will tell us: yeah, you can have those, or you can have a subset of those. You may ask for IPv4v6 and then only get back IPv6. That's a very, very common thing that may happen. And this was added very late, in 1.10, many years after the 1.0 API was introduced. Another thing that we missed in 1.0 was the support for profile management. Up until that moment, the way you connect the modem is that you specify all the settings that you want in the connection attempt. And in 1.18, we added the support to say: we already have a set of profiles, maybe even provided by the modem itself, because when you insert the SIM card in the modem, the modem itself will come up with some carrier-specific settings, with some predefined profiles. This is very common with US carriers. So you insert the Verizon SIM card, the modem boots with profiles already defined the way Verizon wants them. And in that case, you can just say, connect profile three, and that's about it. So we did miss that. We missed some other things, which are maybe not as important as that one. And what did we do right? The first API that we defined for 1.0 had multiple PDN connections in mind from the very beginning. Even if we did not support them in the same way as it's implemented now, at that time we had modems that would expose two network interfaces at the same time, physical network interfaces, and we could choose: okay, please connect this one to this APN, please connect this other one to this other APN.
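The negotiation just described, where the host asks for IPv4v6 at attach time and the network may grant only a subset, can be sketched as a tiny pure function. This is an illustration only, not the ModemManager API; all names here are made up:

```python
# Toy model of attach-settings negotiation: the usable IP families are
# the intersection of what the host requested and what the network granted.
# E.g. request IPv4v6, get back only IPv6. Names are illustrative.

def granted_ip_setup(requested: set, network_grants: set) -> set:
    """Return the IP families actually usable after attach."""
    granted = requested & network_grants
    if not granted:
        raise RuntimeError("attach rejected: no requested IP type granted")
    return granted

print(granted_ip_setup({"ipv4", "ipv6"}, {"ipv6"}))  # {'ipv6'}
```

The point of the sketch is only that the host must be prepared to receive less than it asked for and still bring the bearer up.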
The multi-PDN support that we have right now is based on multiplexing, so we can have one single physical network interface, but then we can say: okay, I'm going to connect three different PDN connections, I'm going to create three different virtual network interfaces. And then the host can assign different data flows to each of these PDNs separately, because you have three different network interfaces, so you can do all the routing logic in the host itself. And this very same support was used to support Qualcomm SoC boards with the IPA driver, for example, which requires multiplexing by default. Now, where are we right now with the 5G support in ModemManager? The picture is very similar to what we had before 1.0 for 4G. We just have the way to say that we are using 5G. We can say that we are using 5G SA networks if we only expose the 5G access technology, and we also have the way to say that we are using NSA, so we are registered in 4G and we will use 5G as an extra bearer when the bandwidth requirements happen. And that's about it. We don't have any other 5G-specific feature for now. What are we missing? So I'm not going to talk about 5G-specific features that apply, for example, in the radio interface, because ModemManager does not really care about any of those; that is completely hidden from the host. We only want to be able to support things that the host is aware of. So one of the things that we are going to try to support is 5G network slicing, which is this important word that, if you investigate 5G, is everywhere. So in 4G networks, there is no clear separation between different types of UEs. A UE is the combination of host and modem. In 4G networks, you don't have any differentiation between different UEs; they are all treated in the same way. In 5G, they do define specific types of UEs with different quality of service requirements. So you may have a UE that wants to have a bigger bandwidth.
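The multiplexing idea above, one physical interface carrying several virtual per-PDN links that the host routes between, can be modeled in a few lines. This is a toy sketch, not ModemManager code, and the interface naming scheme is invented:

```python
# One physical WWAN interface, several virtual links, one per PDN, so the
# host can do per-PDN routing. Names like "wwan0.mux0" are made up here.

class MuxLink:
    def __init__(self, physical: str):
        self.physical = physical
        self.pdns = {}  # APN -> virtual interface name

    def connect(self, apn: str) -> str:
        """Bring up a new virtual interface on top of the physical one."""
        vif = f"{self.physical}.mux{len(self.pdns)}"
        self.pdns[apn] = vif
        return vif

link = MuxLink("wwan0")
print(link.connect("internet"))  # wwan0.mux0
print(link.connect("ims"))       # wwan0.mux1
```

With each PDN owning its own virtual interface, ordinary host routing rules decide which traffic uses which connection, which is exactly the property the talk highlights.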
You may have a UE that wants an extremely low latency. You may have UEs that send data to the network once a day or twice a day, but need to be spread across a very big area. So in order to support all these different kinds of UEs, 5G introduces the concept of slicing. With slicing, you have one single physical network, but it can be logically divided into separate virtual networks, each of them with its own quality of service requirements. And the separation, this is very important, goes up to the base station, which is something that 4G did not have. So imagine this use case: we have thousands of people here, all of them with a phone, and all of them trying to get access to the network. There's congestion, there's a lot of radio interference between all the devices. With 5G, what you gain is that you could have a phone using a slice that has a specific base station only for that slice, and so you get priority access to the network through this slice. And this may happen even with the same PDN. So you have one single APN that you want to connect to, to the internet; you may have different paths from your host to connect to that same APN, based on the quality of service requirements that you have. Now a slice, as I said, is a logical partition of the physical network, and slices are specified, or named, by something called single NSSAI, S-NSSAI. It's a really bad name, I think. So how are we going to support this in ModemManager? There are two main things that we need to support. One is that during registration we want to specify which slice we want to connect to, at the time of registration. And then you may ask for multiple slices, and the network will give you back: okay, you are allowed to use this one, you are not allowed to use that one, and you also have this other one available.
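For reference, the S-NSSAI he mentions is a small identifier: a 1-byte Slice/Service Type (SST) plus an optional 3-byte Slice Differentiator (SD) in the 3GPP layout. A minimal encoder/decoder sketch of that structure; this is illustrative and not anything from ModemManager:

```python
# S-NSSAI = SST (1 byte) + optional SD (3 bytes, big-endian), per the
# 3GPP layout. Function names here are made up for illustration.

def encode_snssai(sst: int, sd=None) -> bytes:
    if not 0 <= sst <= 0xFF:
        raise ValueError("SST is one byte")
    if sd is None:
        return bytes([sst])
    if not 0 <= sd <= 0xFFFFFF:
        raise ValueError("SD is three bytes")
    return bytes([sst]) + sd.to_bytes(3, "big")

def decode_snssai(raw: bytes):
    if len(raw) == 1:
        return raw[0], None
    if len(raw) == 4:
        return raw[0], int.from_bytes(raw[1:], "big")
    raise ValueError("S-NSSAI is 1 or 4 bytes")

# e.g. SST 1 (enhanced mobile broadband) with SD 0x000001
print(encode_snssai(1, 1).hex())  # 01000001
```

The SD exists because one SST value (say, broadband) can cover many operator-defined slices that need to be told apart.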
So this is one simple way of binding, for example, all the traffic of the system to a single connection, to a single slice. This is the case that I told you before: a UE connected to two different slices separately, with both of them going to the same internet APN, and they use completely different virtual network connections on the operator side with different QoS settings. The complex way of using 5G slices is by using URSP rules, where the operator will tell you how you need to route the traffic through the network. So they will give you rules, the UE receives the rules, in this case the modem will push the rules to the host, and then the host needs to do all this separate traffic differentiation and move one data flow through one slice and another data flow through another slice. The UE should not be capable of deciding by itself which slice to use, because this is mandated by the network, and if you try to use a slice that you are not supposed to use, they may kick you out. That's the way the network controls access to the high-privilege slices. In ModemManager, support for slicing will look very much like a multi-PDN connection: we will have virtual network interfaces created for each slice, and that is about it. There are other 5G features that we could consider, but I'm only going to name them here. So, non-3GPP access support, that's basically accessing the operator network through Wi-Fi, for example; you can authenticate to the network through Wi-Fi. And you also have non-IP based 5G connectivity: if you have a network connection between machines using different protocols, you could virtually create a 5G network connection between them without using the IP protocol. Now, what is it going to look like for the next 10 years?
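The URSP mechanism described, network-provided rules telling the UE which slice each data flow must use, with the first match by precedence winning, could be caricatured like this. The rule shape and all values are invented for illustration:

```python
# Toy URSP-style matcher: ordered rules map a traffic descriptor (here
# just an app name) to a slice; lowest precedence value wins; a catch-all
# rule provides the default slice. Values are made up.

URSP_RULES = [
    (1, "emergency-app", "urgent-slice"),
    (2, "video-app", "embb-slice"),
    (255, "*", "default-slice"),  # catch-all
]

def select_slice(app: str) -> str:
    for _, match, slice_name in sorted(URSP_RULES):
        if match == "*" or match == app:
            return slice_name
    raise LookupError("no matching URSP rule")

print(select_slice("video-app"))  # embb-slice
print(select_slice("browser"))    # default-slice
```

Real URSP traffic descriptors can match on much more (DNNs, IP tuples, app IDs), but the host-side job is the same: classify each flow and steer it onto the connection the rules dictate.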
I think we need to focus on what went right and try to avoid the mistakes that we made in 1.0, but we also know the limitations, because everything changes, and what is important now may not be important at all in 10 years. So the planning needs to be done carefully, and actually made in a way that if in the future you need to change course, you can do it more or less easily. The first thing we should be doing is removing legacy features. A lot of the structure in the ModemManager code base is based on this logic of having 3GPP2 devices as a separate type of device. We can remove all that. Same for POTS, the plain old telephone service, like these dial-up modems. We said we would implement them 13 years ago, and we did not do anything. I think it's time to say that we're not going to do it; we had enough time to try. And then, obviously, all the plugins for modems that are very old, we can remove them. There is no point in having them anymore. The focus should be on 4G and 5G modems, and on PCI and USB modems that expose a network interface. We acknowledge that there are other types of modems, that is, serial modems or USB modems that don't expose a network interface, where you can only do AT-plus-PPP connections. Those would still be supported, but, let's say, in life support only: bare minimum data connection setup, and not thinking about adding many features to those. For example, not trying to add 5G slicing to those devices; it wouldn't make much sense. We may want to have a new API. The API that we are using right now has been mostly untouched; we didn't break API in more than 12 years. I think it's time to do some breakage. As I said before, remove interfaces that we don't want, and probably not with the same process as we did for 1.0. For 1.0, I spent one year and a half with my branch until it was mostly ready to be launched. I want to change that. That cannot happen again.
I don't have as much time as I had back then. So the idea would be to do it progressively and start to add the new APIs, at least the basic ones, and so on. We will have registration settings as a first-class citizen in the APIs. We no longer treat them as something automatic, which is what we do right now. We want to configure the 4G LTE attach settings, we want to configure the 5G registration slice settings, and several other common settings that you may have in the modem, like manual versus automatic registration. All those should go in their own separate API, with the idea that in the future we may have more, so it should be open to additions. Regarding connection management, I think it's time to use profile-based connection management as the default whenever possible. There are many reasons for this, especially when you use carrier settings, where the modem gives you all the settings that you need to use. There's no point in trying to add new settings on top of those when you already have them. So using profile management is the way to go there, and enabling multiplexed connections by default. As I said, the primary modems to support would be the ones that expose a network interface, and most of those allow you to do multiplexing, so we should enable that by default. This is one of the main things that I would like to change as well: right now, when you have a modem detected by ModemManager, and it happens to have voice support, even if you're on a laptop that does not have any audio connectivity, ModemManager will try to configure voice-related stuff, call waiting status, all that. It doesn't make any sense to do that if you know that you're not going to use it. So let's move that to separate interfaces, as they are right now, but in a way that you can actively enable. And if there's any application with the intent of using voice capabilities, it can say: hey, ModemManager, please enable voice capabilities in the modem.
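The profile-based connection management he advocates, where the carrier-provisioned modem already holds numbered profiles and the host just picks one, boils down to this kind of lookup. The profile numbers and APN values below are illustrative, loosely modeled on the Verizon example from earlier in the talk, not real provisioning data:

```python
# Pretend the modem exposes carrier-provisioned profiles keyed by id.
# "Connect profile three, and that's about it."

CARRIER_PROFILES = {  # illustrative values only
    1: {"apn": "carrier-ims", "ip_type": "ipv6"},
    2: {"apn": "carrier-admin", "ip_type": "ipv4v6"},
    3: {"apn": "carrier-internet", "ip_type": "ipv4v6"},
}

def connect(profile_id: int) -> dict:
    """Select a stored profile instead of passing every setting by hand."""
    try:
        return dict(CARRIER_PROFILES[profile_id])
    except KeyError:
        raise ValueError(f"no such profile: {profile_id}") from None

print(connect(3)["apn"])  # carrier-internet
```

The design win is that the host stops duplicating settings the modem already knows, which is exactly the argument made above.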
Then we would enable all the URCs, all the unsolicited message support, and everything that needs to be done to support voice, for example. Oh no, that's another one. Yeah. This is the extended wish list. So, things that I would love to have, even if they are extremely difficult. We have the QMI proxy and the MBIM proxy; why not have an AT proxy? Other programs could use AT commands through ModemManager, through the proxy, to do other stuff that does not interfere with ModemManager's own control. If you could have that, it would allow many applications to use AT commands as well. Then we could move our GNSS location support out of ModemManager completely, as a separate daemon. There's no reason for ModemManager to have all this support for configuring A-GPS and injecting extra files into the GNSS module. We do that because the modem has that. But if we had the proxies in place, there would be no reason not to do it outside of ModemManager. And, yeah, Rust maybe for binary parsing of messages and all that; that was something that was already investigated. And that is all I have to say. Thank you very much for this great talk. Do we have any questions in the audience? Yeah. Thanks for the good talk. I was wondering, how do you test all this? What is your CI? So in Chrome OS, we have a lot of automatic testing for the modems that we use, so I do rely a lot on that. When I joined Google, I found that there was a lot of information, metrics about crashes and things, backtraces. I was like, I need to fix all this. But I do rely also on my own testing. I have a home LTE network with srsLTE, Open5GS, and I have my own SIM cards. And that allows me to do a lot of testing that otherwise I would not be able to do. Because all the slicing stuff is also very core-network dependent? Yes, so you might run into problems. Oh, yeah. I know many operators are doing pilots, private pilots and some open ones. I think also in the US, T-Mobile is doing it.
But for example, for 5G slicing, I think that my home network is enough for this kind of testing. Thanks. Next question from the back. Hi. I'm debugging voice calls on my device, and from ModemManager I see messages like gained audio, lost audio, and I have no idea what happens after that. And whenever I try to... So do you use AT commands to control the modem? No, it does it by itself. When I'm trying to get to the bottom of what's going on in the code, I only see interfaces behind interfaces behind interfaces. But where can I find the actual code that makes the audio work? Like, where should I look? Is that a problem? So ModemManager is only in charge of starting the call and hanging up the call, that's all, and accepting an incoming call. Nothing audio related. I mean, ModemManager knows absolutely nothing about the audio path. Who is responsible for getting the audio? It depends on the platform, of course. So if you're using, say, a Librem 5 phone or something like that, then you may need to talk to them. Thank you. Thanks. There was a question from the Matrix apparently. I'm rushing to the Matrix. So somebody's asking: can we anticipate 6G features, such as sharing machine learning data for connection optimization? I have no idea about any of that. I'm still in 5G. Maybe in 10 years we will have the same talk, for 6G. You talked about Rust for the protocol parsing, and how there have already been experiments, and it's on your wish list. So I assume those experiments are somewhat successful. Can you talk any more about what those experiments are? Not much. I mean, it's useful. I think it's very useful. And I still keep finding bugs, for example in the 3GPP PDU parsing, which we wrote 10 years ago. There are still bugs there, nasty memory-related bugs. So Rust is very promising in that regard. Cool. Thanks. One more question, in the back. Thanks for the talk. So the question is regarding the AT proxy.
With all the possible vendor crap, et cetera, how do you plan to decide if a command is going to interfere with ModemManager or not? Is it going to be allow-by-default or forbid-by-default? So that's why we don't have the proxy yet. That's the main reason. Especially because ModemManager handles a lot of crap that manufacturers push through the AT port. The idea would be that, in the same way that ModemManager disables a lot of URCs that it knows may happen, the proxy could do the same. And we would still need to work with known URCs as they happen in the wild. But I hope that manufacturers will start to use other things than AT at some point in the next 20 years. Give a round of applause for Alexander.
Droidian - Bridging the gap between various platforms with convergence
Anyways. So thank you all for coming. The next talk is about Droidian, from Bardia. Please give a big round of applause. Good afternoon, everyone, and welcome. My name is Bardia, as you've heard, and if you've been following our project, you know me as FakeShell in the community. I'm one of the core devs of the Droidian project. And if you have any interest in embedded systems, mobile devices, that's why we're here, obviously, you might be particularly interested. So today our topic of discussion is going to be Droidian: what we're doing, how everything works, how everything goes together, and why the whole project even works. No, I'm sure. Like I said, I'm prepared for that. OK. So who are we? Well, we're a number of FOSS and privacy enthusiasts committed to building a free and open source project and operating system that is user friendly and open, and that can be utilized in different environments, such as phones, maybe even single-board computers, tablets, different things. So Droidian is, as the name states, based on Debian. We take the core of Debian, add our own repository on top of it, and add our own so-called finishing touches. Droidian utilizes a number of different projects. Should I go down like this? OK. That's too far. OK, I messed it up. So Droidian utilizes a number of different projects. Some of the more well-known ones are Halium; we use libhybris and GBinder from Jolla; and we use the stack from GNOME, as you guys may know, with Phosh. And we currently have a selection of devices supported in our official CI, or build system, and I think it should be over 20. We haven't updated that device page, so it's not exactly up to date; it should be 25 or 26. The devices vary pretty largely: different manufacturers, different release dates. We have the OnePlus 3 from 2016, we have the Pixel 3a, the F(x)tec phones, the Galaxy S9, the Lenovo ThinkPhone. The list goes on and on.
So the barrier of entry for getting into Droidian development and porting is fairly low, because there's already a number of devices that do exist, and they cover most of the possible cases in the Android space. For Droidian, one of the main things that people who just get into the project need to know about is our porting guide. The porting guide is mostly split into three sections: the kernel compilation guide, rootfs debugging, and rootfs creation. Kernel compilation is the initial testing and compiling, changing a few parameters in the kernel, and packaging it to get a Debian output, because we need Debian packages to do over-the-air kernel updates. Rootfs debugging happens after the phone actually boots into the Droidian root file system. And last but not least is rootfs creation, because we obviously need to somehow get builds for each device. So how do we actually get from Android to Linux, or what we call Linux? On Android, there's usually the bootloader, LK, loading the kernel, the kernel loading the ramdisk, and the ramdisk does everything to start up the init process of the system partition to actually start the system. And then system mounts a bunch of stuff: product, ODM, vendor, and a bunch of other garbage. On Droidian, we take the same kernel that there was on Android, and we change the ramdisk. We have a modified fork of the Halium ramdisk, which the Halium project and UBports used to maintain. In our fork, we have support for a bunch of stuff that we use that is not in the upstream Halium ramdisk. The Halium ramdisk mounts the userdata partition, which is where Droidian actually resides. We don't use system, which is kind of a bad base, but it is what it is. It mounts userdata, it does a bunch of Android bootloader stuff to get everything up and running, and it starts init, which is systemd, obviously.
So now systemd starts, and systemd starts up all the usual services. We have systemd-timesyncd, systemd-resolved, and all the other stuff. But then we have our own systemd services. We have a service that starts a very small container that runs Android. That Android starts and mounts a bunch of partitions, Android partitions, modem, and everything that the firmware and the drivers need. The vendor script starts, the system GSI script starts, and we get all the drivers loaded, all the firmware loaded, and a bunch of interfaces started from Android. Then we have the usual file system of Debian: there's the user interface, feedback, and the rest. From the Android services, we have hwcomposer, which we use for compositing to the screen. We have audioflinger, well, not exactly audioflinger, it's droidmedia, but ignore that. We have droidmedia for audio and camera. We have the radio interface layer, which, as the name states, is for radio, and a bunch of other services: libperfmgr for power, NXP NFC, et cetera. So all the communication that we do from the Linux side of things to the Android side of things is done through Google's binder pipeline, the binder IPC. And we'll explain how we actually use the binder IPC to communicate directly with the interfaces. From the Linux services, everything looks kind of familiar. There's Phosh, obviously. There's feedbackd for feedback. There's oFono, kind of ancient, and because nothing in the modern Linux stack can actually talk to oFono, we have ofono2mm, which exposes ModemManager interfaces as a drop-in replacement on top of oFono. It's kind of a hack, but we don't talk about that. Yeah. So we have droidian-fpd; it's a fork of the Sailfish community fpd, which is used for fingerprints. We have callaudiod as usual for call audio; again, we have custom backends, because Android. And PulseAudio, again ancient, but Android.
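Conceptually, the ofono2mm shim just mentioned re-keys oFono-style modem properties into ModemManager-shaped ones so the modern desktop stack can keep talking to what it expects. The property names below are simplified stand-ins, not the real D-Bus interfaces of either project:

```python
# A caricature of a compatibility shim: translate oFono-ish property
# names to ModemManager-ish ones. Both name sets are simplified examples.

OFONO_TO_MM = {
    "Powered": "PowerState",
    "Online": "State",
    "Strength": "SignalQuality",
}

def translate(ofono_props: dict) -> dict:
    """Re-key known oFono properties; drop anything we can't map."""
    return {OFONO_TO_MM[k]: v for k, v in ofono_props.items() if k in OFONO_TO_MM}

print(translate({"Powered": True, "Strength": 78}))
# {'PowerState': True, 'SignalQuality': 78}
```

The real shim of course also has to proxy method calls and signals, which is why the speaker calls it "kind of a hack", but the core job is this kind of translation.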
And a bunch of other services. NFC and GeoClue, again, need their own backends, but we're going to talk about those later. So most of the components that we have are not directly used by the user. For the camera, which goes through droidmedia, it's abstracted, and users just see the Droidian camera app. For the modem, via oFono, users just see kind of a ModemManager sort of imposter. For fingerprints, this part is completely customized around droidian-fpd; we just forked the settings and adapted everything. For battery management there's Batman, very funny name, which does the work for battery management. I started that project as a shell script. It was a mistake. So Batman does a bunch of funny stuff: turns off CPU cores, sets governors, sets power save, whatever. A bunch of nonsense. And then we have Phosh, which is the user interface. Again, we maintain our own fork of Phosh, because sometimes stuff happens, stuff breaks, and we kind of have to maintain our own. We have bad experiences. We don't talk about those ones either; we don't say that in public. Droidian needs to have a good image. Then we have the encryption service, again a custom tab in settings, which uses LUKS and LVM2. And the unlocker, which was, I think, initially developed for postmarketOS; we added a MinUI backend through LVGL. Again, custom backends, because Android. I mean, it's the usual. So now, how does everything actually go together? As we mentioned, we have a bunch of custom backends and a bunch of custom plugins. We have the Qt5 camera plugin from the days of, I think, Canonical, which developed it. There's the oFono binder plugin, which was developed by Jolla, nice of them. There's a bunch of PulseAudio modules that allow us to talk to the audio HAL, via droidmedia itself, not exactly audio, and get audio through the hardware working: microphone, speakers, everything. We have gst-droid, which again talks to droidmedia to give us a nice and shiny GStreamer pipeline that we can use for the camera.
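Batman's tricks, offlining CPU cores and flipping cpufreq governors when the screen is off, amount to writing a handful of sysfs files. Here is a dry-run sketch that only computes the writes; the policy (which cores stay online, which governor is chosen) is invented for illustration and differs from Batman's actual logic:

```python
# Dry-run power-save sketch: compute the (sysfs path, value) writes a
# screen-off hook might perform. Paths are the standard cpufreq/hotplug
# locations; the keep-two-cores policy is made up.

def powersave_writes(ncores: int, keep_online: int = 2):
    """Return the writes: offline extra cores, powersave the rest."""
    writes = []
    for cpu in range(ncores):
        base = f"/sys/devices/system/cpu/cpu{cpu}"
        if cpu >= keep_online:
            writes.append((f"{base}/online", "0"))
        else:
            writes.append((f"{base}/cpufreq/scaling_governor", "powersave"))
    return writes

for path, value in powersave_writes(4):
    print(path, value)
```

Keeping the side effects out of the computation (an actual daemon would then open each path and write the value as root) also makes this kind of logic trivially testable, which a pile of shell one-liners is not, which may be why the speaker regrets starting it as a shell script.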
And well, that's pretty much it. As for backends: we can't add plugins to everything, not all pieces of software accept plugins, so we kind of had to hard fork a bunch of stuff. Some of them are not that frequently updated, so that was good luck for us. GeoClue is barely updated, so we just added a hybris backend, slapped it in, and it just works. We have the wlroots hwcomposer backend; I don't even know who started that, I know a bunch of people are involved in it. It's a mess. We have the callaudiod backend, which routes a bunch of stuff through hard-coded values. But hey, it works. And the feedbackd backend, which talks to the Android vibrator HAL through AIDL and HIDL and gets the job done. It's not beautiful, but it works. And for MinUI, as we mentioned for the unlocker, we added a MinUI backend to LVGL itself, so it can draw to the screen, without GPU acceleration, of course. Who needs GPU acceleration in the ramdisk? Anyways. For the boot animation, I think all of this was used by Plymouth; we also have a MinUI backend for Plymouth. I think it started life as the MinUI backend from Jolla, I don't remember. So, to actually talk to the Android services, there are two main pieces doing the job for us: one is libhybris and one is GBinder. Libhybris has a bunch of compatibility layers, and GBinder gives us a way to craft transactions and send them to the Android interfaces. And the whole system, how the whole thing works, pretty much ends there. Stuff's maybe hacky at times, I'm going to admit, but it works, because we use pre-built vendor services and a bunch of stuff that was provided by the vendor itself. Stuff works for now. Maybe in the future too. I'm joking, stuff actually does work. So what is next for Droidian? The services work and the system itself starts up, everything works for the most part. But in reality, one of the main issues of the whole Linux ecosystem is app support.
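The "crafting transactions" that GBinder handles for them comes down to serializing typed values into a parcel and handing it to an interface together with a transaction code. A stripped-down illustration of the serialization idea only; this is not GBinder's API, and real binder parcels also carry headers, padding, and object offsets:

```python
# Toy parcel writer: little-endian int32s and length-prefixed UTF-16
# strings, roughly how Android parcels primitive values. Illustrative only.

import struct

class Parcel:
    def __init__(self):
        self.data = bytearray()

    def write_int32(self, v: int):
        self.data += struct.pack("<i", v)

    def write_string16(self, s: str):
        # character count, then UTF-16LE payload, then a 16-bit terminator
        self.write_int32(len(s))
        self.data += s.encode("utf-16-le") + b"\x00\x00"

p = Parcel()
p.write_int32(1)             # e.g. a transaction argument
p.write_string16("vibrate")  # e.g. a token the remote service expects
print(len(p.data))  # 24
```

The Linux side builds such a parcel, sends it to a named Android service with a transaction code, and decodes the reply parcel, which is the entire bridge between the Debian userland and the vendor HALs described above.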
You don't have apps, let's be honest. And no one wants to develop any either; none of the big companies do. So I guess: start integrating Waydroid better into the system, getting like zero startup time on Waydroid, maybe developing something that replaces Waydroid, again a drop-in replacement. And clean up all the garbage that we added. We have a lot of garbage. It's not pretty. We definitely have to go through everything. At least I do. I'm not a good programmer. We have to refactor a lot of code, clean up a lot of code, see what we have to do. And possibly actually add some new features. One of the actual features that I had in mind, that I have been working on, is wireless displays, which has to go through PipeWire, but we're using an old version of PulseAudio, so it's kind of tough. I don't want to just hack PipeWire in; I'm kind of tired of hacks. So we kind of have to fix up PulseAudio to actually get PipeWire working, and then we can get wireless displays working, because there's an XDG portal for it. So that's one of the things on my to-do list that I actually have some work put into. Face unlock is something that I've been working on for the past two months. We can get face detection working through GStreamer, and GStreamer will actually track as you move your face along. I'm going to admit it's like 3 FPS, but it does detect. And the rest of the work can be done with OpenCV, because not all Android devices have the sensor to do it in hardware. So that has been on my to-do list; I've been working on it. Maybe we can help out other open source projects if they'd like face unlock too. And two other very annoying features that are kind of deal breakers for others: one is MMS. We don't have MMS. I tried many times; I couldn't get it working. MMS is very important. RCS is even more important, but MMS too: at least in Canada and the US, where I live, Android users are always using MMS to talk to the iOS guys. So MMS is very important.
Dual SIM is a very important deal breaker for many, and we have to work on dual SIM. That is a very big priority for me also. We've seen many users who actually looked at Droidian and were like, oh yeah, this is great, but you guys don't have dual SIM, so I'm out of here. That's not exactly the nicest. And besides all that, we still have to work on app support for Linux and the ecosystem. With libadwaita and GTK4 becoming very mature and things working out, I have at the very least been working on porting all the old GTK3 applications that I've been using to GTK4 and libadwaita. Not exactly Droidian-specific, but it will benefit everyone. So that's something. A lot of applications are very slow. The GNOME Settings app, as we all know, is very slow. Much of the stuff is not threaded; everything is running in a single thread. It's just horrible. A lot of code we have, well, I have, will soon possibly become PRs for many different projects, making many things threaded. We at Droidian have a big PR to optimize GTK4. Speeding everything up: we've had a user who was working on a BlackBerry, and he was seeing 70%, 80% performance improvement on GTK4. Because apparently there are a lot of issues in GTK4. Who could have thought? And the very last issue is that we, the Droidian people, don't allow community devices in our build system. So if one of us core devs has a device, it can be made an official device: be added to the build system, get stable builds and nightly builds. But we kind of don't have that for other people porting devices. So we should probably look into having a way to allow community people to port their phones and have them in our build system. I know many community porters have worked on devices, and they saw that they couldn't add them, so they just gave up. And the most important thing: documentation.
And that's something I have to do, because none of the code I wrote has documentation. We have to do a lot of documentation. At least the stuff that I worked on basically has nothing. I just worked on it, I slapped it on, I was like, yeah, it works, whatever. That one has to be worked on a lot. And that is at least my to-do list for now. No. Don't go down. Don't go down. OK. OK. So if you want to contribute to Droidian: via our device page, via our website, via our Telegram channel, which is also synced to our Matrix. I think you can also find the Matrix group for the Droidian project. I don't use Matrix much, but apparently you can have a group that has a bunch of channels in it, I don't know. So you can find us there as well. And one kind of announcement that I have: we have been working towards getting phones with Droidian pre-installed. What a weird sentence. We have been working with an ODM to get Droidian phones, or so-called phones with a Droidian-based system installed on them, and have that be sold, kind of the way PINE64 does it. But it's like, yeah, we as Droidian developers are doing it, so we understand the system and we understand the hardware. So it's going to be much easier to develop on, because we also understand the system itself. So you might want to look out for that. FuriLabs, not Fairy Labs. FuriLabs, please. And possibly the bigger news of this sort of project of getting Droidian-based phones will be coming out in a few months, but you can be on the lookout for it. We have a website at the moment, kind of not exactly the best, still being worked on. We have a survey asking users: if they wanted to have a phone with a Droidian-based system, what would they want? What specs would they want? What would they want the devs to be focusing on, et cetera? So you can expect a Linux-based phone sold on the market in a few months. Thank you. Thank you very much for the great talk.
I know we have a lot of questions in the Matrix, so I'm going to pass it on. So the highest upvoted question right now is: do you have any plans of switching to ModemManager from oFono?
OK. So I have looked into this. I'm going to be 100% honest with you: I have looked into this, and I am by no means a professional. When I tried getting this working, I could never get a ModemManager kind of backend to register a modem over the binder IPC, through gbinder. Again, I am by no means a professional, and this is probably doable. And it would be a huge step forward, which would make the whole modem stack a lot better: it wouldn't have to go through this, and this, and this, a thousand things, before the user sees something working. So yes, it would be great. I spent some time, I couldn't get it working, but it is on my to-do list.
One question. You mentioned that you implemented a wlroots backend, I guess to get Phosh running. Are there any plans, for example, I currently use postmarketOS on my phones, which is actually running a mainline kernel, so I guess it's a little bit of a different situation. But for example, other Linux mobile UIs, like GNOME Shell, the GNOME Shell branch for mobile, stuff like Plasma Mobile, Sxmo: is there a project to get those running on Droidian as well? Or is that the only focus at this point?
So, I actually understand the question, and we get a lot of questions like this, about getting different UIs running. Each UI that uses an underlying graphics library needs its own backend, obviously, because we have to use hwcomposer. And I know that there's Wayfire, which uses wlroots, so that one works fine. There's a bunch of other wlroots compositors that work fine. But as an example, Plasma uses KWin. There used to be a KWin backend for hwcomposer, and it's pretty old, really old. Someone would have to revive that to get it running. I currently don't have the time.
I have a full-time job, and I'm a student. I'm kind of already under a lot of pressure. And for GNOME, which uses Mutter, well, that's a beast by itself. Because KWin and wlroots are modular, somewhat, but Mutter is the opposite. The code for the DRM backend and everything is baked in so hard that it's a very tough task actually adding a new backend, let alone maintaining it. Because no one's going to accept any of our backends upstream, because no one can test them other than us. So if someone spends the time, sure, but for GNOME Shell with Mutter, I really doubt it. Because of Mutter itself. I might piss a lot of GNOME people off, and I use GNOME myself, but Mutter is a mess, at least when I looked at it six months ago. Thank you.
How does Droidian support standard Debian, like Bookworm, Bullseye, .deb files for ARM64 targets?
Well, yeah, you can run the packages. Right now, Droidian is based on Debian Trixie, the testing branch. We also have a branch for stable. Well, we have a snapshot for stable that you can use; it doesn't have many of the new features. That one is based on Bookworm. But any repository you add, any deb repository you add, if the packages are built for ARM64 or the architecture is marked as all, like Python packages and stuff, everything will work. Flatpaks work, Snap packages work. AppImages built for ARM64 work. It's just like a computer. Thank you.
Thanks. Maybe another? Yeah. OK, you. And then another question from Matrix.
Thanks. Just a quick question about the strategy, because you mentioned all these hacks you've built around to get it working. So my initial understanding was that you built Droidian to foster the development of these apps for Phosh, for instance. But now you're trying to also have a phone delivered with it. So does it really make sense to have a device running these, let's say, many hacks from the start?
Well, yeah, that's a very good question.
Well, we're trying to eliminate every single thing that we think is a big hack. But it really depends on what you consider a hack. Is libhybris a hack to you? Then the whole system is built on nothing. But to my eyes, I kind of have a different look at it, and in my opinion, we can slowly get rid of most of the hacks. Again, we have custom backends? Fair enough, but I don't see those as hacks. In my opinion, a lot of those can be cleaned up and made ready to be shipped on a phone sold to customers. So it's not so far gone that I would consider working on it a waste of time. I still think that it is very doable. Give a big round of applause again. Please, thank you.
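On the earlier question about running standard Debian packages: the arm64-or-all rule comes down to the package's `Architecture` control field. A small sketch; the helper name is mine, and the commented `dpkg-deb` line shows where the field would come from on a real system:

```shell
#!/bin/sh
# A package installs on a Droidian (arm64) device when its Architecture
# field is either the native architecture or "all" (arch-independent,
# e.g. pure Python packages).
#
# On a real system you would read the field with:
#   dpkg-deb -f package.deb Architecture
deb_arch_ok() {
    pkg_arch="$1"         # Architecture field from the .deb
    native="${2:-arm64}"  # device architecture
    [ "$pkg_arch" = all ] || [ "$pkg_arch" = "$native" ]
}
```

Usage: `deb_arch_ok "$(dpkg-deb -f pkg.deb Architecture)" && apt install ./pkg.deb` would be the shape of it; Flatpaks, Snaps, and AppImages follow the same native-architecture rule.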
The Journey to Ubuntu Touch 20.04 on PINE64
Hello everyone, thank you for coming and thank you to all the live streamers. A little bit about me: I'm a college student living in the US, and I've been doing a lot of tech tinkering on open source stuff since I was little; there's been a lot of that experimentation in my house. Ubuntu has also been a very common operating system in our house, just as much as Windows or macOS, so I have a particular affinity for it. On top of that, the Asahi Linux project that came out in 2022 sparked an interest in me and reminded me what my mobile devices were capable of running on their chips. So at the beginning I was running virtual-machine Ubuntu images on my iOS devices, but that wasn't native, those were virtual machines. I wanted a native Linux-first device that was also affordable and accessible, and that is where PINE64 particularly stands out. Another important fact is that "Oren" actually means pine, so I've had a particular connection to them, an affinity with them and a dedication to their work. And so what makes Ubuntu Touch on PINE64 different from most devices is split in two ways. One, PINE64's devices are not like most Ubuntu Touch devices: as many of the other talks earlier today have mentioned, Ubuntu Touch runs on Halium kernels as opposed to mainline kernels, which means there are a lot of extra components thrown in the middle to do some abstraction to get a lot of the sensors and the modem and such working. On PINE64 devices we don't have to use that; instead we often have to use our own middleware. And also, Ubuntu Touch is different from a lot of mobile Linux distributions, because almost all of those distributions allow you complete control over your operating system, with a read-write file system and updates as they come.
Ubuntu Touch has a read-only file system to provide an immutability layer, as well as over-the-air updates, so updates happen in big chunks at once rather than as individual packages as they come. So these pieces in particular have made adapting PINE64 devices for Ubuntu Touch a challenge, but a welcome one. Some background starts with the original 16.04 port, which came at a pivotal time for both UBports and PINE64. For starters, there was ongoing work to move from 16.04 to 18.04, although that work was later abandoned in favor of focusing on the jump to 20.04, as the project was focusing mainly on migrating away from legacy tools like Upstart, from when Canonical was developing the project, and towards a systemd-based stack, which the UBports team has done a great job with. They also announced around this time the renaming of Unity 8 to Lomiri, which is still an ongoing process and involved changing the name not just in one place but in every single bit of code, which has caused some incompatibilities, as we will find out later on. The original PinePhone Community Edition came with Ubuntu Touch, as did the original PineTab, and when both of these were developed it was done primarily by one guy, Dalton Durst, who did a lot of work not only for these ports but for the entirety of the UBports team. He was handling a lot of internal infrastructure, which meant that when the team was working on the eventual switch to 20.04, the PINE64 port had to be pushed aside in favor of a lot of other stuff that Dalton was working on. Then another pivotal moment came in 2022, when first Dalton left the development team to go work on other projects, which left the PinePhone port completely abandoned at that point, and PINE64 also came out with the PinePhone Pro Explorer Edition, which was around the time when I started getting interested in the device. Notably, the device didn't have an Ubuntu Touch port, which meant that I had to make one.
And so my process with this port originally began with looking at some of the other builder scripts that were around. Notably there's one linked on the wiki, called the DPA image builder, that taught me a lot about how the structure of the images is compiled, which allowed me to create this chart here. What's important about the PinePhone Pro is that the bootloader lives on a separate SPI chip rather than within the images themselves, which meant I didn't have to pack it anymore, which is a great benefit. We can also use Tow-Boot in particular as our bootloader, which allows us to dual boot using the volume keys or even switch into mass storage mode to flash directly to the device from any other machine. But as I quickly found out, most of the fun was in the kernel, and it didn't work immediately when I booted it, because at the time the PinePhone Pro device tree files were not in the kernel yet, so I had to pull them from downstream. A lot of my kernel work has reflected Megi's work, and it was looking at his work that helped me figure out how to get those device trees in. Once I passed that process I had a booting Ubuntu Touch image, but this was not a distributable image: it was built manually and was heavy. So I had to switch to making a proper Ubuntu Touch port. It uses a very similar process, but slightly different: rather than bootstrapping from scratch, we actually pull a CD image from Ubuntu server and then use a program called debos, which can open a Docker or Podman container and build on top of that CD image to create our final distributable images. And last year at FOSDEM I wasn't here, but an early stage of my PinePhone Pro port was shown off at the FOSDEM stand, and this year I now have four devices, the PinePhone, the PinePhone Pro, the PineTab and the PineTab 2, all running on a much stabler version of the port.
So once I got the PinePhone Pro ported, it was time to move on to the PinePhone, which was still stuck behind on 16.04. I didn't have the PinePhone myself, but I could do some research in the meantime, and so I found out that there was actually no reason why I couldn't support both devices inside my kernel image, which I also learned from Megi's work. Once I had a unified kernel, I also found out that we could use Tow-Boot on the PinePhone as well, which once again removed the necessity of packing the bootloader into our images. I asked someone to try it out on their device, and sure enough it worked, which was wonderful. That meant we had both the PinePhone and the PinePhone Pro up within just two weeks of each other. Shortly after that, the PineTab 2 pre-orders went live, and at this point I was looking to make another port. The UBports team actually reached out to me and asked, do you want us to send you one so you can make the port? I happily obliged, and they also sent me one of the original PinePhones to maintain at this time. The PineTab 2's port was very similar to the other ones, and I had most of the hang of it by this point, but it was too early for a Tow-Boot port to be out yet, so we had to use the U-Boot binaries, which meant I had to go back to learning how to pack that into the image properly. Luckily, besides the bootloader, the rest of the process was essentially the same. Then, after we had the PineTab 2 port, another community member reached out to me and said: hey, I see that you have these other three devices ported, and I've got an original PineTab sitting in my drawer not doing anything, would you like me to send it to you so that you can create a port for that as well? And once again I said of course. Unfortunately, Tow-Boot doesn't work on the PineTab either, because the production run of the original PineTab was quite limited, so the main maintainer of Tow-Boot never got his hands
on the device to create that port. So we used the PineTab 2's process again and just packed the bootloader back into the images, and that left two parallel sets of images: a PinePhone set of images without the bootloader in it, and a PineTab set of images with the bootloader in it. Notably, the PineTab and PineTab 2 use different bootloaders because they have different architectures, so there are individual images for each of those devices. I was also warned about using kernel versions greater than 6.1 on the PineTab, because apparently it would cause a kernel panic and an infinite reboot. I found that this was partially true, but it was a very easy problem to solve: all I needed to do was move a module from built-in to loadable, which allows it to load after the DRM subsystem that it relies on, and then it never hits that kernel panic, because it never starts before it's supposed to. As I stated previously, though, a ported device doesn't mean all of its features are working, so there were a lot of software component hurdles I had to get over to reach the state we're in today. Two of the biggest ones have been rotation and the modem, both of which were due to the niche circumstances of trying to conform to Ubuntu Touch's Halium software stack. In particular, we have the split of what most PINE64 distributions use versus what Ubuntu Touch uses: for starters, ModemManager versus oFono, which has also been mentioned in a few talks earlier. ModemManager generally has a lot better stability with the EG25 modem that the PinePhone and PinePhone Pro use, but with several scripts we were able to get oFono into a similarly stable state. Another of those components was the difference between iio-sensor-proxy and sensorfw.
Sailfish OS also uses sensorfw, and we also use the Sailfish oFono port. The thing with sensorfw, compared to iio-sensor-proxy, is that you have to write your own configuration files for your devices, and it also has to use a second adapter in order to properly read from the IIO buses. And so you can see here on these charts that both oFono and ModemManager can use eg25-manager, which handles the powering and a lot of the data passing to and from the modem, and that was how we were able to get a much more stable modem on 20.04 compared to 16.04. And with the sensor files, even after all of those patches were properly put in and all of our sensors were reading correctly, rotation still wasn't working, and this was maybe my biggest frustration for eight months. Then one day I decided to look in the log files, and I noticed that the display was being enumerated as Unknown rather than DSI; in some places it said that correctly, but in other places it didn't. Sure enough, once I had fixed that enumeration in all of the places where it had to be, rotation was working. And the other big group of struggles was read-only images and recovery images, both of which use a special initramfs script. These two components help provide those OTA updates: the read-only images provide a level of immutability, so that a user can wipe the system into a reset state rather than having to re-flash the whole image, and it also protects the system from too much destruction. And there are also the recovery scripts, which allow the device to switch into an updating mode so that it can install those OTA updates, as opposed to installing updates for individual packages live like most Linux distributions do.
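The mis-enumeration he describes is visible straight from sysfs: each DRM connector appears under /sys/class/drm as `card<N>-<TYPE>-<index>`, so a panel wrongly detected shows up as `Unknown` instead of `DSI`. A sketch of how you might spot it; the function name is mine, and the root is parameterized so the listing logic can be checked against a fake tree:

```shell
#!/bin/sh
# List DRM connectors and their status, e.g. "card0-DSI-1 connected".
# A panel enumerated as "card0-Unknown-1" is the symptom from the talk:
# code that special-cases internal (DSI/eDP) panels then misbehaves.
drm_connectors() {
    root="${1:-/sys/class/drm}"
    for c in "$root"/card[0-9]*-*; do
        [ -e "$c/status" ] || continue
        printf '%s %s\n' "${c##*/}" "$(cat "$c/status")"
    done
}
```

Running `drm_connectors` on a healthy PinePhone-class device should show a `DSI` connector rather than `Unknown`.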
So while the 20.04 PINE64 images currently release as image files, most Ubuntu Touch images ship their updates through tarballs, which is where we are moving towards, and the recovery image is the final component we need to get the tarballs working. Recently we did succeed in getting those read-only images working, and now we can copy much more of the deployment style of many of the other Ubuntu Touch images. Looking forward, we have a lot of different types of images that we can use. We are moving towards 24.04 for the entirety of the distribution, which will likely be around when these recovery and over-the-air images will also be available. This rebase is going to be a welcome one for us, because most of the components that we backported into 20.04 for the PinePhone Pro and PineTab 2 will already be upstream in 24.04, so we don't have to carry them in our repositories anymore. Outside of Ubuntu Touch, we are also working closely with the Lomiri team, which is working outside of regular Ubuntu as well as on Debian, and we are hoping that some of the changes, like the enumeration of those displays, can help fix some of those issues on Debian, with rotation for example. Right now our ports are the closest thing that Lomiri has to stability on mainline, but we are hoping to get that expanded to a more generic set of devices in the near future. And that's about it. Thank you. We have some demos of the devices available at the FOSS on Mobile stand in building AW, so feel free to check those out afterwards.
Great, first question. You talked about the PineTab 2; there are two versions of that, the dev one and the early-adopter one. Is it fixed for both? Yes. Thank you.
Thank you, very interesting. Having heard some of the talks today in this devroom makes me feel like this is the early days of ARM system boards, or even worse, like those days when every game had to ship 36 audio drivers.
Do you envision a future where we have a sort of standard platform, like UEFI on PCs, on ARM? I would hope so. I think that the Asahi Linux project is certainly a push towards that, and I'm hoping that other companies can follow suit.
Hello. Great talk. You mentioned that the PinePhone images are the same image for the two different PinePhones. Would it be technically possible to have non-PINE64 phones in the same image, if they don't require the bootloader, or is there a specific reason why they only work on PINE64 devices? The only reason right now is the kernel. Otherwise we absolutely can boot those images that don't include the bootloader on plenty of other devices.
How did you find out to move the kernel module from built-in to loadable? Was it that? I was looking in the device tree files and I noticed a mention of the display driver in there, but it looked like there was actually a duplication of those mentions. And so when I went and switched one of those modules from Y to M on the displays, it worked, and that's all it needed. And in the kernel logs it also said that that display driver was trying to start before DRM was available.
A question from the Matrix. I've heard this question before today, but yeah, the question is: any plans on migrating to ModemManager? I saw that question earlier and I would also hope so, but I don't think that is actually viable right now, because that would mean the whole Ubuntu Touch stack would have to move to ModemManager, so we instead have to rely on what the rest of the distribution is using, which right now is oFono.
Another question: according to the picture, recovery was dropped in the 20.04 layout. Was recovery functionality integrated into boot in the initramfs? So it wasn't dropped, it's just not available yet. It's still a work in progress.
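The Y-to-M change he describes is a one-character edit in the defconfig: `=y` (built into the kernel image, initialized before DRM is up) becomes `=m` (a loadable module, probed once its dependencies exist). A sketch; the `to_module` helper and the example option name are my inventions, not the actual PineTab option:

```shell
#!/bin/sh
# Flip a kernel config option from built-in (=y) to module (=m) in a
# defconfig-style file. The kernel tree's own helper can do the same:
#   scripts/config --file <cfg> --module <OPTION>
to_module() {
    opt="$1"; cfg="$2"
    sed -i "s/^${opt}=y\$/${opt}=m/" "$cfg"
}
```

Usage would be something like `to_module CONFIG_DRM_PANEL_EXAMPLE arch/arm64/configs/pinetab_defconfig` before building, so the panel driver loads only after the DRM core it depends on.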
I do not necessarily have a question, but I have a quick addition for the person that asked about the standardized boot format, about the DOS games. I think it was that guy. People are moving towards U-Boot, and chain-loading U-Boot on other devices, and making repartitioning possible. So in the end it would look much the same as on the PinePhone images you developed. So that was a quick addition. Thanks.
A follow-up question. You meant kernel options before compiling, with Y and M? Say it again. Did you mean kernel options Y and M? Yes, yes, in the defconfig. Thanks.
Could you name a single thing that would make porting to another device easier? What was the hardest thing? What would make your life easier if you had to port to a new device? If the bootloader was figured out for me, then it would make it really easy. Because as I mentioned with the PinePhone and PinePhone Pro images, it's really just the kernel at that point. It's not hard to figure out what kernel modules you need to get a certain device to boot.
Maybe one more generic question. What's the current status of full disk encryption in UBports? Say it again. The full disk encryption status in UBports. I actually don't know that. Does anyone? Alfred? Yeah, passing on to Alfred. Yeah, thank you. So, first of all, there is no home encryption whatsoever right now, unless manually set up with scripts, in which case you can do that yourself. We don't provide any default yet, but we want to provide a default. And that's probably not going to be LUKS-based encryption, but rather directly file-based, with ext4- and F2FS-based solutions. Because Android devices have Android partitioning schemes, they have various differences where it makes no sense to do full disk encryption in the way we're used to from the desktop.
And with it being on the user data partition, we can ensure that selected things inside the user data are encrypted, like the home directory of the main user of the device. In that case we can unlock it with the same on-screen keyboard that the Lomiri desktop uses, without having to add the on-screen keyboard to the initramfs early in boot, so that they don't look different, so that they look cohesive and work with similar technologies, so that it's one completely fitting thing that does it all for you. So in this case: full disk encryption, probably not, but file-based encryption, or file-system-based encryption, more likely. There have been experiments with that, and they were successful.
How did you feel when you first successfully booted Ubuntu Touch on the PinePhone? It was an awesome feeling, but as I mentioned, I have been tech-tinkering for a long time, so it was also a very familiar feeling of: oh yeah, I got it working. Thank you.
Towards a bright future with Mobian?
Thank you all, and thank you all for attending this talk. So yeah, I'll be talking about how we can improve our future as mobile Linux users, especially with Mobian, but this all applies to other similar projects such as postmarketOS and so on. The first question you might have is: who is this guy? Basically, I'm working as a Senior Software Engineer at Collabora. I deal mostly with building and maintaining custom distributions for embedded systems, so kind of related to what I do with Mobian. I've been a long-time FLOSS enthusiast, and I've been a Debian developer for a few years. Back in 2020, at the last FOSDEM before the pandemic basically, I got my hands on a PinePhone, and this prompted me to work on mobile Linux in general and to start, and still continue, working on the Mobian project. So what actually is Mobian? It's a Debian derivative, or in the Debian jargon we call that a blend, which targets mobile devices such as smartphones and tablets. It has a separate package repository and provides ready-to-use disk images you can flash on a few devices. It's actually a very small overlay on top of Debian: we currently provide only 25 source packages in our repository, compared to the vastly greater number in Debian, which means that of all the packages you have access to from a Mobian device, actually more than 99.9% are pure Debian. We have a few packages with downstream patches which can't be upstreamed at the present time. Half of those are kernels; a few others are user-space applications, and we're working on dropping those patches and trying to find upstream-friendly solutions. We also have a few packages which are basically workarounds, because the feature does not exist in the upstream world, not yet at least. One of those is, for example, Millipixels, which is the camera application for the Librem 5.
Once the Librem 5 gets supported by either or both of Megapixels and libcamera, we can basically just drop this package and rely on upstream applications. And finally, we have six Mobian-specific packages which are to be reworked to be included in Debian itself, so we can lower the impact and the footprint of Mobian. We hope that we can get below 10 packages by the end of next year. We'll see if we make it, but that's our end goal for now. So, latest developments: what happened in the past year? We had our first stable release, and I do put quotes around "stable". Basically, we released Mobian Bookworm at the same time as Debian Bookworm was released. So that's our stable release. It doesn't mean it's bug-free; it just means that we don't do huge upgrades, only targeted fixes, so the system stays stable and keeps working as it currently does even after software updates. It was released in June last year. We have a few devices supported out of the box, which are several Linux-first devices: the PinePhone, the PinePhone Pro, and the Librem 5. We support a few Android-based devices thanks to the work of the community, especially on the SDM845 kernel support, so we support the OnePlus 6 and 6T and the Pocophone F1. And we also provide x86 images for desktop PCs or x86 tablets such as the Microsoft Surface Pro and Go. We provide a single desktop environment in this release, which is Phosh, and we provide up-to-date 6.1 kernels. The 6.1 kernel is not the latest LTS branch but the previous one, meaning it's supported up until 2026 if my memory is good. And we have a script in CI which runs daily and automatically rebases all the kernel packages we have on 6.1 onto the latest point release. So basically, when there's a security update, usually the day after or the same day, the kernel is up to date in the bookworm-updates repo, which is basically our staging repo for the stable release.
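The daily CI job he describes boils down to asking the stable tree for the newest 6.1.y tag and rebasing the device patches onto it. A sketch of the tag-selection step only; the function name is mine, and a real job would follow with `git fetch` and `git rebase`:

```shell
#!/bin/sh
# Given a list of kernel tags on stdin, print the newest point release of
# a series, using version sort so v6.1.76 beats v6.1.9. A CI job could
# feed this from something like:
#   git ls-remote --tags <stable-remote> 'v6.1.*'
latest_point_release() {
    series="$1"   # e.g. v6.1
    grep "^${series}\." | sort -V | tail -n 1
}
```

Plain lexical sorting would pick v6.1.9 over v6.1.76, which is why the version sort (`sort -V`) matters here.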
There are however a few things we wanted to include in this release that couldn't make it. The first one is universal images. The plan here would be to have a single kernel package for all supported devices. It's working quite well for SDM845 devices because they already share a single kernel and the people working on those devices all put their patches into the same repository. But for Pine64 devices for example, which are based on Allwinner A64, Rockchip, different chips, it turns out that making a single kernel package out of those proved to be trickier than we anticipated, and so we basically dropped this effort at some point and focused on having just per-device kernels, at least for this release. So we couldn't make universal images, obviously. We also didn't find the time to improve the hardware support upstream. We still carry lots of patches across all the devices I mentioned; it must be a total of 800 to 1000 downstream patches in the kernels only, so that's quite a significant amount. We'd like to get them upstream, but we all have day jobs and for now every day is still only 24 hours, so we have to make choices. Also, we wanted to switch to the latest LTS kernel, which is now 6.6, and finally realized that we couldn't because we didn't have the time or resources to spend on that. So that means Bookworm is stuck forever on 6.1, which is not too bad because the life cycle of Bookworm will end in about a year and a half, and until then this kernel will still receive security updates and bug fixes. So as long as Bookworm lives, the kernel lives along with it and we can stay up to date and avoid security holes anyway. However, the next release, which I'm about to talk about, is Trixie, and it's already on 6.6. So what about the recent developments? We're still trying to unify our disk images, slowly. Instead of aiming for a single image for all devices, we're taking a step along this path and trying to ship just one image per kernel.
Until now we had one image for the PinePhone, one image for the PineTab, another one for the PinePhone Pro and the PineTab 2 and so on, because some of those devices require hardware-specific tweaks to be included, with configuration scripts, udev rules and so on. And we came to a point where actually most of these tweaks weren't needed anymore, because upstream had caught up and had the necessary features for those devices. So instead of having one image per device, we could envision having one image per kernel. And so we have our kernels per architecture basically, per sub-architecture really. We have one for the Allwinner A64 devices. We have one for the Rockchip-based devices, which are the PinePhone Pro and the PineTab 2; two different SoCs from Rockchip, but still we can use the same tree and so on. It was already working well on the SDM845 devices, but we took this step a few weeks ago and it quite reduced the number of images we were building. Regarding Qualcomm-based images, we had until now one image for the SDM845 devices and another one for the SM7225, which is basically the Fairphone 4, because we used to maintain different kernels for all of those. This is going to change, and actually already changed recently, because we pretty much imported all the patches we needed into a single kernel for all the Qualcomm devices we support. There are not many of those, which is why we manage to do that, but for now we have a single kernel which handles all the SDM845 devices, the OnePlus 6 and so on, the Fairphone 4 which has a different chip, and also the Fairphone 5 which has yet another different chip. And so we have a single image for all Qualcomm devices, and we just use a simple config file at build time to generate the boot image for the device, because although the root file systems are identical, the boot images are really device-specific: they need to have the device tree appended, the specific ramdisk and so on.
But other than this boot image generation, everything is handled at runtime using droid-juicer, which fetches the binary firmware from the Android vendor partition, because those devices ship with Android first and so the firmware is already present on the device. This makes things a bit easier for us because we don't have to care about the firmware license: we don't distribute it, it's fetched at runtime from data which is already available on the device. And there's also a small package with Qualcomm phone tools, which basically just includes a few scripts and configuration files that are the same on all Qualcomm-based devices we support. In the process, we're also adding a simpler way to add new device support, at least if it's Qualcomm-based. The thing is, until now we needed to have a kernel package in the Mobian repo and a few specific tricks in the image build process. We created a new target for these build scripts and build recipes, qcom-phone-wip basically, which is kind of a dummy device. The thing is, you can separately build, or rather cross-compile, your downstream kernel using the bindeb-pkg make target, which is supported by upstream Linux, so you don't have anything specific to do there. It generates a Debian package which you can drop into the Mobian recipes folder; you edit some config file, run the build script, and it will provide you with a rootfs image and a boot image tailored for your device. Then you can flash it using fastboot and hopefully celebrate that your device can run Mobian. It's almost never that easy, but the thing is, we're moving the complexity from knowing the internals of the build system to just debugging the kernel booting on your device. There's nothing Mobian-specific in that, it's just general debugging, and we basically made sure it was as simple as it could be from the Mobian side.
And we also have a small FOSDEM present, in the sense that Mobian now provides Plasma Mobile images; it's been a week since the first images were published. It actually started over a year ago, and the goal was from the start to have everything in Debian itself rather than carry downstream packages in Mobian. And so Marco, one of the Mobian developers, worked on that for more than a year and managed to get all the needed packages into Debian itself, including the Plasma Mobile meta-package, which you just have to install, apt install plasma-mobile-full for example, and it will pull in all the packages you need, and from there we could build our Mobian image. So that's basically what happened over the last year. Now, what's next? We're trying to take a step further towards universal images. I've talked about the kernel issue, unifying all patches into a single kernel, but actually there are all these little device-specific tweaks I mentioned earlier which have to be handled, and until now we have per-device packages, which means one new package in the repo for each new device we want to support. This is an approach that doesn't scale at all. I mean, it works fine if you manage 10 devices; if you aim for tens or, let's hope, hundreds of devices, it's just too much work for a small team. So the idea here is to have a runtime service which will identify the device it runs on, using the device tree compatible property for example, or the ACPI/DMI vendor, manufacturer and so on strings on x86; select or generate the needed config files; put them into a runtime-generated Debian package; and install it on the device, with the ability to place triggers on that, so that when one specific config file is modified by another package, this tweaks package is regenerated, rebuilt and updated as well. So that's something we hope to achieve this year, as well as getting closer to a Pure Blend.
This is a specific class of Debian derivatives, and it involves having all the packages in the Debian repository. So this is our next step once we have working runtime tweaks management: basically this would mean having all our meta-packages, tweaks packages and so on in Debian itself, so you can just install everything Mobian from the Debian repository. Not all hardware features will work unless you use the Mobian-provided kernels of course, so Mobian will stay relevant for some time at least, and we'll also still be able to generate ready-to-use images, which makes things easier for users rather than having to build things themselves from the Debian packages. Another big topic is call audio management. A few years back we created callaudiod, which is a daemon monitoring the phone call status and switching audio profiles and routing on the go depending on the situation. This was in a PulseAudio world, and back then PulseAudio didn't really bother with such things; the only automatic switching it did was when you plug in headphones and so on, and we made sure that callaudiod disabled even that on the PulseAudio side. But now we are living in a PipeWire world, and with PipeWire comes a session manager, which by default is WirePlumber, and the session manager is meant to do just that: switch audio profiles and switch the routing to match the current situation. And so callaudiod races with WirePlumber, and most of the time it loses. This means that you're having a phone call and actually you don't hear anything in the phone earpiece, because WirePlumber did the switching right after callaudiod instructed PipeWire to do so. So there's clearly a conflict there, and the goal here is to make callaudiod basically a part of WirePlumber itself. This needs some work in PipeWire to make it aware of the modem and to monitor the phone call stages, but we hope to submit an initial RFC implementation at some point this year. No promises, obviously.
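The device-identification step of the runtime tweaks service described earlier could be sketched in plain Java. Everything here is an assumption for illustration: the class name, the compatible-string values and the tweak names are invented, not actual Mobian code; only the idea (match the most specific entry of the kernel's NUL-separated `compatible` list against a known table) comes from the talk.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a runtime tweaks service: identify the device from
// the device tree "compatible" property and pick the matching tweak set.
class DeviceTweaks {

    // Hypothetical mapping from compatible strings to tweak sets.
    private static final Map<String, List<String>> TWEAKS = Map.of(
            "pine64,pinephone-1.2", List.of("audio-profile-pinephone", "suspend-enabled"),
            "pine64,pinephone-pro", List.of("audio-profile-pinephone-pro"),
            "oneplus,enchilada",    List.of("qcom-firmware-tweaks", "suspend-disabled"));

    // The kernel exposes "compatible" as a NUL-separated list of strings,
    // most specific first; we return the tweaks of the first known entry.
    static List<String> selectTweaks(String compatibleProperty) {
        for (String compat : compatibleProperty.split("\0")) {
            List<String> tweaks = TWEAKS.get(compat);
            if (tweaks != null) {
                return tweaks;
            }
        }
        return List.of(); // unknown device: no tweaks selected
    }

    public static void main(String[] args) throws IOException {
        Path compat = Path.of("/proc/device-tree/compatible");
        // Fall back to a sample value on hosts without a device tree.
        String value = Files.exists(compat)
                ? new String(Files.readAllBytes(compat))
                : "pine64,pinephone-1.2\0pine64,pinephone";
        System.out.println(selectTweaks(value));
    }
}
```

The selected tweak names would then feed the generation of the runtime Debian package; the x86 path would read DMI strings from `/sys/class/dmi/id/` instead.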
And finally, we plan a few other minor improvements. Most of the project's development process and infrastructure is under-documented, as is most often the case. We have very user-centric documentation written by users, but we are very few developers and we didn't take the time to document that side. So we'd like to improve that, because basically a significant portion of the project has a bus factor of one, which is me. So I'll try to change that and make sure we have backup solutions and we become more welcoming to other contributors. And finally, we'd also like to keep working on upstream device improvements. The PinePhone Pro has a few low-hanging fruits we can probably upstream easily. The support for the PineTab 2 is being merged upstream as we speak; it now has a working Wi-Fi driver, and we'll have to look at whether it can be upstreamed as well. We also hope to support the PineTab-V, which would be the first RISC-V device supported in Mobian. And we obviously also welcome contributions to support more devices, to help us with documentation, and to basically help us make the mobile future brighter for all of us Linux mobile users. So here are a few links; I'll put the slides up. Thank you very much. Questions? Hi. So I was profoundly disappointed to read your blog post in October about the travails with the PinePhone kernel and the fact that essentially all of the work that had gone into the PinePhone kernel in Megi's kernel tree was not being upstreamed, which I presume had really been the case since the PinePhone came along. So I was just wondering what had happened, if anything had changed on that front, if Megi was upstreaming patches now or anyone else, and what the situation was with that. For the original PinePhone, the current situation is that someone in Mobian stepped up to maintain and update this kernel.
He also started upstreaming a few patches and is monitoring the kernel mailing list and working with upstream to improve the situation over time. So there's lots of work to be done. I know there's also another person who has started working on a driver for the Wi-Fi chip, which until now was a downstream Realtek one, full of crap basically and nothing close to being upstreamable. The new driver will hopefully be upstreamed, and so that's already one big pain in the ass which will be removed. So now there's a bit more hope for the original PinePhone, and if things continue that way then it will probably be great. A question from the Matrix: is there any plan to port eg25-manager to libgpiod 2.0? Right, yeah, eg25-manager is a very specific piece of software for the modem found in the PinePhone and PinePhone Pro. It's using GPIOs through libgpiod, and there's a new release which changed the API completely. The thing is, for now libgpiod version 2 isn't packaged in Debian; version 1 is. So for now I don't have any definite plan. The plan being, once version 2 is in Debian then we go with it, but before that I'm not sure I'll have the time to deal with all of this. But merge requests are welcome, as always. Yeah, so a question regarding your tweaks approach. Why do you want to build, if I understood this correctly, the tweaks on the device, package them there and then install this package, instead of having just one package that carries all the tweaks? The thing is, we will have one package carrying all the tweaks, but those tweaks can conflict with each other. You can have conflicting configurations, for audio for example, and depending on the device you have to select the right one. You also have devices which can't suspend, because otherwise they don't resume, and other devices which can. So you have to select the appropriate tweaks, and the idea of creating a Debian package is that the packaging system is aware of those files.
If you have conffiles and the user changes something, then it won't overwrite them with a file from another package. If we don't build a package on the device and install it, if we just move files around, the packaging system will not be aware of those files, and if at some point one Debian package ships a file with the exact same name, then it will break. So that's the idea. Alright, please give another round of applause for Arnaud. Thank you.
Exploring Quarkus Native: Choices and Implementation
Hello everyone, I'm Foivos Zakkak and today I will talk about Quarkus Native, some choices it makes and how it implements them. So how many of you are familiar with Quarkus? Know what Quarkus is? Well, fewer than I expected. Okay, so what is it? It's a Java framework. Well, it's an open source stack for building Java web apps, so it's a Java framework that aims to bring developer joy, it's Kubernetes-native, brings best-of-breed libraries and standards, and supports both imperative and reactive code. And that stopped working. So what does a framework typically do when you use it? Usually you write your Java application using the framework, then you package it, you ship it wherever you want to deploy it and you start the application. And what it does is it will load configuration files, perform some annotation processing, create some metadata graphs or whatever is needed, and eventually run the application. What Quarkus does to improve that situation is that it moves part of this configuration to build time, so you only run the configuration and setup of your application once, and then when you deploy your application, it starts up faster and you don't have to repeat all this process. One benefit of this Quarkus feature is that it also allows you to go native. So instead of deploying on the JVM, you can deploy a native binary. So why would someone want to go native? We have put so much effort into making the JVM very mature, very stable, very high performance, et cetera, so why would someone want to go native? Without going into too much detail, I will list some of the pros and cons of going native. First the pros. One of the major advantages of going native is that you get faster startup, because you don't have a JVM that needs to start up, load classes, do class initialization, warm up, stuff like this; you get faster startup.
You also get close to peak performance right from the beginning, because you don't do just-in-time compilation; everything is ahead-of-time compiled, and that gives you close to your peak performance right from the beginning. You get a smaller standalone binary; hint here, I'm comparing with shipping your application together with a JVM, otherwise the JAR file is smaller than the binary. And you also get a smaller memory footprint when running your application, because you don't have to keep all this data that the JVM keeps to track internal things. Another benefit is that if you launch the same application multiple times on the same host, the instances can share the heap as a copy-on-write memory segment. Now, what are the disadvantages? First of all, you get a slower development cycle. Compiling to native takes longer than compiling to a JAR file. So we suggest that you develop on the JVM, debug on the JVM, and only when you are happy with your application move to native, because that takes some time. You also get lower peak performance, because when you run a binary you don't get just-in-time compilation, so the compiler doesn't have the benefit of profiling your code to do better optimizations. The JIT can also perform very aggressive optimizations, relying on the deoptimizer to fall back to a slower version if something doesn't go as assumed at compilation time; an ahead-of-time compiler can't do that. Another issue is that security patches require recompilation. So if a third-party library is vulnerable, you can't just update the JAR file of that third-party library and not recompile your code. You have to rebuild your application, because parts of that third-party library might be embedded in it. Your application is also not portable; you lose the write once, run anywhere principle. Because you are generating a binary file, it will only work on the target platform that you compiled for. And last but not least, it lags behind in terms of tooling support.
So debugging is not as simple as in the JVM world, and the same goes for observability. That doesn't work. Okay. Now that we have seen that there are some benefits in using native code, let's see how it works. Quarkus uses GraalVM, and particularly GraalVM's Native Image, to generate the binary from Java code. How this works is that GraalVM will take as input your Java application classes, the JDK classes, and the Substrate VM classes. Substrate VM is a thin runtime layer that allows your application to run on bare metal, so it takes care of some of the system things going on. Then it performs a static analysis, and this allows it to perform dead code elimination: it essentially doesn't compile any code that you don't need. If your application doesn't reference some part of your class path or your dependencies, it won't go in the binary. So it creates a graph like this, where your Java application references some JDK classes and the JDK classes reference some Substrate VM classes, and it will eventually compile it to a native binary. However, GraalVM comes with some limitations. There are things that are not supported, and there are things that are supported but need manual configuration. Some of the unsupported parts are currently work in progress. I don't have enough time to go through this. So what does Quarkus offer on top of that? GraalVM takes Java and produces native code, so where does Quarkus Native come into play? Because of the limitations I mentioned earlier, developing native applications for GraalVM's Native Image might be painful, and that's where Quarkus comes into play. It aims to help Java developers write their application and compile it to native without having to handle all the extra things that GraalVM Native Image requires. First, Quarkus will drive all the gathering of the metadata that GraalVM needs.
So: what's reflectively accessed, which JNI interfaces are used, what resources we want to include in our binary, and stuff like this. Another benefit is that most of the ecosystem, so anything that comes with Quarkus, is already supported for native image compilation. So if you want to use a library that's already supported by Quarkus, you don't have to do anything special; you just add it as a dependency to your application and it should work with native as well. It minimizes the dependencies, because Quarkus already does a dependency analysis before going to native, which allows you to pass fewer things on the class path and helps the static analysis do the dead code elimination. Furthermore, Quarkus, through annotations, APIs and some configuration properties, allows you to further fine-tune the configuration of your application for native. So some might think that's not the only framework that does that, right? So why Quarkus? Quarkus takes an opinionated approach, and it's different from the other frameworks in that it will try to build-time-initialize all the classes, while by default GraalVM's Native Image runtime-initializes the classes. This might create some issues, so Quarkus will take care of reinitializing anything that's necessary, like random seeds or some platform-specific values, and it will also reset fields that we don't need at runtime. It also doesn't allow incomplete class paths: when you build, everything needs to be on the class path, otherwise the build will fail, and this ensures that you won't get any unexpected NoClassDefFoundError at runtime. And last, it uses Mandrel instead of the upstream GraalVM Community Edition; Mandrel is based on the Eclipse Temurin OpenJDK build instead of the Labs JDK build, and it's specifically tailored to Quarkus and maintained by Red Hat. So how does this really work under the covers? First of all, Quarkus will take care of generating the GraalVM Native Image JSON configuration files.
It will perform code substitutions wherever necessary. Code substitutions allow us to go and patch third-party libraries or even the JDK itself: if we don't like something there, or if something is not compatible with native compilation, we can adapt it. It will generate some bytecode that is responsible for configuring things, it will change the defaults for GraalVM Native Image, and it will also allow the user to pass additional parameters. For the JSON configuration part, it generates these five files: one for JNI, for proxy classes, for reflective accesses, resources and serialization. The generation of these files is handled by the classes here, the native image reflection config steps, let's say. And it decides what to put in these JSON files based on the build items that exist in your application; in Quarkus, you can define the build pipeline using these build items. Earlier I mentioned substitutions. Substitutions are heavily used in Quarkus because they assist in dead code elimination, and they also make sure that things that are not supported in native code are not reachable, throwing appropriate exceptions instead. Quarkus performs 303 method substitutions and 32 field recomputations in a total of 208 classes. This means that you don't have to do any of these on your own; they are already handled by Quarkus, and this is only in Quarkus core. If you go and use some Quarkus extension, it performs its own substitutions and so on. To see an example: here we substitute the method allocateBuffer in this class, and we only do that when zstd is absent from the class path. What we substitute the method with is a throw of an exception saying that this operation is unsupported. So if you compile your code to native and it invokes this method while the zstd library is not available, you will get this exception. And this is how we recompute fields.
So here, in Bouncy Castle's ECPoint, we go and reset the testRandom field, because this is a secure random and we don't want it to be pre-seeded and pre-initialized in the native image; this way, whenever we restart the application, we get different random numbers. We can similarly change the value of a field by reinitializing it from an alias; that means we can put in whatever value we want, not just reset it to null. Here we change the field unavailabilityCause to put a Quarkus-specific exception in there, and we also substitute the method isAvailable to return false, to show that OpenSSL is not supported in this specific case. Regarding features generation, this is handled by the NativeImageFeatureStep class, and it uses Quarkus Gizmo to generate bytecode. This bytecode is used to invoke GraalVM's APIs to perform stuff that cannot be done through the JSON configuration. So here is a part of the native image feature that we generate. What it essentially does is that it first gets the method descriptor for the RuntimeClassInitialization.initializeAtBuildTime method, and then it invokes this method, passing it a string array containing the empty string. This instructs GraalVM to build-time-initialize everything, which is different from what it does by default. And we can also parameterize the options that are passed to the native image build. We do that in the native image build step, and here we see part of it. What it does is that it always enables AllowFoldMethods, which is off by default; it makes our application headless by default; it doesn't allow the creation of fallback images, because fallback images are essentially JVM launchers, so you don't get the native application that you asked for; and we also always ask it to link at build time. And that concludes the talk. I would like to acknowledge that Quarkus participates in an EU-funded project. And I'm ready to take questions, if any. Any questions in the chat?
Yeah, the custom class loader is a bit tricky because Quarkus... The question was whether Quarkus also supports the standard JDK instead of the GraalVM JDK. So this is the first part of the question, and the answer to that is yes. This is Quarkus Native and this is optional; this is only if you want to go native. If you want to stay on the JVM path, you can use any JDK and it will work just fine. Now to the second question about custom class loaders. Although I'm not very familiar with that, I think this might be a bit tricky, because Quarkus already uses custom class loaders, so you have to make sure they are somehow compatible. I couldn't hear the question, so... Okay, you find a library and you wonder whether you can use it or not. Okay, if the library is supported by Quarkus itself, you will find it listed in the Quarkus supported libraries, or in a Quarkus extension that supports this library. In that case, everything should work out of the box and you don't need to do anything. In the case that your library is not supported by Quarkus core or any of the Quarkus extensions, then you need to use some of the tricks that Quarkus does to make it work, and Quarkus gives you some APIs and annotations that may assist you. Let's see... Is there a website, like a list of supported libraries, that I can go to and have a look? I think if you go to code.quarkus.io, you can see a list of supported extensions and libraries. Do we have time for some more questions? One more question. Sorry. I was wondering if Quarkus Native works with JNI-based providers, sorry, the provider interface, not JNI. The foreign API? No, no, sorry, like class discovery when you want to load a specific service, SPI, that's the name, sorry, the service provider interface. I think... I don't know. Okay, thank you. Okay, for the rest of the questions, please feel free to approach me in the break. Thank you.
An in-depth look at JFR in GraalVM and how it compares to JFR in OpenJDK
Hi everyone, my name is Robert Toyonaga and I work at Red Hat. Today I'll be talking a little bit about JDK Flight Recorder in GraalVM Native Image; from now on we'll just refer to JDK Flight Recorder as JFR. As a high-level breakdown, I've broken this presentation into two sections. The first section is a high-level overview of JFR in Native Image, and then we'll go into a low-level deep dive of JFR in Native Image and talk about some comparisons between Substrate VM and HotSpot. And I want to make note that even if you're not interested in GraalVM Native Image at all, you may still be interested in the second half of this presentation, because the details of JFR we're going to be talking about there extend beyond just Native Image and also apply to HotSpot more generally. Okay, so as a very quick refresher, JFR is an event-based monitoring and profiling tool. It's built directly into the JDK and it can give you some really valuable insights into what your application is doing, both at a high level and also at the VM level. Okay, so Foivos already talked about this a little bit, but GraalVM Native Image is essentially a technology that allows you to convert your Java applications into binary executables. The appeal of this is that you get much faster startup and use fewer resources, and a big reason for that is you don't have to warm up a traditional JVM alongside your application code. How it works is you compile your Java application to bytecode like you normally would, and then you run the native image tool to convert that bytecode into your executable, which you can later run. So why is JFR different in Native Image than in OpenJDK?
The reasoning behind this is that a native image executable doesn't require a traditional JVM to run; however, it still requires certain runtime components that your Java code expects, such as GC and synchronization constructs like monitors, for example. What's providing that in native images is something called Substrate VM, which you can think of as sort of a scoped-down replacement for HotSpot. It does a lot of the things that your Java code requires, but strips out a lot of the dynamic stuff that HotSpot does that we don't really need in this environment. And the key here is that since a lot of the JFR code is embedded within HotSpot, when we transfer it over to Native Image, where we're using Substrate VM, it has to be re-implemented in that VM instead. That involves everything from the low-level JFR event instrumentation to the actual infrastructure that carries that JFR data from the point of instrumentation to the point where it's later consumed by a user. Yeah, so in terms of the current state of JFR support in Native Image: you can do things such as starting and stopping recordings from the command line or from within your application code via the Recording API. Several events are implemented, especially at the VM level; we have events for threads, monitors, allocations, GC, safepoints, etc. You can dump snapshots to disk and inspect them with tools such as VisualVM or JDK Mission Control, as you normally would. The custom event API is also working, so you can create your own custom application-level events. Stack traces and CPU profiling are also possible, and event streaming has recently been added as well. You can even connect via remote JMX to the FlightRecorderMXBean, which practically means you can do things like, from within the JMC UI, interact with JFR recordings that way, start them and manage them on the fly.
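The Recording and custom-event APIs mentioned above are plain `jdk.jfr` classes, so a minimal example runs unchanged on OpenJDK, and the idea is that the same code works in a native image built with JFR support. This is a generic sketch; the event name and fields are invented for illustration.

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.Recording;
import java.nio.file.Files;
import java.nio.file.Path;

class JfrDemo {

    // A custom application-level event; the name and fields are illustrative.
    @Name("demo.HttpRequest")
    @Label("HTTP Request")
    static class HttpRequestEvent extends Event {
        @Label("Path") String path;
        @Label("Status") int status;
    }

    // Start a recording programmatically (an alternative to the command-line
    // -XX:StartFlightRecording option), emit one event, and dump a snapshot.
    static Path record() throws Exception {
        try (Recording recording = new Recording()) {
            recording.start();

            HttpRequestEvent event = new HttpRequestEvent();
            event.begin();
            event.path = "/index.html";
            event.status = 200;
            event.commit();

            recording.stop();
            Path out = Files.createTempFile("demo", ".jfr");
            recording.dump(out); // snapshot readable by JMC, VisualVM or `jfr print`
            return out;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Dumped " + Files.size(record()) + " bytes");
    }
}
```

The resulting `.jfr` file is exactly the chunk-based format discussed later in the talk, which is why the same tools can open snapshots from either HotSpot or a native image.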
How you might first interact with JFR in Native Image is: at build time, you specify the --enable-monitoring flag and say you want JFR specifically, and that builds the JFR components into your executable. So then at runtime you can use the normal -XX:StartFlightRecording option and pass all of the normal parameters that you would require, such as specifying a file name to dump the recording to, or a duration, etc. There are still quite a few limitations to JFR in Native Image. So not all events are implemented yet; it's an ongoing effort to keep up with OpenJDK in that area. Specifically, events related to bytecode instrumentation are not yet supported, and of course some new JDK events, we're trying to keep pace with that as well. Event streaming doesn't yet support stack traces, so that's one limitation of that. And we have a couple of things that are in the review pipeline as well and are not yet supported in any release. That said, we've reached the deep dive, which is going to take up the majority of the presentation. And yeah, let's take a deep breath. So this road map essentially represents a very high-level, zoomed-out view of the flow of JFR data through the system. And from now on each slide is going to contain this road map, and the highlighted part will indicate the part that we're currently talking about, just for convenience and easy reference. So firstly, the point of instrumentation. These are the various points where JFR events are made, either in application-level code or at the VM level. And the screenshot on the slide is just from JDK Mission Control; I'm just using it to show some content that an event may contain. You can see there's a bunch of fields and corresponding values. And this is just one example; it'll vary by event. And you can think of JFR events as the primary thing that we're concerned with, really.
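The custom event API mentioned earlier works the same way in Native Image as in OpenJDK. A minimal sketch of an application-level event (the event name and fields here are invented for illustration):

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

public class CustomEventDemo {
    // A hypothetical custom application-level event; its fields become
    // the event's payload in the chunk file.
    @Name("demo.HttpRequest")
    @Label("HTTP Request")
    static class HttpRequestEvent extends Event {
        @Label("Path")
        String path;
        @Label("Duration ms")
        long durationMs;
    }

    public static void main(String[] args) {
        HttpRequestEvent event = new HttpRequestEvent();
        event.begin();
        // ... handle the request ...
        event.path = "/index.html";
        event.durationMs = 5;
        event.end();
        // commit() is a cheap no-op unless a recording is running
        // and this event type is enabled.
        event.commit();
        System.out.println("committed");
    }
}
```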
And the rest of the slides going forward are basically just the piping to get that JFR data from the point of instrumentation to the chunk file, where it can be consumed later. So yeah, speaking of chunk files, we're jumping all the way to the end of the road map. So chunk files are essentially the resting place of the JFR data, as far as we're concerned for this presentation. And they must contain basically the same information, in the same format, regardless of whether OpenJDK or Native Image is generating them. And they can be dumped to snapshots, the JFR snapshot, which is the .jfr file format. And that's usually how people are going to interact with them, via JMC or VisualVM or the JFR command-line tool. Yeah, so chunk files are self-contained and they have four distinct sections. You can see in the diagram here: a header, which contains pointers and other metadata. There is the event data section, which contains the core JFR event data. Then there's the metadata section, which describes the format and layout of the events in the event data section. And then we have the constant pools, which contain constants that are referenced from the event data section. So, the constants: in order to reduce the size of JFR data, we use a referencing-ID scheme to increase compactness. And how this works is that entries in the event data section of the chunk file will use unique IDs to reference into the constant pool section of the chunk file. And this helps with deduplicating the actual constants that are used by the JFR events. So in this slide you can see there's an example of one event entry which uses the unique ID 12, which is then going to be used to index the thread constant pool and reference the actual thread data residing there. So all this increases the compactness of the JFR data, and what that does is reduce overhead when dealing with it while it's in flight and when writing it to disk. It reduces the overall chunk file size as well.
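The referencing-ID scheme can be illustrated with a toy sketch (this is not JFR's actual code, just the idea): repeated constants map to the same small ID, so each distinct constant only needs to be persisted once.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of the referencing-ID scheme: event entries store a
// small ID instead of the full constant, and each distinct constant is
// written to the constant pool only once.
public class ConstantPoolSketch {
    private final Map<String, Long> pool = new LinkedHashMap<>();
    private long nextId = 1;

    long idFor(String constant) {
        // Deduplication: a repeated constant gets its existing ID back.
        return pool.computeIfAbsent(constant, c -> nextId++);
    }

    public static void main(String[] args) {
        ConstantPoolSketch threadPool = new ConstantPoolSketch();
        long a = threadPool.idFor("worker-1");
        long b = threadPool.idFor("worker-2");
        long c = threadPool.idFor("worker-1"); // same ID as the first call
        System.out.println(a + " " + b + " " + c + " poolSize=" + threadPool.pool.size());
    }
}
```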
However, the downside of this increased compactness and this referencing-ID scheme is that we have a tight coupling of the event data and the constant pool data, so that if they're ever separated and not found in the same self-contained chunk file, then we can't decode the event data section and it's basically unreadable. So that's one downside. Right, so now that we've talked about the very beginning and the end of the road map, we'll jump in and fill in the middle. So, after event emission, the JFR data splits: the core event data goes to the JFR thread-local buffers, while the constant data goes to the constant pools. And in both HotSpot and substrate VM, the JFR thread-local buffers essentially have the same purpose and same structure. So they're structured in a segmented way that allows for concurrent writing and reading of data, and there are various pointers which define the sections. So there's the write position pointer, which basically determines where new data is written into the buffer; when an event write is in progress, that's the pointer that's going to be in use. Then there's the committed position pointer, which represents the end of the committed data section. And the committed data section is data that has been fully written, so it's not an in-progress write, but it hasn't migrated anywhere else yet. The flushed data section is essentially committed data that has been migrated somewhere else, so it can be overwritten at the earliest convenience. Eventually the buffers will fill up with committed data and will have to be flushed elsewhere, and at that point all the pointers reset back to the start position. HotSpot is a little bit different in that it uses buffer pools to recycle buffers. So there's a live list and a free list, and when a new thread requires a thread-local buffer from JFR, one will be taken off of the free list and put on the live list, and vice versa when that thread goes away. But in substrate VM we have it a little bit simpler.
We just allocate a thread-local buffer in native memory when it's required, and when the thread goes away we destroy that memory. So we don't really have to manage access to these buffer pools and maintain them. Right, in the case of virtual threads, multiple virtual threads may share the same thread-local buffer of the carrier thread, and that's not really an issue, because each one has exclusive access at any point in time and the JFR data is eventually going to the same place anyway. Right, so after the thread-local buffers fill up, the data is migrated to a set of global buffers, and the global buffers essentially act as extra capacity for overflow storage. And it's more efficient than increasing the size of all the thread-local buffers, because not all threads will be equally busy with respect to JFR events. Right, so constant pools. Previously we mentioned how constant pools use a referencing-ID scheme to reduce the size of JFR data, and this essentially works by deduplicating constants. In HotSpot, one way the deduplication works is by using JFR-specific bits in the metaspace data for certain constant types, such as class (with a 'k', klass) and also methods. So these JFR-specific bits act essentially as boolean toggles: when event data in a JFR local buffer somewhere references a constant, that bit in that constant is flipped to indicate that it's referenced somewhere. That way, when it's time to actually persist the constants to disk, we only have to persist the ones that are actually referenced, not all of them. Additionally, if multiple events reference the same constant, that bit is only flipped once, and the constant only needs to be written once; so that's where the deduplication happens.
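A toy model of the thread-local buffer positions described above (this is not substrate VM's actual implementation, which works on raw native memory; a plain byte array stands in here):

```java
// Toy model of a JFR thread-local buffer's three positions, which always
// satisfy: flushedPos <= committedPos <= writePos.
public class TlbSketch {
    final byte[] buf = new byte[64];
    int flushedPos = 0;    // data before this has migrated and can be overwritten
    int committedPos = 0;  // data before this is fully written but not migrated
    int writePos = 0;      // in-progress writes happen at this position

    void write(byte[] event) {
        System.arraycopy(event, 0, buf, writePos, event.length);
        writePos += event.length;   // the in-progress write advances writePos
        committedPos = writePos;    // commit: the event is now fully written
    }

    void flush() {
        // Pretend the committed data was migrated to a global buffer, then
        // reset all positions to the start so the space can be reused.
        flushedPos = committedPos = writePos = 0;
    }

    public static void main(String[] args) {
        TlbSketch tlb = new TlbSketch();
        tlb.write(new byte[]{1, 2, 3});
        System.out.println("committed=" + tlb.committedPos);
        tlb.flush();
        System.out.println("afterFlush=" + tlb.writePos);
    }
}
```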
There are some constant types, such as stack traces, that don't have metaspace data, and in those cases a lookup table is instead used for the deduplication and tracking. And an interesting thing is that in substrate VM, in Native Image, there is no metaspace at all, so we have to rely on the lookup table approach for all the various constant types. Right, so after enough JFR data has been generated, a chunk rotation must be requested, and this is essentially the way that JFR data is continually persisted to disk. The current chunk file on disk that's open is sealed, and then a new chunk file is opened, and in that process all the in-flight, in-memory data is flushed to that chunk file before it's sealed. And the thread that's performing this chunk rotation must flush the thread-local buffers of other threads, and to do that safely we have to request a safepoint. So the order of operations at a chunk rotation safepoint is as follows on the slide. I want to make note that it's pretty similar in OpenJDK as it is in substrate VM. And the span between chunk rotation safepoints, the recording time in between, is called an epoch. And you can see in the green safepoint box that that's where we're actually flushing the JFR buffers, both local and global, to disk. But the most interesting thing here is that we're writing the constant pools to disk outside of the safepoint, when we've already started epoch 2. So what that means is we're simultaneously writing the constants from epoch 1 to disk while recording constants relative to epoch 2, so they're kind of mingling inside the constant pools. We need to keep them isolated, however, because we want to avoid writing constants belonging to epoch 2 into the chunk file for epoch 1; otherwise we'll have that mismatch and we won't be able to decode the epoch 2 constant data, the same issue that I explained a few slides back. So how we do this is we tag each constant according to its respective epoch, to keep them isolated, and
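The epoch-tagging idea can be sketched like this (a toy illustration, not JFR's actual data structures): each pool entry carries the epoch it was recorded in, and chunk rotation persists only the entries for the epoch being sealed.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of epoch-tagging constants so that constants recorded during
// epoch 2 are not written into epoch 1's chunk file.
public class EpochTagSketch {
    record PoolEntry(String constant, int epoch) {}

    static List<String> constantsForChunk(List<PoolEntry> pool, int epoch) {
        // At chunk rotation, only entries tagged with the epoch being
        // sealed are persisted; newer entries stay for the next chunk.
        List<String> out = new ArrayList<>();
        for (PoolEntry e : pool) {
            if (e.epoch() == epoch) out.add(e.constant());
        }
        return out;
    }

    public static void main(String[] args) {
        List<PoolEntry> pool = List.of(
            new PoolEntry("thread-A", 1),
            new PoolEntry("thread-B", 2), // recorded while epoch 1 is still flushing
            new PoolEntry("thread-C", 1));
        System.out.println(constantsForChunk(pool, 1));
    }
}
```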
essentially the moral of the story is that it allows us to reduce safepoint pause time by writing these constant pools outside of the safepoint. And another way we actually reduce safepoint pause time is by having a dedicated JFR thread flush the global buffers to disk periodically throughout the epoch, so it's not actually happening in the safepoint; there's less work to be done when we are stopping the world to flush the buffers to disk. Right. One related note on safepointing is the question: can a chunk rotation safepoint interrupt concurrent event emission that may be happening in other threads? So we'd have a scenario where the safepoint and the epoch transition actually interrupt the event emission and separate the constant data and the event data into different epochs and different chunk files, and then it would be unreadable. So that's the scenario that is in question right now. And in OpenJDK, in HotSpot, the JFR code is written in C++; it's native code, so it can't actually be interrupted by a safepoint, so it's not really an issue at all. However, in substrate VM it's Java on Java: the VM code is written in Java, so the JFR stuff is Java code and could potentially safepoint at a very inopportune moment. So how do we prevent that from happening in substrate VM? How it's done is we have this annotation called @Uninterruptible, and what that does is, at build time, it prevents the insertion of safepoint checks, so that code annotated with @Uninterruptible doesn't actually safepoint at all. So you'll find that a lot of the JFR code is sprinkled with this annotation all over the place in the VM, especially code dealing with buffers and constant pools and event writes. But this has pretty big consequences for the implementation itself, because uninterruptible code that can't safepoint can only call other uninterruptible code that can't safepoint, which means a lot of the JDK
code that's written in Java is off limits. So we can't use things like the normal hash tables, re-entrant locks, etc.; we have to kind of roll our own versions of those which are uninterruptible. Another thing is that we can't even use managed memory on the Java heap, because that can induce a garbage collection, which requires a safepoint, and that's not uninterruptible. So we have to use unmanaged native memory in order to craft our own data structures to deal with a lot of these things, so it's a little bit of work dealing with that. And the last thing I want to talk about, the last difference I want to mention between JFR in substrate VM and HotSpot, is related to how JFR interfaces from the Java-level JFR code to the VM-level JFR code. In OpenJDK it happens in the JVM class, which you can see on the right side of the slide, and these are basically the points where the Java-level JFR code in the JDK calls down to HotSpot at the VM level using JNI. So we reuse that code in Native Image, we reuse that Java-level JFR code from the JDK, but there's no underlying HotSpot implementation to call into. So how do we resolve that mismatch? We use substitutions, which Foivos talked about a little bit, but I'll mention it again: essentially what they do is allow us, at build time, to specify redirects from these Java methods to our own implementation, the JFR VM-level code. So you can see markChunkFinal is highlighted, and that corresponds to the Java-level code on the right side of the slide (I keep getting mixed up). So we're actually grabbing that and redirecting it to our own substrate VM-based implementation of that code. So that's how we kind of resolve that mismatch. Yeah, with that said, that basically concludes my presentation. If you're interested, there are further links for more reading; there's some documentation and some blog posts as well, and you can always
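The shape of such a substitution looks roughly like this. Note the annotations below are locally defined stubs so the sketch compiles standalone; the real ones are @TargetClass and @Substitute from GraalVM's com.oracle.svm.core.annotate package, and the target method shown is only a plausible example rather than the exact code:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Sketch of a native-image substitution: at build time the framework
// redirects calls to the target method to the annotated replacement.
public class SubstitutionSketch {
    // Stub stand-ins for GraalVM's real annotations, defined here only
    // so this sketch is self-contained.
    @Retention(RetentionPolicy.RUNTIME) @interface TargetClass { String className(); }
    @Retention(RetentionPolicy.RUNTIME) @interface Substitute {}

    // Hypothetical example: a Java-level JFR method that would normally
    // call into HotSpot via JNI gets a substrate VM implementation.
    @TargetClass(className = "jdk.jfr.internal.JVM")
    static final class Target_jdk_jfr_internal_JVM {
        @Substitute
        static void markChunkFinal() {
            // ...substrate VM's own chunk-sealing logic would live here...
            System.out.println("substituted markChunkFinal called");
        }
    }

    public static void main(String[] args) {
        // At image build time the substitution framework performs the
        // redirect; here we just invoke the replacement directly.
        Target_jdk_jfr_internal_JVM.markChunkFinal();
    }
}
```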
approach me outside as well if you have more questions. Yeah, how are we doing for time, Chris? Okay, if there are any questions, I'm happy to answer them now. (Audience:) You just did such a good job explaining it. Thanks. (Audience:) On substrate VM, did you measure the impact on time to safepoint? Because if that code is uninterruptible, you trade off time to safepoint. Yeah, I could imagine. I'm not really sure of the exact figures, I can't really give you a number, but I know what you're saying; it would potentially be an issue. I'm not really aware of it being one, but yeah, that's definitely a concern. But it's not just the JFR code that's marked as uninterruptible: a lot of the GC code as well, a lot of the low-level operations, they must also be uninterruptible. So it's not just JFR. (Audience:) Understood, thanks. Yeah, actually, to tag on to that: a lot of the JFR code is really just instrumenting other low-level code which is already uninterruptible, so it's like collateral damage; it's not really an issue to add a little bit more onto code that's already uninterruptible, such as JFR GC event handling and slow-path allocation, places where you can't safepoint anyway. Thank you. Okay, thank you for listening.
Ruby on the Modern JVM: Fibers, FFI, and More
Our next speaker is the esteemed and very famous Charlie Nutter, so let's give him a round of applause. Alright, microphone working. Can you hear me okay back there? Alright, great. I've got a lot to cover. This is going to be a retrospective of all the problems that we've had trying to get Ruby onto the JVM, and then a little status report along the way about how we're doing on making the JVM catch up with those needs. Charles Nutter, that's me. There's my contact information. I've been working at Red Hat now for, I think, 12 years. Before that I worked for Engine Yard, a Ruby software-as-a-service company, and I was at Sun for the three years before that as well. So I probably won't have time for interactive Q&A, but if you contact me online or post something in the Matrix channel, I will definitely get to it. I want to answer all the questions. Okay, so a little brief review of JRuby here. Ruby for the JVM, not too surprising there. It runs on Java 8 currently, but because of all the cool stuff, and because we've ridden the Java 8 horse into the ground, we are going to be 17 or 21 minimum next release, which should be this year. In development for a long time: running Rails since 2006, and by probably 2008 we started having production users. And we're the only alternative Ruby that's really had production users during that time. There have been a few other experiments, but nothing's ever really taken off as well as JRuby. Maybe the most successful off-platform language brought to the JVM; Jython and Rhino/Nashorn might give us a run for our money, but given the maintenance state of those libraries, I think we're probably currently the most successful and most widely used JVM language that was never envisioned for this platform. So we've been chasing Rails all the time. That's kind of the gold standard for whether we can say we're a Ruby implementation or not. And after about two years of good work, we managed to get Rails working back then.
Running Rails tests, running CRuby's tests, running all of the different libraries' suites, as much as possible. Compliance testing for Ruby has improved over the years, but we pretty much just run everything to try and make sure that we really are compatible. And very quickly, we ran into some serious challenges trying to bring a language like Ruby to the JVM and make it also usable and perform well. This is the quick summary. These are all areas I'm going to cover during this talk, so we will just blow right through here. These challenges help us grow both as a platform and as a community. They open up new worlds to Java developers, to JVM developers. They open up the potential of bringing new and unusual languages to the platform. It opens up the entire world of native libraries, native features that are out there that we don't necessarily have on the JVM. So we really need to focus on: what are these challenges in bringing a language like Ruby to the JVM, and how can we make the JVM better to support languages like this in the future? So we'll start with strings and regular expressions. Excuse me for a moment. Okay. So one of the first things we ran into: JRuby's strings were just based on Java strings, and we used Java's regular expressions. And at the time, regular expressions were being used in very unusual ways in the Ruby world. We ran into a case in an early version of Rails where they were using regular expression matching to parse HTTP requests that came in and look for, say, a MIME header for an image, and pull the image out. So you'd end up with a regular expression operating against a very large piece of data. And the built-in Java regular expression engine is implemented in such a way that for certain types of expressions, like an alternation like this, it actually will recurse and recurse and recurse. And then very easily you can blow the stack out by feeding it too much data; just giving it too much data to process will blow it up.
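The failure mode described above is easy to reproduce. The pattern below is a representative alternation, not the exact expression from the Rails bug, and whether the long input actually overflows depends on the JVM's stack size, so the sketch catches the error rather than assuming it:

```java
import java.util.regex.Pattern;

public class RegexStackDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(a|b)+");
        // Short input is fine:
        System.out.println("short=" + p.matcher("aaab").matches());
        // Long input may blow the stack, because java.util.regex handles
        // repeated alternation groups recursively.
        String input = "a".repeat(10_000) + "b";
        try {
            System.out.println("long=" + p.matcher(input).matches());
        } catch (StackOverflowError e) {
            System.out.println("long=StackOverflowError");
        }
    }
}
```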
So we had to find other options. JRegex was an early one that worked against Java strings, and we ran with that for quite a while. But eventually it came to be that the Java string itself was insufficient for us to represent Ruby's string behavior. Here's what that exception looks like. Very simple match here: it's just 10,000 of the 'a' character followed by a 'b' character, with that same regular expression. It'll blow up on every version of the JVM that's out there, or anything based on OpenJDK classes. And I believe this is still an issue. So as we went forward and had to have a more robust regular expression engine that would work with a more custom type of string in JRuby that matched CRuby's behavior, a contributor to JRuby ported over Ruby's regular expression engine. So Oniguruma is the C library that Ruby uses for regular expression matching, and ours is Joni. It's a bytecode-based register machine, so there are no stack issues; it doesn't deepen the stack at all. It matches against byte arrays, and it'll be clear in a moment why we need that. It also can do byte array matching with pluggable encodings, so it works regardless of what encoding those bytes are in, and potentially even with a different grammar for regular expressions. This library was ported to characters and used by Nashorn to do JavaScript regular expressions. They had the same sort of problems, and so they used our library but made it specific to JavaScript. So you see that I'm matching against byte arrays here, and I said that strings were insufficient. Well, the problem is that Ruby's string is not just one encoding, it's not just a blob of characters; it is represented as a byte array with an encoding. So within the system you can have many strings that all have different encodings, and it all needs to be negotiated together when you combine them or use them against each other. So we had to follow suit, essentially.
We had to make a new string class for JRuby that used bytes, used a byte array, represented all the encodings, and we had to port all of the encoding logic over, and the transcoding logic, which was a major piece of work. And essentially we have our own string now, and we've had this for over a decade, because Java strings just could not emulate all of the behaviors we needed for Ruby. This does complicate interop with Java, but there are improvements coming there. So jcodings is the encoding library that we use. This provides the full set of encodings, similar to what's in CRuby, and the transcoding from any encoding to another encoding, which is used internally when we have two different strings come together and need to negotiate that. So where do we stand on the JVM today? Well, rather than just having a character array inside strings, we do actually have a similar model now, where there's a byte array, but only two encodings are allowed inside that byte array: ISO-8859-1, which is essentially ASCII plus another 128 characters, or UTF-16, the old standard char format. So this does lower our cost going to and from Ruby and Java when we are just using an ASCII range of characters, but UTF-8 would be nice to have there, because most Ruby strings are going to be UTF-8, probably with at least one multi-byte character in there. So that all has to be copied around a lot more, a lot less efficiently. And java.util.regex does still blow the stack. I would love to see it get replaced at some point, but I don't know if there's any work being done to do that. Okay, so the next area that we ran into was that we have a nice runtime, but the performance wasn't there. We needed to be able to generate JVM bytecode from Ruby code and have it optimize like regular Java. So the interpreter was good. It was similar to Ruby 1.8 before they moved to their own bytecode runtime. It was very difficult for the JVM to optimize.
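The byte-array-plus-encoding string model can be sketched in miniature (a toy stand-in, not JRuby's actual RubyString, which also tracks code ranges and shares byte arrays, among other things):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Minimal sketch of the Ruby-style string model: a byte array plus an
// encoding tag, rather than a fixed char[]/UTF-16 representation.
public class ByteStringSketch {
    final byte[] bytes;
    final Charset encoding;

    ByteStringSketch(byte[] bytes, Charset encoding) {
        this.bytes = bytes;
        this.encoding = encoding;
    }

    // Transcoding: re-express the same characters in another encoding,
    // the kind of negotiation needed when two differently encoded
    // strings are combined.
    ByteStringSketch transcodeTo(Charset target) {
        String decoded = new String(bytes, encoding);
        return new ByteStringSketch(decoded.getBytes(target), target);
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // one multi-byte character in UTF-8
        ByteStringSketch utf8 = new ByteStringSketch(
            text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        ByteStringSketch utf16 = utf8.transcodeTo(StandardCharsets.UTF_16BE);
        boolean roundTrip = new String(utf16.bytes, utf16.encoding).equals(text);
        System.out.println(utf8.bytes.length + " " + utf16.bytes.length + " " + roundTrip);
    }
}
```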
We could walk through this stuff quickly, and it was very easy to write as an interpreter, but you had a lot of polymorphic calls within that AST. The JVM never really could see the optimization path through there. So we had to write a JIT. The reason that we did not just immediately start compiling all Ruby code into bytecode is because, for example, the Rails library will load into memory thousands of classes and tens of thousands of methods. That's a massive load for us to put onto the JVM when only a few hundred or a few thousand of those are ever going to be called. It also was slower for us to go straight to bytecode, because the bytecode would end up being interpreted by the JVM's interpreter, which actually turned out to be slower than our interpreter after it JITs. So it made more sense for us to leave the code in our interpreter until we saw it really needed JVM bytecode, and there we ended up with basically the first mixed-mode JIT runtime on top of the JVM. Later on, we did move to a more modern compiler design. We had a compiler engineer, Subbu Sastry, come in and help, and he basically helped us move a lot of the little peephole optimizations I was doing in my JIT up to a more modern compiler architecture. So this simplified the JIT, simplified what I had to write as far as emitting bytecode, which then let me explore performance a lot more in other ways. And then of course, as we moved forward, we got invokedynamic in Java 7. It's been steadily improving since then. It's used incredibly heavily in JRuby. If you take the bytecode of our JIT output from Ruby code, it's pretty much just stack moves and invokedynamics for almost everything that we do. We will access local variables normally, but everything else has to have some little dynamic aspect as part of Ruby. So we use it very heavily, and probably more heavily than almost any other project on the JVM. This is invokedynamic performance over time, from Java 8 up to 17.
Really happy to see the performance improvements; every release it gets a little bit better. Looking at what we're doing with a more numeric algorithm, we get a bigger boost out of it. With something that's just walking a lot of objects, we're already kind of close to where Java would be on just walking an object graph, but we're still seeing that we do get some improvements from running invokedynamic, making that more direct. Really cool is when we plug in a different JIT compiler here. So this is now using invokedynamic on the Graal JIT. And for a numeric algorithm where we're creating tons of numeric objects, we really see the impact of partial escape analysis helping us. And this is now really starting to get to the point of Java-level performance for a numeric algorithm. These are the cases where it really helps. But over time, we have not seen that Graal is generally faster, and we don't generally recommend it unless you have something numeric or something that's doing a massive amount of allocation of temporary objects. So where are we today? One of the problems that we have generating individual methods or compiling at runtime is that ideally we want that compiled method to go away if the class goes away, or if it's a one-off generated method that eventually doesn't get used. So it's a class per method, and the only way to make those garbage-collectible is a class loader per class per method. So every method that we JIT into the system has both a class surrounding it and an entire class loader, just to work within the confines of the garbage collector. There's no other way to make garbage-collectible classes right now on the JVM. There is the anonymous class loader, but that's a hidden class, and we don't try to access that right now. Indy is clearly working very well. We're going to be doing more advanced call sites, where we will have special-case code along one fast path and then a slower dynamic path if it turns out it's not the logic we expected.
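The kind of rebindable call site that invokedynamic provides can be demonstrated with the java.lang.invoke API directly. This sketch uses a MutableCallSite the way a bootstrap method for a dynamic-language call site might; the method names and fast/slow split are invented for illustration:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;

// Sketch of the machinery behind an invokedynamic call site: a call site
// whose target method handle can be swapped when an inline-cached
// assumption (e.g. the receiver's Ruby class) turns out to be wrong.
public class IndySketch {
    static String fastPath(String arg) { return "fast:" + arg; }
    static String slowPath(String arg) { return "slow:" + arg; }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType type = MethodType.methodType(String.class, String.class);
        MutableCallSite site =
            new MutableCallSite(lookup.findStatic(IndySketch.class, "fastPath", type));
        MethodHandle invoker = site.dynamicInvoker();

        System.out.println((String) invoker.invokeExact("x"));
        // Rebind the call site, as a runtime would after a guard fails;
        // subsequent calls through the same invoker hit the new target.
        site.setTarget(lookup.findStatic(IndySketch.class, "slowPath", type));
        System.out.println((String) invoker.invokeExact("x"));
    }
}
```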
It is a tricky API to use, but we have a lot of tooling that we've built around it. I've got some links to older talks of mine that go into detail on that. Okay, I think we're doing pretty good on time here. I know I talk fast. Come back to the video and play it at like half speed, and then maybe you'll catch everything that I'm trying to say here. So the next big area that we ran into was native interop. The CRuby world really lives in a POSIX native C environment. It's almost a DSL for writing POSIX code, really. And originally that's kind of what Matz, the creator, wanted. He wanted something where he could write C, but essentially with a nice API, a nice language on top of it. So they very heavily use JNI-like extensions to the runtime for most of their native access. This is clearly way too invasive for JRuby. It calls into the internals of their object structures. It has direct access to the heap, direct access to garbage collector endpoints. Nothing that we can emulate efficiently in JNI, and we have tried. So we ended up pushing people more towards using programmatic access, like Project Panama, like libffi, rather than writing C extensions for CRuby to wrap a library. Let's just wrap the library by writing a little bit of Ruby code. And so we started out with the Java Native Runtime. It's basically our API for calling from Java down into native code and native memory. And then on top of that, we ported the Ruby FFI layer over with some invokedynamic magic, to try and make that all as clean and fast as possible. Java Native Runtime is actually a set of projects. Up at the top, jffi is the wrapper around libffi. That's where we ship about 20 different binaries in the jar for all the base platforms that we support. Libffi is in there, and we're just using standard libffi with some extra wrapper logic around it. jnr-ffi is kind of the baseline user API. If you're familiar with JNA, this is that level, where you say: I need a struct that's laid out like this.
I need a function that takes these arguments; make these calls, allocate this memory. Then above that, we realized there were a lot of functions and a lot of behaviors that people were going to be rebinding over and over if we didn't provide them. So we have jnr-posix, which is a slowly growing corpus of standard POSIX functions bound on top of jnr-ffi. So you can go in there and you can call things like posix_spawn, or open a file, or do native IO. You can even call fork, and it's a lot of fun to see what happens when you do that. jnr-enxio, extended native cross-platform IO, builds on jnr-posix and provides an NIO channel that is all native down-calls. So where we can't get selectable standard IO on the JVM, where we can't get selectable sub-process channels, we can use jnr-enxio to have actual interactive control over standard input, standard IO, and sub-processes. You can actually use JRuby to spin up a vim instance and it will have full console control and work properly. Basically impossible to do with the standard ProcessBuilder stuff in Java. jnr-unixsocket, not too surprisingly, just wraps this other stuff with the Unix socket calls. And then jnr-process, like I mentioned: we have our own selectable channels for processes. You can use this as a Maven library; you pull it in and you'll have the same API as ProcessBuilder, but you'll get channels, selectable channels, out of it instead of streams. So it's available right now for that. This is a little bit of what Ruby FFI looks like. Pretty straightforward: we're setting up a structure with particular widths of fields, attaching a function, gettimeofday, and then we can call it directly. Under the covers, this all uses JNR and ideally inlines as much as possible up to the native down-call. So today, native interop on the JVM. Of course, we have Panama coming along, so the talk before me, Maurizio's talk, that's where all the information is about where things are going, and we're really excited about that.
I actually wrote the original JEP for Panama, which has since been reworked many times, but we've been needing this for over a decade now, and we had to make our own but don't want to maintain it anymore. JNR is pretty much the fastest way outside of Panama to do these native down-calls; in some cases actually beating JNI, because there are extensions to generate a little JNI function in memory using assembly that can cut out some of that overhead, rather than just doing pure programmatic calling through libffi. Jextract from Panama is coming along. We're also hoping that we can use that at runtime as a library, access those data structures internally, and generate Ruby FFI code. This would be kind of the last mile for getting Rubyists to switch from writing C extensions to using FFI. If we could generate the Ruby FFI code the same way we do the Panama code, there'd be nothing to stop them at that point. There is back-end work happening right now on JNR to integrate it with Panama. Michelle at Oracle is working on that, and I'm hoping that we'll see something in the next couple of weeks. A little more review of some of these ideas. If we have jextract that can generate Java code, we should be able to use jextract to also generate Ruby FFI code. That'll be the next big fun toy to play with as of Java 22. We also use the existing SQLite JDBC driver. Rubyists like to use SQLite for local development. But it's going through a JNI back-end; you have to make sure it's available for the platform that you're on. They are also playing with Panama behind the scenes. Early numbers look like two-ish times faster than the JNI wrapper around SQLite that they have. So this is coming along. We are also integrating a new Ruby parser called Prism, which is a simple C library that all the implementations can share, so that we are all using the same Ruby parser. That we will integrate through Panama as well.
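For comparison, here is what a minimal Panama (java.lang.foreign) down-call looks like, the style of binding that jnr-ffi and Ruby FFI provide programmatically. This sketch needs Java 22 or later, where the FFM API is final (or --enable-preview on 21), and uses getpid purely as a harmless example symbol:

```java
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Minimal Panama down-call: look up a libc symbol from the default
// lookup and bind it to a method handle we can invoke directly.
public class PanamaSketch {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MethodHandle getpid = linker.downcallHandle(
            linker.defaultLookup().find("getpid").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_INT)); // int getpid(void)
        int pid = (int) getpid.invokeExact();
        System.out.println("pid>0=" + (pid > 0));
    }
}
```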
And we'll use Panama to make it much faster for us to down-call into this library, get our AST back out, and then proceed. Interestingly, we're also exploring using Prism as a Wasm-compiled library running on the Chicory Wasm implementation on top of the JVM, so that we can parse Ruby code using a native library even if we're not on a platform it's compiled for. And it's amazing that that works. All right, moving along here. So lightweight threading is the next big one. Around Ruby 1.9, they introduced fibers, a coroutine-like concept, a micro-thread concept. You would still have your native threads there, but they can bounce around to different fibers at any given time. And you get a little structured concurrency; structured use of fibers allows you to do multiple tasks in the same thread. There's also been a push toward structured concurrency in the Ruby world now, where fibers can wait on IO or make a blocking call on IO. The runtime will see that and schedule another fiber to run in its place while it's waiting. So you can easily handle tens of thousands, hundreds of thousands of concurrent connections, for example, without blocking that many threads or having to write your own select loop and whatnot. So, fibers on JRuby: without a coroutine API at the JVM level, of course, we've had to use native threads. And that clearly only scales up to a certain number of threads. With the structured concurrency example, we could have potentially thousands of fibers in the system, and it's almost impossible for us to support that with full, heavy native threads all along the way. Ruby also primarily uses internal iteration. Collections just have to implement an each method, basically a forEach, and all collections in the system then expect you to pass a block of code into it. Well, how do you turn internal iteration into external iteration? You have to use a coroutine that can yield values back out while staying inside that loop.
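Turning internal iteration into external iteration can be illustrated in plain Java. This is a sketch of the idea only, not JRuby's implementation: a background thread plays the role of the coroutine, and a SynchronousQueue plays the role of the yield handoff. All names here are invented for illustration.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.SynchronousQueue;

// Adapts internal iteration (forEach-style) to an external Iterator.
// The producer thread "yields" each element by blocking on a handoff,
// which is exactly the job a cheap coroutine/fiber would do.
public class ExternalIterator<T> implements Iterator<T> {
    private static final Object END = new Object(); // end-of-iteration sentinel
    private final SynchronousQueue<Object> handoff = new SynchronousQueue<>();
    private Object next;

    public ExternalIterator(Iterable<T> source) {
        Thread producer = new Thread(() -> {
            try {
                for (T item : source) handoff.put(item); // "yield" one value
                handoff.put(END);
            } catch (InterruptedException ignored) { }
        });
        producer.setDaemon(true);
        producer.start();
        advance();
    }

    private void advance() {
        try { next = handoff.take(); }
        catch (InterruptedException e) { next = END; }
    }

    @Override public boolean hasNext() { return next != END; }

    @Override public T next() {
        if (next == END) throw new NoSuchElementException();
        @SuppressWarnings("unchecked") T result = (T) next;
        advance();
        return result;
    }

    public static void main(String[] args) {
        Iterator<Integer> it = new ExternalIterator<>(List.of(1, 2, 3));
        while (it.hasNext()) System.out.println(it.next());
    }
}
```

With one full native thread per iterator, this is exactly the cost problem the talk describes; with lightweight coroutines, the same shape becomes cheap.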
So now we've got the potential for all sorts of fibers, hundreds of thousands of fibers all over the system, just because we're iterating collections with an external iterator. I'm going to kind of blow through this because the next talk will cover fibers a bit more. The example here is handling requests on a thread. We've got a thread, a request comes in. Now it's waiting for more information, and the thread's not being used. Finally we get more data and can proceed with the rest of our request handling. With fibers, of course, we can have multiple different fibers handling different connections on the same native thread. So the request comes in, and this fiber's waiting on IO. Well, let's spin up another fiber that can handle the next request that comes in. And they can multiplex use of that same thread. This is what we're starting to see more and more in Ruby, and this is where it will be critical for us to have lightweight fibers, lightweight coroutines, on JRuby. Okay, so here is a little benchmark, a little example, trying to test how long it takes to spin up 100,000 fibers and run them all to completion. So there are 100,000 live fibers in the system at any given time in this benchmark. And of course, as you would expect, this doesn't work. We can't spin up 100,000 native threads, and it just crashes in horrific ways. I'd love to see this crash in less horrific ways, but ideally we just move away from this problem altogether. And that's where we get Project Loom. So on the JVM today, as of Java 21, we now have an official API for lightweight coroutines, for essentially fibers, that maps almost perfectly to what we need in the Ruby world. And we've already got this integrated. We integrated it a year ago, actually, and have only made minor changes along the way. I'd like to show this just to demonstrate how little work we had to do to switch from our built-in native-thread fibers to the virtual-thread fibers.
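As a rough illustration of why virtual threads fit this workload, here is a minimal Java 21 sketch, not JRuby's actual code, that runs many tasks as virtual threads. The same count with platform threads would likely exhaust native thread resources, which is the crash the benchmark demonstrates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: run many "fibers" as JDK 21 virtual threads.
// 100,000 platform threads would typically fail; virtual threads are cheap.
public class FiberDemo {
    public static void main(String[] args) throws Exception {
        int count = 10_000;
        CountDownLatch done = new CountDownLatch(count);
        // One virtual thread per task; close() waits for submitted tasks.
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < count; i++) {
                exec.submit(done::countDown);
            }
        }
        done.await();
        System.out.println("completed " + count + " fibers");
    }
}
```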
I was shocked that this was all it took, and suddenly this benchmark actually could run. It could actually spin up all of those fibers and run them to completion. So, amazing work on the Loom side, and we're very happy with the results. Performance-wise, here I drop it down to 10,000 so that I can actually get the threaded version to work. Clearly we're getting significant gains on context switching between different fibers, because Loom is just better at that, and there's a much lighter-weight process for going from one fiber to another on the same thread. Not quite as fast as CRuby. I suspect this is probably due to us relying on a very general-purpose scheduler for the virtual threads behind the scenes, where we really just want to say, this fiber's done, now run this one, rather than unblock that fiber and wait for the scheduler to pick it up. I think we can make up most of this overhead. Similarly on M1, and I don't know if this is general to ARM or not, but these are the performance results we have. I could not get 10,000 to go on M1; I had to drop it down to like 2,000 or 3,000. The impact is a bit more here, but again, I'm hoping that as Loom evolves and as we use it better, we'll see improvements. Five minutes for the last section here. The classic problem with JRuby is still startup time. If we did not have startup time, we probably would have won the Ruby war a long time ago. The number one, two, and three complaint about JRuby is how much longer it takes to start up. The JVM is just not designed to start up quickly. Most of the core JDK code starts in the interpreter; it takes a long time for that to optimize, and then your application can start getting fast. We make it worse because we interpret Ruby code, and then every once in a while we'll just throw more bytecode at the JVM: okay, now this call site's actually bound to a bytecode method, not the interpreter. We're just confusing the hell out of it all the time.
This is one of the reasons we actually do lazy compilation to bytecode: we want to reduce the amount of overhead we force onto the JIT at the JVM level. Let me walk through JRuby's architecture here quickly. We have our Ruby parser, which gives us our Ruby AST; we compile that into our intermediate representation and interpret it for a while, and here's where it becomes mixed-mode. Eventually we will generate bytecode for those methods, and then hopefully the rest of it all works and optimizes to native code. One of the early ways we've tried to improve startup time is basically to turn most of that off. Rather than turning anything into bytecode, rather than even running C2, the fast JIT in HotSpot, we turn on only C1, the simple JIT in the JVM, and we only use our interpreter. This improves our startup time by about 2x. By far the best thing we've had so far. Now, another potential way to fix this would be ahead-of-time compilation. Of course, GraalVM solves this very nicely for that world, but it completely disables all of the dynamic things that we want: general-purpose invokedynamic and method handles essentially just don't work. Then beyond that, we would have to pre-compile all of our code to bytecode, and we'd have to link it in some way that it could be ahead-of-time compiled to native. This is just not going to work for us. We're hoping that Leyden will actually pick up here with an ahead-of-time option that can also do some dynamic stuff at runtime. Where are we today? The solutions we're looking at in the short term mostly surround checkpointing features. Checkpoint and restore in user space, the CRIU API on Linux, allows us to run JRuby to a certain point, like just after startup, and then save off a copy of it that we can start quickly later on. This is being standardized in Project CRaC, an unfortunate name, but a lovely project.
This is working pretty well with JRuby right now; we're just experimenting with it. We are still hoping that Leyden, with some ahead-of-time compilation that still enables the rest of the JVM features, will be our ultimate solution. You can see here, this is CRuby on the left side just doing a baseline startup, then JRuby's baseline startup without our --dev flag, which turns off all of the optimization. The --dev flag here, not quite 2x, is giving us a good boost. CRaC, of course, is significantly faster than all of those. We've actually gotten to a point in execution where we can start running Ruby code now, starting to get competitive with CRuby, which was essentially designed for fast startup. Same example generating a Rails app: again, getting very close to where CRuby sits on these numbers. So, wrapping up in the last minute here: JRuby is a test bed for all of these crazy JVM things that we're doing. We're pushing all of these edges. So whether you care about Ruby or not, we are the best invokedynamic torture test. We're going to be hitting Panama extremely hard as it gets integrated into the system. All of the virtual threading will be massively exercised by all of the structured concurrency stuff coming on the Ruby side. So if you're interested in helping us integrate any of these features, or if you're an implementer interested in testing these features at scale, JRuby is definitely something you should look at. This is more background; I'll let you take a quick picture of this if you want. These are talks I've done in the past that basically cover all of my many complaints about the JVM. That list of complaints gets smaller and smaller every year, thankfully.
The Challenges of Running the Fuzion Language Natively on the OpenJDK
Okay, people, we are ready for the next talk. Please listen up, quiet down, and get ready for the next talk. Thank you. Okay, thank you, Andrew. So I'm going to talk about the Fuzion language, and a bit more concretely about how we are running this on the OpenJDK. It's basically the problem of mapping a functional language to efficient Java bytecode. Background on me: I've basically been doing compilers all of my life. I won't go into details, but what is important right now is that I'm working at Tokiwa Software in a team of three, where over the years we have together developed the Fuzion language and its implementation. A quick overview of this talk: I'm going to give a very short intro to the Fuzion language, and the rest I will show you through examples in code. I'm going to go mostly into different types and how they are realized on the OpenJDK. We're talking about tagged union types, about product types with value semantics, about type parameters, and about how to do monomorphization and inheritance. I'll also talk a bit about the class file verifier. So, I'll start. You can't hear me on the mic. Can't hear you? Yeah, yeah. We're not getting anything. Is it all plugged in properly? Turned on. Is it on? No, it's on. Okay. I'm sorry. Okay. Sorry for those online who missed that. Okay, I will start with a quick intro to the Fuzion language. Fuzion is based on simplicity. It's all based on one single concept, a feature, which takes the role of Java classes, Java methods, and functions in other languages. Fuzion is aiming at safety-critical systems. We see more and more systems becoming safety critical, and in that area, we see that tools can make developers' lives much easier. Briefly, the language: Fuzion is statically typed. It is polymorphic in different ways: it has union types, parametric types, and object-oriented-style inheritance. And it is pure, using effects to model side effects. The Fuzion tool chain looks like this.
We start with Fuzion source files that go through a front end, compiled into Fuzion modules that are then processed by a middle end and by a static analyzer into a Fuzion application represented in the intermediate representation. That is then the input for different back ends. In this talk, I'm going to go into details of the JVM back end, which then transfers this intermediate representation into bytecode in Java class files that are run on a JVM. The first aspect I want to focus on is tagged union types. I'll explain immediately what that is. As an example, I take an oven. I implement an oven in Fuzion, and an oven has a setting: it can be off, or it can be on with a temperature setting given in degrees centigrade or Fahrenheit. So there are three options in that union type. Off is just a unit type; it's just a value with no other state. The temperature setting is either a centigrade temperature as an integer or a Fahrenheit temperature, which is a float in this case. And within the oven, we can then use a union type value and match on this setting and do different things depending on whether it is off or it is a temperature given in centigrade or Fahrenheit. Now, I have to make some space here. Now, when we compile this to Java (I'll show Java source, not Java bytecode, to explain what we do), we actually compile such a tagged union type into several fields. First, we have a tag field; that's why they're called tagged union types. That decides: do we actually have an off value (oops, I have to make space again), or do we have a temperature, and what kind of temperature? And in case it is a centigrade temperature, we need another field to store that value, or another one for Fahrenheit. So we basically have several fields to store a tagged union type value. I'll drive this example a bit further now. This is the most generic case.
We have the tag and the different kinds of values. And during the talk, I will go more and more specific until I reach the point where the oven literally disappears into the void. The next step towards that: if we take a bit more of an object-oriented approach, we can use a temperature that is actually a reference to an object that provides an inner feature to get the temperature as a Celsius value. So it's a union type of off or the temperature, and the matching changes accordingly. Now, how this could be implemented is that we would have a tag that decides whether it is off or it contains a temperature. But this is not what somebody would typically do in Java. This is a typical case where a null pointer or null reference would be used. So this is also what our back end does: it just uses one pointer value in that case and uses the null value to represent the state off. That's the so-called nullable implementation of the tagged union type. Going further, with a more complex example, we now extend the oven to also have a clean mode and maybe some error state, which essentially means we have four different possibilities, and we need the temperature and the error state. But of course, there's never a case where the temperature and the error state are both used simultaneously. So we can join them into a single object value, because only one of them is used at a time. Now, the tag field decides which of these four states we have, but actually we could use specific values for the object reference to represent the off and the clean states, such that this all collapses into a single reference field for all four states. This is also what our back end does in this case. So such a tagged union collapses into a single field. It gets even simpler if you have a very simple oven that doesn't allow any temperature setting. It just has the modes on, off, clean, or error. That is basically a classic enumeration.
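The lowerings described above can be sketched in plain Java. This is illustrative only, not Fuzion's actual generated code, and all names are invented:

```java
// Sketch of how a tagged union like `off | temperature` might be lowered to Java.
public class OvenSetting {
    // General lowering: an explicit tag plus one field per payload kind.
    static final int TAG_OFF = 0, TAG_CELSIUS = 1, TAG_FAHRENHEIT = 2;
    int tag;
    int celsius;       // only valid when tag == TAG_CELSIUS
    double fahrenheit; // only valid when tag == TAG_FAHRENHEIT

    // Nullable lowering: when the union is `off | SomeReference`,
    // a single reference field suffices, with null meaning "off".
    record Temperature(int celsius) {}

    static String describe(Temperature t) {
        return (t == null) ? "off" : ("on at " + t.celsius() + " C");
    }

    public static void main(String[] args) {
        System.out.println(describe(null));                 // the "off" case
        System.out.println(describe(new Temperature(180))); // the temperature case
    }
}
```

The enumeration and Boolean collapses mentioned next are just further specializations of the same idea: when no payload fields remain, only the tag survives, as an int or a boolean.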
Internally, this is just an integer, so we only need an integer type for that. If we have an even simpler oven that can just be on or off, there are only two states. So that falls down to a simple Boolean type and is compiled into a Boolean value by the Java back end. We can go further if we have an application and run our static analyzer over it. If the application actually never uses an oven that is on, that value can be determined to be a void type value. A void in Fuzion is not like in Java; it's much more like a never type, the result of a call to exit or so, something that really never occurs at runtime. So if you have that, we don't need any data to store, because all ovens in that application are always off. We can even go further if you have an application that doesn't even produce any ovens that are off, maybe an application that only uses a list of ovens and that list is always empty. Both are never used, so we don't even need the oven anymore, because this can all be thrown out. So much for tagged union types. The next point is product types: Fuzion has value semantics while Java has reference semantics. A small example: I want to do a small graphics example here. I'll start with a very simple product type, point, which is just a combination of two coordinates, x and y, two integers. And I pass these points to a draw function that should draw them. I won't go into details there, but I just want to show you a bit of the syntax, how Fuzion uses effects, in this case a graphics effect, to document that this function actually has a side effect. It requires an environment where graphics is available to actually draw this. Now we create a point, store it in a local variable p1, and pass that to the draw function. Now the question is, how do we do this passing of the argument? How do we represent this point? We have value-type semantics in Fuzion.
So what we do is actually split this up into two fields, two parameters here for the draw function, that are passed separately in a call. Similarly, when we create a new point, that point is split up into two local variables, or two fields, that are then assigned separately. And finally, when we make this call, we pass these two values individually. That works nicely and is performant. Problematic in the JVM back end is the case of returning a value product type with value semantics. So here we have a shear function added to our feature point that creates a new point that is returned as a result. Java cannot return multiple values, so what can we do instead? I need more space for that. I've looked into a number of possibilities for how we can return such a product type in Java. As the first baseline in that analysis, I looked at inlining. If you just inline the call, returning a value is just assigning between local variables. We can use that, but of course that doesn't work in the general case, because inlining will blow up the code, does not work for recursion, and has many restrictions. That's why I put such a flash behind it: it is not a solution for all our cases, but it gives a baseline for comparison. The typical way in Java to do this is that the called method returns a newly allocated temporary container object that just contains two integers. We could also do it the other way around: the caller could preallocate an object and pass a reference to that object to receive the results. The fourth solution I looked into was using static fields. When returning two integers, returning a point, we just have two static fields, X and Y; we store the values in there and the caller then retrieves them. I put a flash there as well, because that is not thread safe; it doesn't work when we have more than one thread. What would be thread safe would be using thread-local variables to return this value.
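Two of these return strategies can be sketched in Java. This is an illustration under assumed names (Point, shear), not the Fuzion back end's actual output:

```java
// Sketch of two strategies for returning a value-type point (x, y).
public class ReturnStrategies {
    record Point(int x, int y) {}

    // Strategy A: return a temporary container object.
    // Relies on the JIT (escape analysis) to remove the allocation.
    static Point shearA(int x, int y, int factor) {
        return new Point(x + factor * y, y);
    }

    // Strategy B: static result fields. No allocation,
    // but NOT thread safe: two threads would clobber each other's results.
    static int resX, resY;
    static void shearB(int x, int y, int factor) {
        resX = x + factor * y;
        resY = y;
    }

    public static void main(String[] args) {
        Point p = shearA(1, 2, 3);
        System.out.println(p.x() + "," + p.y());
        shearB(1, 2, 3);
        System.out.println(resX + "," + resY);
    }
}
```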
Or we could put these two static fields into our own thread instance: if our threads have a specific instance of that type with fields, those can be used for returning values thread-locally. I've analyzed these different possibilities using the Java Microbenchmark Harness, JMH. Actually, to my surprise, the allocation of a temporary object that is returned was even faster than the inlining version that I analyzed. But unfortunately, in JMH I couldn't analyze the last case of using my own thread type and fields in the thread structure. So I added my own ad hoc benchmarking framework to do the same measurements, and I got somewhat different results, but basically the same picture, and I also covered the last case. Now, I exclude those cases where I said they are not generally applicable, the inlining and the static fields; of course, we can't use those in our implementation. Next, thread-local variables are relatively expensive, so kicking that out as well, we are left with allocating temporary objects and relying on the JIT to optimize this. The JIT does this very well, but I don't know what the heuristics are behind it and whether we can actually rely on that. So for now, we're using thread-local variables to return structured data on a call. And we're looking forward to Project Valhalla coming to life, because Project Valhalla will introduce value types and will use type descriptors for so-called Q types that provide value semantics for method arguments and method results, which is exactly what we need here. What I don't see from the current state of Valhalla is whether, when you return a Q type, you are guaranteed not to have any allocations. So what I would like best is a guarantee of value semantics and no allocation on a return. Next, type parameters. Generics would be the counterpart in Java. Here's a small Fuzion example of how type parameters can be used.
This is a function that calculates the arithmetic mean of three values that could be of an arbitrary numeric type t. It just sums up those three values and divides them by the value three, but it first has to create a value of three in the given numeric type. That could be something like a complex, or a double, or whatever is fed in there. And now we can call this with integers or with floating-point values. Java's implementation of generics uses type erasure, so there's no type information at runtime, but Fuzion uses so-called monomorphization. That means we have specialized versions of the code for every type that is in use. What that means is that our back end creates, for every actual type that is used with a generic function, a specific implementation for that type, with all the types stripped down to the actual type parameter. So that's quite straightforward. Next, inheritance. Fuzion has multiple inheritance, and the question is how to implement that. The ways we've looked at include putting some type identifier into our instances and then doing some kind of table lookup and an invokestatic to call the target. We looked into how invokedynamic could help us; unfortunately, Fuzion is so static that it doesn't help us much at all. And finally, invokeinterface is actually the most useful solution for us, because it supports multiple inheritance. So in effect, what our back end does is: in most cases, our dynamic binding just binds to one possible target, so we can just use an invokestatic. Only in a few cases do we actually see that there are multiple possible dynamic targets, and then we compile them to an invokeinterface, and we have specific interfaces defined for every single function that is called via dynamic binding. So we have cases where the classes that we generate could actually implement really, really many interfaces, and we have to see how that scales with bigger applications.
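The difference between erased generics and monomorphized copies can be sketched in Java. This imitates by hand, with invented names, what the Fuzion back end does automatically for each concrete type actually used:

```java
// Sketch: type erasure vs. monomorphization for a generic mean-of-three.
public class Mono {
    // Erased generic version: one copy of the code, boxed values,
    // numeric operations dispatched through an interface at runtime.
    interface Num<T> { T add(T a, T b); T div(T a, int n); }

    static <T> T mean(Num<T> ops, T a, T b, T c) {
        return ops.div(ops.add(ops.add(a, b), c), 3);
    }

    // Monomorphized copies: one specialized method per concrete type,
    // with no boxing and direct primitive arithmetic.
    static int meanI32(int a, int b, int c) { return (a + b + c) / 3; }
    static double meanF64(double a, double b, double c) { return (a + b + c) / 3.0; }

    public static void main(String[] args) {
        Num<Integer> ints = new Num<>() {
            public Integer add(Integer a, Integer b) { return a + b; }
            public Integer div(Integer a, int n) { return a / n; }
        };
        System.out.println(mean(ints, 1, 2, 3)); // erased, boxed path
        System.out.println(meanI32(1, 2, 3));    // specialized int path
        System.out.println(meanF64(1.0, 2.0, 3.0));
    }
}
```

As the question after the talk notes, specialization happens per combination of type parameters actually used, which a whole-program static analyzer can enumerate.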
So, coming towards the end: the class file verifier. Not much to say, but the class file verifier helped a lot in developing the JVM back end compared to the C back end that we did before, because we saw so many errors much, much earlier than we would see in the C world. The status of Fuzion at the moment: the language definition is getting more and more stable; the base library is still very much work in progress. We have a JVM and a C back end, and we have basic static analysis tools available. And if you came to see the dogs, this is Fedex and Shadow, who disturbed me while working on this during the last year. This is where you find more information. Follow us on Twitter, give us stars on GitHub, stay informed. Thank you. Any questions? Hi. Hi, so you mentioned monomorphization, and you had this example with the function that takes t, which is numeric, and then you generated, I think, 10 different versions for all the numeric types. But then if you have, like, three type parameters which are all numeric, do you generate a thousand different versions? If there are three type parameters, there will be one version generated for each combination of these three type parameters that is actually used in the application. So it's what is used in the application; it's kind of a closed world? Yeah, so we have a static analyzer over the whole application. Okay, so you don't have incremental or separate compilation? It's static at compile time. When we compile, we look at the whole application. We don't have any dynamic adding of more code, so we luckily don't have the problem of having to be open to adding more there; we know exactly what will be in there. And do you have a Java FFI with the JVM back end? Do we have a Java foreign function interface? Can you call into Java?
At the moment, no. We are looking forward to using Panama there, as we've learned today, because that would be a big step for us, also helping on the C interface side, even for the C back end. We don't have any FFI for calling C code at this point. Okay, thank you. Do you have your mind made up in terms of the approach to concurrency you want to take? I mean, on the JVM, virtual threads could be an option, but at the same time, if you have a C back end, that could be really expensive to implement on your own. We do have very simple concurrency support right now, but it's basically limited to starting threads; there's not much synchronization or anything going on. Our current idea is that when we do something concurrent, we want to use the static analyzer as much as possible to statically prove that there are no race conditions and that the code is safe. The question is what channels we want to provide to actually allow inter-thread communication and all of that, and we are still looking into the possibilities. There are many, many things we could do; it's not decided yet. Thank you.
OpenJDK Project Wakefield: The Wayland Desktop for JDK on Linux
Hello. Hi. Everyone, please take a seat. We're about to, well, we are starting. This session is OpenJDK Project Wakefield, the Wayland desktop for JDK on Linux. It's going to be a bit of a show here because there are three of us, so we're going to swap over. There are only two mics; we'll do our best. A very quick intro and then I'll turn it over. So I'm Phil Race. I work at Oracle. I'm the lead for the client libraries on OpenJDK. Next to me is Alexei Ushakov. He actually used to work in the Java 2D group at Sun a long time ago, but these days he works at JetBrains. And there's Niels De Graef, who's a product manager, or product owner, product whatever, at Red Hat in the desktop group. And he is going to do our first session. I'll hand it over to Niels. Okay. That's good. Maybe I should. You should. I should. This is going to be very interesting to do. Yes. I did not sign up for some dancing, but it's fine. First, quickly, about structure: we're first going to quickly explain what Wayland is, then explain how OpenJDK tries to work with it, and the whole Wakefield project. And then finally we have an actual demo and some explanation by Alexei. So, quickly, Wayland, because we don't have that much time. A bit closer. Okay. That should be better. First of all, what is Wayland? And what is X11, for example, that it tries to replace? It's about displaying: rendering things into, let's say, a matrix of pixels, something sometimes called a frame buffer. And you usually try to get that onto a screen, but maybe you also try to stream that somewhere over the internet. And why do we need something fancy around that? Because once you have multiple applications trying to render something, you want some kind of decisions around, okay, do we put them next to each other, on top of each other, that kind of thing? So basically window management or tiling or whatever you want. And that's where a display server comes in.
And you talk a display protocol between the apps and a display server. It's usually also very related to input management, because, you know, if you are typing something, you want it to go to your browser, for example, and you don't want your keystrokes to end up in some keylogger application or something. So, quickly about X11, which is, let's say, the old thing. Starting from the right: we have the X11 server, which you normally start using something like startx, and which is going to listen on a socket, usually based on the display name, usually something like zero. And then each of your applications (imagine those two top applications being your browser or your file manager) will also connect to that socket. Sometimes you have the DISPLAY environment variable, and that decides which socket it will try to connect to, so it selects the server. Now you're going to say something like, okay, X11 server, can you please create me a new window? So XCreateWindow or something, with this width and height. And then you can do the fancy things that you normally would expect to be able to do from a window manager. Now, that whole logic of should we be doing tiling or should we be doing overlapping windows and so on, that's usually where another X11 client, which is then usually the window manager, comes in. And that actually handles all the logic, let's say. So that's how the usual thing in X11 goes. Now, X11, very oversimplified: X11 is old. It's from the 80s. Now, being old is not necessarily a problem, but it is older than, for example, Java and Linux. One of the things is that it made a lot of assumptions that don't necessarily hold anymore and that are baked into the core protocol. For example, it talks about a root window: once you have multiple monitors, you can still try to have one big frame buffer that spans all of those monitors.
But if you have mixed DPI, you get into trouble. Once GPUs get more and more complex, there are overlay planes, and you want to do fancy stuff with that for performance reasons and battery reasons. There's security: X11 allows you to screen-share anything and do input sniffing and snooping and spoofing without any kind of consent or notification. There's a dev room about Linux on mobile; I do not want a device that could actually do all of that with my private data. And there are also new UX paradigms like HDR, tiling window managers, and so on that actually make things a bit harder. Especially HDR is very hard to do in X11. So at some point people got together to create a new display protocol, which is Wayland. It's very minimal, really, really minimal. It really tries to make sure it does not fall into the trap of making assumptions again. It just says, okay, we have clients that want to send something rendered to a server, a compositor, let's say, and then we can extend things from there. It doesn't even have a concept of top-level windows, for example; you actually need an extension for that. It's called xdg-shell, if you ever want to look it up; it's very fancy. And some things you just don't want to have in the display protocol anymore. For example, screen sharing is also related to video, so we said, okay, let's try to take that out of the core protocol and do something with portals. I will explain what portals are later. So what does a typical Wayland session look like? Again, we start from the right. We have the Wayland compositor, which you start. With GNOME, for example, that's going to be GNOME Shell; with KDE, it will be KWin; with Sway, it will be something else. And it will open a Wayland socket, which clients can connect to, talking the Wayland protocol. And a Wayland client will say, okay, please create me a surface.
And then using that extension — the protocol extension xdg-shell — you have something where you can say: I want to create a top-level window, and at this size. You can't do positioning in Wayland — always fun; there are a lot of reasons for that. And, for example, another Wayland client can be XWayland, which is its own X11 server. So inside your Wayland session you can also have multiple X11 clients, which talk the whole X11 protocol, and XWayland does the translation to Wayland itself in the best way possible. And I did lie a little bit — I said the core protocol doesn't cover everything. There are some things we don't want to put into a display protocol anymore; we want to do those in portals. This is something that came up with Flatpak and Snap and all these fancy containerization methods. We want to make sure there's some kind of permission mechanism — say I want to do screen sharing: let the user choose whether that's okay or not. There's a D-Bus interface, so even from within a Flatpak you can access it, and it can then, for example, go to PipeWire and other components which do not necessarily need to live in the compositor. And with that we can go to the next step, which is how portals can be implemented, or how this can be used from within Wakefield. And I think that's the part where Philip is going to come in. Okay. So Niels described what Wayland is. And what we now have today is a project to be able to support that Wayland compositing desktop from OpenJDK. What's it all about, really? Well, the JDK is clearly a platform — it's not just an application — and it abstracts away the underlying platform, so we're not going to be exposing anything about Wayland. Today, on Linux, it happens to be an X11 client. It's basically an X application at a, you know, crude level.
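As an aside on how a process can tell which kind of session it is running in: the DISPLAY and WAYLAND_DISPLAY environment variables mentioned above are the usual hints. This is a minimal, hypothetical sketch (the method is illustrative, not JDK code); note that a Wayland session typically also sets DISPLAY for XWayland's benefit, which is why WAYLAND_DISPLAY is checked first:

```java
import java.util.Map;

public class SessionProbe {
    // Guess the session type the way a toolkit might. Pure function over an
    // environment map so it can be exercised with arbitrary inputs.
    static String sessionType(Map<String, String> env) {
        // A Wayland session usually also exports DISPLAY (for XWayland),
        // so the Wayland variable has to win when both are present.
        if (env.get("WAYLAND_DISPLAY") != null) return "wayland";
        if (env.get("DISPLAY") != null) return "x11";   // e.g. ":0"
        return "none";
    }

    public static void main(String[] args) {
        System.out.println(sessionType(System.getenv()));
        System.out.println(sessionType(Map.of("WAYLAND_DISPLAY", "wayland-0", "DISPLAY", ":0")));
        System.out.println(sessionType(Map.of("DISPLAY", ":0")));
    }
}
```

Real toolkits use richer heuristics (and the `XDG_SESSION_TYPE` variable also exists), but the socket-name-in-an-environment-variable idea is exactly the one described above.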
But to be able to support that Wayland desktop, we need to make some changes in the JDK. Some of the policies that Wayland has, which Niels touched on, around security and so on, mean that things that are just supposed to work in the JDK won't work — even when we're running as an X client via that XWayland component Niels showed on his diagram. That's what we call X compatibility mode: Wayland's compatibility mode for X11 applications. And we don't even fully work in that today. Even if we get that working, is it really the long-term solution we want? What we'd really like is to be a full-on native Wayland client. So OpenJDK Project Wakefield — there's a URL there — was started a couple of years ago, and there are people from Red Hat, JetBrains and Oracle working on it. We have a repo in the standard format for OpenJDK project repos. And what are our goals? Well, first off, we're going to support that compatibility mode properly, so we'll have a fully conformant JDK where everything works as well as it can — as well as it should — when you're running on a Wayland desktop but talking the X11 protocol. Most people will see this these days: if you log into a Linux desktop and pay enough attention, there's an option to switch between pure X.org and the Wayland desktop, which supports that compatibility mode. And right now the JDK only supports X.org. The longer-term goal, as I just touched on, is to support native Wayland. So the X11 compatibility mode has some things I'm going to touch on, but the much bigger thing is the native Wayland port.
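That X11 dependence is visible from any running JVM today: you can ask which AWT toolkit implementation was selected. A small sketch (this is not code from the talk, and the class names are JDK internals that can differ per platform and build — on a Linux X.org desktop the usual answer is sun.awt.X11.XToolkit, i.e. the JDK acting as an X11 client):

```java
import java.awt.GraphicsEnvironment;
import java.awt.Toolkit;

public class ToolkitProbe {
    // Returns the implementation class name of the AWT toolkit this JVM picked.
    static String toolkitClass() {
        return Toolkit.getDefaultToolkit().getClass().getName();
    }

    public static void main(String[] args) {
        // Headless (e.g. CI) still has a toolkit object; it just can't
        // create windows. On a Linux desktop expect sun.awt.X11.XToolkit.
        System.out.println("headless: " + GraphicsEnvironment.isHeadless());
        System.out.println("toolkit:  " + toolkitClass());
    }
}
```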
And there's a list here — I won't read it out — of the different kinds of things that we need to deal with to make all of this work. Some of what we need for the native Wayland port is really only just emerging in the latest versions of GNOME. So this work is not intended for older versions of Linux; it's something you'll want to — or have to — use on upcoming versions of Linux. And yeah, the security policies that Wayland enforces — I think that's the right word — are going to be some of the drivers for the things we need to change. For example, one of the most important issues: we have an API that lets you capture screen pixels. And capturing the screen, as Niels touched on, is something that Wayland was very clear about very early on — it doesn't like people being able to do that, for privacy reasons. But AWT has expected to be able to do that forever. We expect to be able to move the mouse, grab the pointer. We want to put our windows where we want to put our windows — Wayland will say: no, that's kind of our job. And you can't actually find out where windows are on the screen. Also, in the XWayland mode, the high-DPI support is not complete. Some of the things I described above sound like things you'd only need for a test framework, but they are actually part of our standard APIs — not many applications use them, but we have to be able to support them. And there's a bunch of bug-level fixes we've found we need to do as well. As the project went on, we actually found some bugs on the Wayland side, too. All of these things are described in detail at that URL, which I'm obviously not going to read out for you. So where are we now? JDK 21 pretty much did all of the work.
We got the spec changes in that we needed. And there is a complete new implementation of the Robot screen capture, done almost entirely — well, actually entirely — by Alexander Zvegintsev. It uses the ScreenCast portal and PipeWire. So the first time somebody tries to do a screen capture, there's a manual step of saying "yes, that's fine", and after that it's okay forever, in theory. There are some follow-up fixes going into JDK 22. Basically, if you have a desktop with GNOME 42, we should be good, and that will probably mean vendors will be able to officially support running on the Wayland desktop in this compatibility mode with JDK 22, which should ship in a month. And that's when we shift the real focus to the pure Wayland toolkit. So what's involved? A bit more about that: a complete new toolkit spans all of AWT, on both the window-management and the rendering side. All of these things here — creating windows, configuring windows, the management of everything, integration with the desktop, how you render to the surface. We can't use X11 OpenGL — GLX, really — or XRender, and X11 plus XRender is the default way we do everything on Linux today. Desktop integration — all of these different things I'm listing here need to be redone. When I was trying to describe it to somebody who's more of a VM person, it's like: well, we need a new JIT and we need a new GC. That's the kind of scope of the work. So how would you do this? Well, GTK4 makes it fairly easy for a lot of applications to just port over, because it deals with all of that and hides it from you. And Wayland doesn't have a window manager, so decorations are client-side — it's all client-side — and GTK would do that for you.
And everything there — it sounds like it'd be easy to get a lot of things up and running, but it brings in a lot of baggage. If you do an ldd or something on a running GTK4 process, you'll be paging through the output for a while. But really, one of the problems was that with GTK4 it's just really hard to get the level of control over when you render to the surface that we need. It's more work to use the low-level libwayland, which is basically the equivalent of libX11. But whenever we've tried to do something like this in the JDK — the last example was a rendering pipeline for macOS for their new rendering engine — they have a high-level toolkit intended for applications, but we needed to use the lower-level one. It just seems to work out that way every time. Anyway, there are some new challenges that native Wayland brings that aren't there in the X compatibility mode. We need a new library, libei — or however people pronounce it — which is just appearing in GNOME 45. Well, the API, I believe, is final, but I think it's the first time it's been out there. That inability to lay out windows that I touched on has some oddities: splash screens come up in the top left-hand corner of the screen, and that's not great from my perspective. So there is already a toolkit in progress, and Alexei is actually going to show you that right now. It's called WLToolkit. So, hand over to Alexei. Okay. We use a separate thread for event handling in our prototype, WLToolkit, and the actual handlers of Wayland events are dispatched on the event dispatch thread. On Wayland, rendering happens on the client side, so the client needs to notify the compositor about the region that should be updated.
So we need to have the complete frame content and then submit it to the Wayland compositor. That required some changes in AWT and Swing code to make sure that all the content is ready for submitting. Also, we use software rendering for our toolkit. Actually, software rendering was the only option in the early days of the Java platform for Swing applications, but since then the situation has changed, and the Java platform now provides hardware acceleration on all major desktop platforms. Surprisingly, the current WLToolkit implementation sometimes has better performance than XToolkit in X11 compatibility mode. For example, the SwingMark benchmark shows about 40% better performance than XToolkit. So it's quite enough for desktop applications — we can use it now. However, there are some areas where we still have lower performance: for example, in the current implementation we have about three times worse performance compared with the hardware-accelerated XToolkit. And of course, modern desktop applications need rich text capabilities — including our IDEs. So we're going to work on this and improve the performance of font rendering. Our current plan is to use Vulkan for hardware acceleration in WLToolkit. And let's see our demos. Let's try to run them. It's quite an unexpected resolution here. Here you can see a special awt.toolkit.name system property — a standard property — that is set to WLToolkit to enable this toolkit for us. And that runs SwingSet. Oh, yeah — it looks like it fits the screen. Okay. You may notice that we have unusual window controls here. That's because Wayland, in its core part, doesn't provide window decorations — so these controls were rendered by us, in WLToolkit. Okay. And let's see how it works. So, here are frames. Buttons. We have an animated curve here. Some combo boxes, dialogs. Okay. And some checkboxes working. Some more dialogs here. A progress bar demo.
A scrollable image here in a scroll pane. And sliders. Here we have a split pane. A tabbed pane with some animated images. Tables — yeah, they work quite well. And tooltips. And a tree with expanded nodes. So, as you can see, all the controls are properly rendered here. And then I would like to show one more standard demo, one that has been bundled with the platform for many years to show the Swing UI framework capabilities: the Java2D demo. It shows some advanced graphics. We can see curve rendering here — actually, it's not at full speed, so we can reduce the delay and see that it works quite well. So, we have many different things here: some clipping, color conversions, compositing rules, font rendering, image ops — those are conversions applied to images. Some stroking capabilities. A demo with different primitives and paint modes. We also have path rendering and transformations here. So, as you can see, performance is quite acceptable. Now, because of the resolution, we'll try to launch a real-world application: the Community edition of IntelliJ IDEA. Yeah. Probably it would be... Wow. Yes, it works. And here we also see that we use the same awt.toolkit.name property, in the special properties file that we use for the IDE. Here we can see the actual WLToolkit implementation — this is the constructor; it's quite difficult to see. And here is the separate thread I mentioned that handles Wayland events. Okay, that's it. Thanks for listening. Any questions, if we have time? We have a minute. So, any questions? So, what is missing for... I'm repeating the question: you showed IntelliJ working, so what is missing? It's in the details. We have picky users, and there is some stuff in some corner cases that doesn't work well, but it's generally workable. We have some users who gave us feedback about the quality. So, yeah — we still need to polish it.
If it wasn't completely clear: you can try this for yourself. Didn't you have a slide showing the branch? You can just check out that branch from the Wakefield repository, build it yourself and try it. Yep. Yes, over there. Does it work with JavaFX? No — at this point this is implemented within the JDK. JavaFX is a separate toolkit entirely, and we'd have to repeat this exercise for it, unfortunately. Yes, over there. Sorry — feedback. Did we fix bugs in Wayland? No, we reported them. We have friends at Red Hat, so we've had some calls with a couple of the developers who work on the desktop there, and even on Wayland directly, and they'd say: yeah, file a bug here, or yeah, I think that is one. So we've reported bugs and they've been fixed. Yes. Any plans to support fractional scaling? — Any plans to support fractional scaling? My recollection is that Wayland itself fundamentally decided not to support fractional scaling. There's an extension? There's a... Of course there's an extension. So yeah, there's a protocol extension to do fractional scaling, and if WLToolkit wants to implement that, it can normally do so. And with the native Wayland mode it should already be better than the blurriness you sometimes can't escape today. The one thing about that, though: with fractional scaling — we don't have to deal with fractional scaling on Mac, because that's always an integer multiple. With the Windows look and feel we are still sorting out bugs trying to make that work, so we would undoubtedly find a whole bunch of new bugs with the GTK look and feel when we start doing that. So it's not just a simple matter of saying "oh, yes" — there'll be a mess to be sorted out that's separate from the Wayland project. I think we're probably out of time. Yeah.
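To make the screen-capture theme of this talk concrete, here is a minimal sketch of the java.awt.Robot path whose Wayland implementation moved to the ScreenCast portal. The class name and the headless guard are illustrative additions, not code from the talk; under the portal-based implementation the first capture is where the one-time consent dialog appears:

```java
import java.awt.GraphicsEnvironment;
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.image.BufferedImage;

public class CaptureSketch {
    static String run() throws Exception {
        if (GraphicsEnvironment.isHeadless()) {
            // No display at all (e.g. in CI): a Robot cannot even be constructed.
            return "no display, capture skipped";
        }
        Robot robot = new Robot();
        Rectangle screen = new Rectangle(Toolkit.getDefaultToolkit().getScreenSize());
        // Under the Wayland ScreenCast-portal implementation, this first call
        // triggers the consent dialog; after approval, captures proceed normally.
        BufferedImage img = robot.createScreenCapture(screen);
        return "captured " + img.getWidth() + "x" + img.getHeight();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```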
Zeroing and the semantic gap between host and guest
Hello. I want to start — hi, everybody. My name is Volker Simonis. Hi, guys and girls. My talk, my slides and my examples are on GitHub; I will show this link one more time at the end of the talk, so you can take a picture if you want. I'm currently having some fun in the Amazon Corretto team working on OpenJDK, and I did the same for quite some time in the past in the SAP JVM and SapMachine team. Today I want to tell you some details about running Java in containers — in two different kinds of containers. One is CRIU and the other is Firecracker. So what is CRIU? CRIU is Checkpoint/Restore In Userspace. That's functionality in Linux which allows you to serialize a whole process tree to the file system, basically to an image or a set of images, from which it can later be restored and run from the same state at which it was checkpointed. It only saves the anonymous pages of the process, so it's quite efficient — it doesn't save the shared pages; we will see what impact that has. And CRaC — that's Coordinated Restore at Checkpoint. It was mentioned before in several talks. That's a project in OpenJDK which has basically two goals. One is to create a userland checkpoint-and-restore notification API, which allows applications to react and take actions on a snapshot or restore event. So before the snapshot they can do certain things, like zero out secret memory and stuff like that, and on restore they can, for example, re-establish network connections which they tore down at snapshot time — things like that. It's gaining quite some traction in the community: new versions of popular frameworks like Spring, Micronaut, Quarkus, and even AWS Lambda support this API, so if you write applications or code for these frameworks you can already use it. The second goal of the CRaC project is to make the JDK itself snapshot-safe — the JVM as well as the JDK.
This means that it uses this notification API, for example in the JDK classes, to take the actions I just talked about. And this is sometimes useful or even required to react appropriately not only on checkpoint and restore, but also on clone — because once you've checkpointed an application, you can not only restore it, you can restore it many times, which I call cloning. And then it's important, for example if you have UUIDs or secrets, to either wipe them out, reset them or recreate them. CRaC uses CRIU as the tool to do the actual checkpoint-and-restore process, but as I said, the API can be used without CRIU itself, and we will see how it can be used with Firecracker, for example. So let's dive into a quick demo. I will use PetClinic as the example here. Oh, this is the wrong window — this one is for CRIU. So I just start Java with some default settings which I pick up from the Java options. It's basically Java 20 — or 22, I think — running with 512 megabytes of maximum heap, running the REST version of Spring PetClinic. It takes about 10 seconds to initialize, and then I use curl to access it, just to make sure that it works. And yes, you see — it really works. Now we use pmap to look at the RSS of the Java process: it's almost 450 megabytes, as you can see. And we can now use CRIU to dump this process. Oh, I think it's hard to see — I will scroll up. I just start the dump. So this is the command line to dump the Java process into a directory. Once we've done that, we can take a look at how much space the image uses, and you see that the image itself is smaller than the footprint of the process. That's because, as I said, the image only contains the private and dirty pages of the process, not the shared pages from mapped files, for example. And we can now restore this process: we use criu restore from the same directory, and it works in about 200 milliseconds.
And if we use pmap again, you see that after restore it uses less memory — about 20 megabytes less than before. Why is that? Again, because of shared pages. This is the diff of the whole pmap output between the initial process, before it was snapshotted, and after restore. And you see the basic difference is that for a lot of system libraries — libnss, for example — we used 140 kilobytes during startup, but these pages are not required anymore after restore. CRIU has still recorded that the process can access this memory, but until the process touches those pages, they won't be paged in. That's why the process uses less memory after restore, which is a nice side effect. Okay, so what other possibilities do we have? We start the application once again — it always takes about 10 seconds — and it works again. Now there is a trim-native-heap feature, introduced by my former colleague Thomas Stüfe, which basically frees the glibc malloc buffers. And this can have quite a significant impact on the footprint of the process. We see that the glibc malloc cache used about 60 megabytes, and if we run pmap again, we see that the RSS is now much smaller — about 60 megabytes less. I also experimented with a new option which zeroes the unused parts of the heap: it basically does a System.gc(), and all the unused parts of the heap are zeroed. If we do that and look at the memory footprint of the process, we see that the footprint got bigger, because parts of the heap which weren't mapped before now get paged in — but they contain only zeros. And I have a pull request for the CRIU project so that CRIU can recognize zero pages and ignore them while dumping. If we checkpoint now with this zeroing option, it's basically the same as before.
We just used the skip-zero flag, which is not standard yet, but I hope it will be integrated soon. And if we take a look at the image size, we see that it now gets considerably smaller — just 200 megabytes — because all the pages which contain only zero bytes are replaced by a reference to the kernel zero page. So the image file is basically a sparse file. And when we restart the process, the memory footprint will be smaller as well. We restore now from the new directory, and when we take a look at the pmap output, you see it's just 270 megabytes. This is all a little cumbersome — so why not use the CRaC release itself? The good thing is that CRaC does basically everything I've shown you can do manually with a normal JDK and a normal CRIU release, built right into a CRaC build of OpenJDK. We use the option -XX:CRaCCheckpointTo and give it a directory. So we run the application, and once it has initialized and we see it works, then instead of using the criu command directly we can use jcmd to checkpoint. I'll scroll up here: we call jcmd with the PID of the PetClinic application and execute the JDK.checkpoint command, and that did the checkpoint and also killed the process. We can now restore it again with the help of Java, using the second CRaC option, -XX:CRaCRestoreFrom, giving it the directory the image was saved to. This again takes just a few milliseconds, and we see it works — and the memory footprint is like before, about 280 megabytes after the first restore, so it's considerably smaller, because the heap was shrunk. What CRaC actually does is not zeroing the memory but unmapping all the unused parts of the heap; and I also recently added a feature to call into Thomas's trim-native-memory functionality to also free the glibc malloc buffers.
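On the application side, the CRaC notification API mentioned earlier lives in the org.crac package — a Resource with beforeCheckpoint/afterRestore callbacks (each taking a Context argument, omitted here), registered via Core.getGlobalContext().register(...). Since org.crac isn't bundled with a stock JDK, this sketch mimics the interface shape so it stays self-contained; everything except those org.crac names is invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.crac.Resource. The real methods are
// beforeCheckpoint(Context) and afterRestore(Context).
interface Resource {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// A component that must drop its "connection" before a snapshot and
// re-establish it on restore — and on every clone of the image.
class ConnectionHolder implements Resource {
    boolean connected = true;
    final List<String> log = new ArrayList<>();

    public void beforeCheckpoint() { connected = false; log.add("closed"); }
    public void afterRestore()     { connected = true;  log.add("reopened"); }
}

public class CracSketch {
    public static void main(String[] args) throws Exception {
        ConnectionHolder h = new ConnectionHolder();
        // What the runtime does around jcmd JDK.checkpoint and restore:
        h.beforeCheckpoint();   // the snapshot is taken here
        h.afterRestore();       // ...possibly many times, once per clone
        System.out.println(h.log);
    }
}
```

The "once per clone" point is why secrets and UUIDs belong in afterRestore: each restored copy of the image runs the callback and can regenerate its own.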
So to summarize: the Spring PetClinic application has a memory footprint of a good 500 megabytes, and after restore it's a little smaller, because not all the shared pages have to be restored. The image size is about 500 megabytes. If we zero out the unused heap and use the skip-zero flag of CRIU, the RSS goes up just before the checkpoint, but in exchange we get a much smaller image size and also a smaller footprint after restore. And it's the same with CRaC, because CRaC doesn't zero but unmaps the memory, which has the same effect. So you might wonder: why do we need the zeroing at all, then, and not just use CRaC? I hope that will become clear in my next example. For the next example I will use Firecracker, which is KVM-based virtualization. It's basically QEMU on steroids — a stripped-down virtualizer that has only a restricted set of devices: a network device, a block device. It has a REST-based configuration, it's written in Rust, and it's open source under the Apache 2 license. If you've ever used AWS Lambda or Fargate, for example, that's the technology which drives those offerings — every Lambda function runs in its own KVM-based Firecracker container. This is a diagram of how it works, but I think we don't have time to go into the details today. Instead, I want to show you how this works practically. I have prepared another window here. I use a script which basically starts Firecracker, and inside Firecracker it then starts the PetClinic application again. If you take a close look: this boots its own kernel — here Linux 6.1.7 — so it boots its own kernel in a virtual machine, and then inside that guest it starts the application. And as you can see, this virtual machine has its own network address, so we cannot use localhost anymore.
So we have to use the IP address of the virtual machine running on our host system, but apart from that it works exactly the same. And we now have two footprints to look at. First, the footprint of the Firecracker VM itself, which is about 670 megabytes — slightly bigger than that of the plain Java process. And we can also look at the size of the JVM inside the guest, and we see that the JVM size inside the Firecracker guest is about the same as when you run it locally, which is basically expected. We can now snapshot the whole Firecracker container — again, that takes just a few seconds — into this directory. If we want to see how big the snapshot is: it's about 670 megabytes, roughly what the whole Firecracker container had in memory while it ran. Just to demonstrate how it works, we can now restore from the snapshot. This basically spins up the whole virtual machine again, in about 200 milliseconds, and if we check how much memory it takes, you see it takes very little memory — only about 50 megabytes. That's because it only pages in the pages which are really required to start the virtual machine. CRIU paged in all the pages from its image file into the newly created process, whereas Firecracker does this lazily — that's why it initially needs so little memory. The funny thing is that if you look inside the container — SSH into it and do a pmap — the Java process within the Firecracker guest still needs basically 500 megabytes, but the VM itself has only paged in about 50 megabytes of memory. And now we just do a request. You see it's still working after restore — the network devices and interfaces are restored — and if we look at the memory footprint of Firecracker after the restore and the request, you see it gets bigger, to about 270 megabytes, which corresponds mostly to what the CRIU-restored process used.
So that matches the CRIU-restored Java process: about 270 megabytes seems to be what's required in order to process this request in PetClinic. So now, how can we make the image size of the Firecracker container smaller? Because almost 700 megabytes is quite big. So again we run Firecracker, and you can see that I started the Firecracker process with the CRaC checkpoint option, so I can use the jcmd variant now to checkpoint. So we have the Java process in the KVM guest. Again we use SSH to get into the Firecracker container, and then inside the container we execute jcmd to checkpoint. This is a special version of checkpoint: it doesn't make sense to use CRIU within the Firecracker container, because we will snapshot the whole Firecracker container anyway. So instead we use a special mode of CRaC which only executes all the callbacks, and thus all the optimizations, but doesn't actually call CRIU. Then we restart it inside, and when we look at the memory inside the Firecracker container, we see that it's about 290 megabytes — so the RSS went down. But unfortunately the Firecracker process itself still uses as much memory as before. And if we snapshot it — that works — but let's take a look at the size: it's still 600 megabytes. That's why I chose the title the way I did — that's what I call the semantic gap between the guest and the host. Even if I free memory in the guest container, the host kernel cannot know that these pages are not used anymore by the guest system; they still look dirty from its perspective. So if I snapshot the container — the whole VM — it has to save them to disk, which makes it inefficient. So there are different possibilities to cope with this. One is to use
the trim-native-heap and the zero-unused-heap options I showed you before, because then Firecracker has the chance to wipe these pages out of the image, which makes the image size smaller. I've summarized this here in this table: initially the Firecracker process needs almost 700 megabytes of RSS, and the JVM inside, like before, about 500. The snapshot is about 600 megabytes; after restore it's 50 megabytes, and after the first request again about 266. If we run this with CRaC and do the checkpoint, we can minimize the memory size within the VM to about 290 megabytes, but the snapshot size itself stays at 600. If we do the trim-native and zero-unused-heap, the memory consumption of the virtual machine goes up, because again we touch all the pages in order to zero them. But we get a much smaller image size, because now the virtual machine monitor again replaces these pages by the kernel zero page — so we get a much smaller image and a faster startup time. There's another possibility, and that's called init_on_free; that's a kernel option. Usually, when you unmap a page and give it back to the kernel, the kernel doesn't do anything with this memory — the kernel zeroes memory when you allocate it: when you mmap a new page, the kernel will give you a zero page, a page containing only zeros. But there is an option called init_on_free which does this the other way around. It's meant, for example, for security-critical applications where people want to make sure that once they release memory, it is immediately zeroed out. The thing with this is that the initial memory size of the container goes up, because when the kernel boots it zeroes the whole memory, so it touches all pages — the footprint of the Firecracker process is then about one gigabyte, which is what I gave it for the guest. On the other side, when we snapshot this,
we get down to about 420 megabytes, which is already quite nice. The last feature I want to mention briefly is ballooning. That's a special device inside the guest which can allocate memory — think of it like a file cache. It has a means to communicate back to the KVM manager on the host and tell the host that it can now reuse this part of the memory. So if we inflate the balloon, we can decrease the footprint of the whole virtual machine — but unfortunately the snapshot again gets bigger, because from the host side these pages still look tainted. So we have to combine ballooning with init_on_free; then we get all the benefits: a very small footprint for the running KVM process and the smallest image size. And with that I've come to the end of my talk. There are some references here, and I've linked to the examples I showed you — this is where you can find the presentation. So thanks a lot. Thank you.
Java… to unlock GPU acceleration for Polyglot Language Runtimes
Okay, can you hear me? Excellent. Thank you. So it's a pleasure to be here among all these amazing speakers today. I'm Thanos, a research fellow at the University of Manchester, and part of the TornadoVM team. Today I will talk about polyglot language implementations, which enable programming languages like Ruby and Python to run on top of the JVM, along with Java, of course. And I will try to take a step forward and show you how they can harness GPU acceleration from the JVM. I'll start with a little bit about polyglot programming, which has been around for many years, but in a sense has been reignited by the advent of the Truffle framework from GraalVM. It enables multiple programming languages to run on top of the JVM and interoperate. That means one Java class file can invoke a Python function, and a Python program can invoke a Java method. Well, this is very interesting, and it comes with many advantages. But what about GPU programming — GPUs from Java? Well, this is not a thing yet. That's what motivated us at the University of Manchester: we have done all this research over the past eight years and created TornadoVM. Here is a link to the TornadoVM resources, with all the presentations that explain the programming model — because my goal today is not to dive very deep into TornadoVM, but to present the interoperability with the other programming languages and how they can use GPU acceleration from the JVM. So, TornadoVM is an open-source plug-in for existing JDK distributions. It is compatible with JDK 21, as you will see later. And it has some very cool features. It has a platform-agnostic API, so developers don't need to know GPU programming or FPGA programming. It comes with an optimizing compiler: we extend Graal with new phases that can take Java methods and compile them to GPU code.
We have a feature of dynamic reconfiguration at runtime, which means that method execution can be migrated from a GPU back to the JVM and then go to an FPGA if that is appropriate. And with the latest release, 1.0, we have enabled support for off-heap data types, so data can be allocated off-heap with the Foreign Function and Memory API; this is the API that Maurizio described earlier today. So feel free to follow TornadoVM on Twitter, to engage with the website, and of course to fork and try our examples, which are open source on GitHub. I spoke a little bit about off-heap data types, so I'll give an introductory example, because I'm not going to dive very deep into the API. Here we see two snippets of code. On the left side, we see a main method that allocates a float array using primitive types, but it is allocated as an object, in a sense, on-heap. To migrate from such an allocation to the new allocation API exposed by the TornadoVM API, we have created a FloatArray object that internally allocates memory using a memory segment from the Foreign Function and Memory API, and it allocates this memory off-heap. This memory segment can be used directly from the GPU, without the need to worry about GC collections and that sort of thing. And the cool part is that even if you don't use GPU programming, even if you don't want to execute on GPUs, you can still use this API to allocate memory off-heap. Here is a link that explains more. I hope it's visible from your side; if not, you will find my slides online on the FOSDEM webpage. So the motivation for today is that GraalVM enables interoperability between programming languages like Ruby, JavaScript, and others, and TornadoVM enables hardware acceleration for Java. So what if we could combine them and harness GPU acceleration from all these programming languages that run on top of Truffle? Let's dive into the tech flow.
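The on-heap versus off-heap distinction can be sketched with the plain Foreign Function and Memory API (standard in JDK 22, preview in JDK 21). This is a minimal illustrative sketch, not TornadoVM's actual FloatArray implementation:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OffHeapFloatArray {
    public static void main(String[] args) {
        // Allocate 1024 floats off-heap. The GC never moves this memory,
        // so a GPU runtime could hand it to the device directly.
        try (Arena arena = Arena.ofConfined()) {
            int size = 1024;
            MemorySegment segment = arena.allocate(ValueLayout.JAVA_FLOAT, size);
            for (int i = 0; i < size; i++) {
                segment.setAtIndex(ValueLayout.JAVA_FLOAT, i, i * 0.5f);
            }
            float sum = 0f;
            for (int i = 0; i < size; i++) {
                sum += segment.getAtIndex(ValueLayout.JAVA_FLOAT, i);
            }
            System.out.println(sum); // 0.5 * (0 + 1 + ... + 1023) = 261888.0
        } // the off-heap memory is freed deterministically when the arena closes
    }
}
```

The confined arena gives deterministic deallocation, which is exactly the property the talk highlights: the lifetime of the buffer is decoupled from garbage collection.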
So in this slide, I present the software stack of GraalVM and Truffle. On the top, we see the Truffle framework and many implementations of polyglot runtimes, like GraalPy, GraalJS, and TruffleRuby, and others, because Truffle also enables programming language implementers to create their own languages using its Java API. So I have grouped Python, Ruby, JavaScript, and Node.js on this side of the slide. Beneath them, there is the Graal JIT compiler, the optimizing compiler from Graal. Java is also running on top of the JVM, of course. All these languages start in interpreted mode, and once they reach a hot state, the optimizing compiler kicks in. And the cool part of such a polyglot implementation, for the compiler enthusiasts, is that there is one Graal IR. The nodes are rewritten at runtime, which means it can adapt: if we kick in a Python function, the nodes can be rewritten, and the Graal compiler will pick that up and emit the assembly code that will run on the CPU. So this solution offers interoperability, and it offers execution across different CPU instruction set architectures. But what about heterogeneous hardware, like the GPUs and FPGAs that are available in some systems and servers? Well, there we have TornadoVM, which enables Java methods to be compiled for GPUs, FPGAs, and so on. TornadoVM has its own JIT compiler, which is an extension, a superset I would say, of the Graal compiler, enhanced with new compiler phases that automatically specialize the code of a method for GPU and FPGA acceleration. At the back end of the compiler, we have three backends at the moment: OpenCL, CUDA, and SPIR-V. Such a solution enables many things, and if you want to learn more about the APIs, you can scan this QR code.
And code that is implemented with TornadoVM can harness, besides the off-heap data types, execution with the TornadoVM profiler. If you want to learn more about the characteristics of your application, you can see how much data will be copied to GPU memory and how expensive the I/O is, because this can be critical for the performance of the system. And you can even customize how the data transfers are performed: for example, if you have a method that consumes read-only data, then maybe you need to copy the data only once, instead of copying it every time you execute the kernel. Okay, so let's jump to the deployment. As I said, TornadoVM is compatible with different JDK distributions, so it is not a JVM, it is a plug-in for JDK distributions. It can be seen as a library, in a sense, because it offers an API in Java. It is compatible with all these distributions, and on the other side we have the compiler backends that make it compatible with different heterogeneous hardware accelerators. We can emit vectorized code for multi-core CPU execution through OpenCL, and we can run on different GPUs and FPGAs. In this particular talk, I will focus on GraalVM, because we want to leverage polyglot programming, and on NVIDIA GPUs, because I have created Docker images that run on NVIDIA GPUs. Now, regarding the GraalVM deployment, I will focus in this slide on GraalPy, which is one implementation of a polyglot runtime. It is shipped in two different standalone releases: we have the native standalone, which comes with Native Image, and then we have the JVM standalone, which enables the execution of Python programs on top of the JVM and also includes the Graal compiler. The version that we tested is 23.1, because TornadoVM is compatible with this version of Graal. And here you can see that we have downloaded the community edition, tagged JVM, so we have the JVM standalone version.
Well, we need the JVM standalone because we want to run with TornadoVM, and TornadoVM extends the GraalVM compiler; that is the reason. The problem is that when we tried it, the JVM standalone is shipped with a compiler built as libgraal. That build does not include many of the compiler modules, and it broke things for TornadoVM when we tried it. This was done to keep the footprint of the image lower, which makes sense, but it broke compatibility with TornadoVM. The good part of this story is that the Graal community is very active in their Slack workspace, so we managed to figure out what the problem was. The bad part is that the solution was to build GraalPy and GraalVM from source, which was quite painful. And in order to spare anyone who wants to try this work that pain, we decided to build a Docker image that has GraalPy and TornadoVM inside, and we have also added the NVIDIA driver. So if you have a Linux machine, or any machine with an NVIDIA GPU, and you also have the NVIDIA Container Toolkit on that machine, you will be able to run this image. The Dockerfile and the image are open source on GitHub. And on the other side, you can see the QR code for the acceleration library: the code that we have implemented in the examples module of TornadoVM for the computation parts that we will offload to the GPU, like K-means and matrix multiplication. There are also other compute examples in the GitHub repository, and you can pull the Docker image from Docker Hub. So, let's jump into the examples. As you see here, we have Python and Java with TornadoVM: we have a Python program that imports Java and then loads the class from the compute-examples module of the TornadoVM repository. And in this Java class that we have loaded, there are two methods that can be accessed by the Python program.
The first one is setInputs, which sets the actual data points and the number of clusters that will be used for K-means. The second one is runWithGPU, which triggers the actual JIT compilation for GPUs and the GPU execution. On the other side, we have the Java TornadoVM part, where we use Java and the TornadoVM API to create the parallel implementations of K-means. In this slide, you see the steps for cloning the repository that contains this Python program, and we also see the Python program, kmeans.py. Beneath that, we have the invocations of the actual Java methods, and here is the link to the Java implementation of K-means. Now, before we jump into the Java part that contains the computation that will be offloaded to the GPU, we have setInputs, and I wanted to make a connection back to the off-heap data types. With new VectorFloat, this is an API type exposed by TornadoVM that allocates vector data off-heap. Then we have createMatrixOfClusters, which performs some initialization of the objects and also allocates other data, like the clusters, which are going to be allocated off-heap as well. And now we are ready to move to the actual computation part. On the left side, you see the runWithJava implementation of this method, and on the right side, you see the accelerated one using the TornadoVM API. As we see here, the actual computation in this method is performed by assignClusters, and the corresponding one on the right side is the TornadoVM implementation. In this one, I would like to focus on two parts. You can see the task-graph implementation: TaskGraph is an object exposed by the TornadoVM API, and in a sense, a task graph enables you to define what code will go to the GPU.
So: what the actual computation is going to be, and what data should be used on the GPU, the input data and the output data. In a sense, the task graph enables programmers to define what is going to go to the GPU for execution. Once we have done this, as you can see here, we can also define the data-transfer mode: how often we want input or output data to be copied back and forth from the GPU. Once we have defined that, we can move to the second part, which is the execution plan. The execution plan is another object that enables programmers to define how the execution will take place: it could be, for example, with the profiler enabled or disabled, or with a custom grid size defined by the programmer. And once we have defined how the execution will be performed, we are able to execute the actual task graph. It is executionPlan.execute() that triggers the actual execution of the code and the JIT compilation. The second time that we invoke the execution plan, the code will not be JIT-compiled, because it has already been compiled: the OpenCL or CUDA code will be retrieved from the code cache of TornadoVM. So now we can move to the actual example run. I have recorded a video that shows the execution of K-means and matrix multiplication, because on my MacBook I don't have an NVIDIA GPU. We fork the actual repository with the examples, then go inside and check out the FOSDEM branch. And this is the Python code that we saw earlier: first we load the class, and then we are able to invoke the Java code from Python.
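For readers without the repository at hand, the shape of the assign-clusters computation that gets offloaded looks roughly like this. Names and the 2-D point layout are illustrative, not the exact code from the TornadoVM examples:

```java
import java.util.Arrays;

public class KMeansStep {
    // For each 2-D point, find the index of the nearest centroid.
    // The outer loop has no cross-iteration dependency, which is
    // what makes it a candidate for GPU offload (TornadoVM would
    // mark it with its @Parallel annotation).
    static void assignClusters(float[] px, float[] py,
                               float[] cx, float[] cy, int[] assignment) {
        for (int i = 0; i < px.length; i++) {
            int best = 0;
            float bestDist = Float.MAX_VALUE;
            for (int k = 0; k < cx.length; k++) {
                float dx = px[i] - cx[k];
                float dy = py[i] - cy[k];
                float d = dx * dx + dy * dy;   // squared Euclidean distance
                if (d < bestDist) { bestDist = d; best = k; }
            }
            assignment[i] = best;
        }
    }

    public static void main(String[] args) {
        float[] px = {0f, 0.1f, 10f, 10.2f};
        float[] py = {0f, 0.2f, 10f, 9.9f};
        float[] cx = {0f, 10f};   // two centroids: (0,0) and (10,10)
        float[] cy = {0f, 10f};
        int[] assignment = new int[4];
        assignClusters(px, py, cx, cy, assignment);
        System.out.println(Arrays.toString(assignment)); // [0, 0, 1, 1]
    }
}
```

In the TornadoVM version, this method body is what the task graph's task(...) entry points at, with the point and centroid arrays declared as device inputs.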
And here we will run, first, the Java implementation, and then the GPU-accelerated implementation. We can also pull the Docker image that we have created, and in the repository we have a launcher script. At first, we run tornado --devices to query how many NVIDIA GPUs exist in the system, and here is the GPU that exists in my machine at home. Once we have done this, we run the Python program with Truffle: tornado with the Truffle flag and python will run the actual Python program. We see here that at first it prints Hello World from Python. Then we run the Java implementation, which is the sequential one, runWithJava, and then the runWithGPU method. As we see here, the first one takes one second, and the second one 140 milliseconds. Here we try the same example, but with the thread-info flag, which enables printing of the actual threads that have been used on the GPU. As we see here, the number of data points that we passed with setInputs has become the global thread size deployed on the GPU. Now we move to the second example, which is matrix multiplication with TornadoVM. In this example, we run the matrix multiplication five times, and we see here its execution time on the GPU. The first time it was half a second, and then it drops to three milliseconds. This is because the first execution also involves the JIT compilation, which is expensive; from the second and third runs on, the execution time saturates, because it is just the actual launching of the code. Okay, I have shown you an example of Python with GraalPy, but this is not the only one. We also have Docker images for the other programming languages, for JavaScript and Ruby, and you can find more details in these links, where we have a blog post that also explains the polyglot programming with TornadoVM.
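The matrix multiplication kernel in that example is, at its core, the classic triply nested loop. A plain-Java sketch (illustrative, not the repository's exact code) of what gets offloaded:

```java
public class MatMul {
    // C = A * B for n x n matrices stored row-major in flat float arrays.
    // The two outer loops are independent, which is why TornadoVM can
    // map them onto a 2-D grid of GPU threads.
    static void multiply(float[] a, float[] b, float[] c, int n) {
        for (int i = 0; i < n; i++) {          // TornadoVM: @Parallel
            for (int j = 0; j < n; j++) {      // TornadoVM: @Parallel
                float sum = 0f;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }

    public static void main(String[] args) {
        int n = 2;
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = {5f, 6f, 7f, 8f};
        float[] c = new float[n * n];
        multiply(a, b, c, n);
        // [1 2; 3 4] * [5 6; 7 8] = [19 22; 43 50]
        System.out.println(java.util.Arrays.toString(c)); // [19.0, 22.0, 43.0, 50.0]
    }
}
```

The half-second-then-three-milliseconds pattern in the demo is the usual JIT shape: the first call pays for compilation to OpenCL or CUDA, and subsequent calls reuse the cached kernel.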
There you can find the other examples, so now I will jump to the summary of my talk. As key takeaways, I would like to emphasize that GraalVM and Truffle enable Java interoperability with other programming languages that run on top of the JVM. TornadoVM offloads Java methods to GPUs, FPGAs, and multi-core CPUs, so you can create parallel implementations. TornadoVM offers a Java API, so programmers don't need to know GPU programming; it is a Java way to express parallelism. And we also have the new off-heap data types. So finally, yes, it is possible to create high-performing implementations of code for data-science libraries in Java and reuse them from other programming languages. This is a slide that lists everyone who has contributed as research staff or students at the University of Manchester, and these images are from our campus; the surprise is that it was not raining when they were taken. I would like to invite you to join our community: follow us on GitHub, join us in the TornadoVM Slack workspace if you have questions or want to interact with the team for discussions, and try our examples on GitHub. And in my last slide, I would like to acknowledge all the research projects that have supported this work on TornadoVM: ELEGANT, ENCRYPT, TANGO, AERO, and INCODE. With that, I conclude my talk, and I think we have time for one or two questions. Okay, I've got the mic here, but first: I lived in Manchester for five years, and it doesn't always rain. Just mostly. Just mostly. Thanks for a great talk. One of the first pictures you showed had TornadoVM in parallel to the Graal JIT, using JVMCI. So do you interact directly with JVMCI for generating code? Correct, yes. JVMCI enables other JIT compilers to be hooked into the JVM, and that's how we run, because we extend it. So do you work with the standard JVMCI in upstream OpenJDK, or do you need the labs JDK with the latest JVMCI changes?
Because the Graal JIT compiler, as far as I know, requires the labs JDK with the latest changes. We work with the standard JVMCI, yes. Thank you. Thank you. So when you write the kernel code in Java, is it usually high-level code that you write, or do you try to write optimized code in Java? Usually when you write, let's say, CUDA code, you try to write something very specialized, using warp intrinsics and that kind of stuff. Is that in scope for TornadoVM, or not so much? That's a great question. To answer it: we do both, so we have two APIs. One is created for Java programmers. Say we have a computation with for loops; this is something you can parallelize if you don't have data dependencies. We expose an annotation for this case, similar to OpenMP, so you can add @Parallel to the for loop to give a hint to the compiler that this can run in parallel, and it will create parallel implementations in OpenCL or CUDA. The second part is that if you are familiar with OpenCL and CUDA and you want access to low-level intrinsics, for example to use barriers or to allocate local memory, then we have a second API, called the Kernel API. With that, you can access pretty much every intrinsic that exists in OpenCL and CUDA programming from Java. Personally, I have used the second API to port existing OpenCL kernels to Java with TornadoVM.
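The kind of loop the @Parallel annotation targets is one with no cross-iteration dependency. On the plain JVM, the multi-core analogue is a parallel stream; the comment below marks where TornadoVM's annotation would go, but this sketch itself uses only the JDK:

```java
import java.util.stream.IntStream;

public class VectorAdd {
    // Each iteration touches only index i, so iterations are independent.
    static void addSequential(float[] a, float[] b, float[] c) {
        for (int i = 0; i < c.length; i++) {   // TornadoVM: @Parallel
            c[i] = a[i] + b[i];
        }
    }

    // Multi-core CPU analogue using only the JDK: the same loop body,
    // with iterations distributed over a parallel stream.
    static void addParallel(float[] a, float[] b, float[] c) {
        IntStream.range(0, c.length).parallel().forEach(i -> c[i] = a[i] + b[i]);
    }

    public static void main(String[] args) {
        int n = 1 << 10;
        float[] a = new float[n], b = new float[n];
        float[] c1 = new float[n], c2 = new float[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }
        addSequential(a, b, c1);
        addParallel(a, b, c2);
        System.out.println(java.util.Arrays.equals(c1, c2)); // true
    }
}
```

The same independence property is what lets TornadoVM's compiler map each iteration to a GPU thread instead of a CPU worker.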
How to bring up GCC for your new chip
Okay, ladies and gentlemen, if you'd like to get yourselves settled down: because of the way we're running this room back to back, my talk has already started, and I haven't got much time, but we'll do what we can. So, yes, that's everything that makes up the GNU toolchain. I'm going to go through some of these slides very fast, because it's reference material, so you can go back and look at the video afterwards if you want to check something. This is only going to look at GCC: I'm not going to worry about the assembler or any of the other stuff, just the compiler and how you add support for a new chip. So: how you get the back end up and running, where you can get more information, and what the key things you need to do are. What I hope is that at the end you won't be able to write a new compiler, but you'll know where to get started. So first of all, sources of information. There's loads of theory behind compilers; there's an excellent beginner's textbook there, which you can still buy second hand, and I believe someone bought one for a penny on Amazon. I haven't used the second one, but it's strongly recommended by someone else there. And this is the Bible: if you've got a lot of money you can buy the one on the left, and if you haven't got so much, the one on the right is still readily available. But this is what we're going to worry about today: the GCC internals manual. Everything you need to know is there; some of it's out of date, but it's generally a pretty good document, and it's online, so you can just go and get it. So, we've got a new chip. Our new chip is an entirely fictional architecture, taken from the textbook I showed earlier: it's a simple byte-stream architecture used just as a target you can compile to for demonstrating how to write a compiler.
So, we've got arithmetic, logic, shifts, the ability to store and load, and some branching, plus a branch-and-link so we can do subroutine calls. There are all the details of it, and we'll come back to it. So, getting started. First of all you need GCC, so you can clone it; there's a mirror on GitHub as well. You've seen this from Dave: here's the structure, and the bit we're going to be concerned about, within gcc, is primarily the config directory, because that's where you put the configuration for a new back-end architecture. There's one for RISC-V, there are dozens of them there, and we're going to add one for VAM. If you were to look in the RISC-V one, you'd find these four key files; there are loads more in the RISC-V directory, but: you have a .h file, which is where you define a lot of parameters that say what my back end looks like, you know, how big's a char, how big's an int, and so forth. You have a .cc file, which is where you put C++ code, really helper code, to get you off the ground; you need hardly anything in the .cc to get started. The big one, where we'll spend quite a lot of time, is the machine description, the .md file. It's the thing that describes what your architecture looks like, and GCC will then pick that up and use it to be able to compile to your target. It's written, nominally, in a dialect of Lisp, Scheme-like. And lastly there's a file called .opt; you don't actually even have to have a .opt, but it's where you put target-specific options. Our architecture is going to have an option that says you can have soft multiplication, where you do multiplication in software, or hard multiplication, where you actually generate multiplication instructions. So first of all, how do we configure GCC for my new target? Well, we actually need to go into the whole autoconf system and add it there.
At the top level of the repository you'll find a file called config.sub. Now, that is actually pulled in from a separate project, so if you're doing this properly you would go to the project listed there and make your change there. But I'm just going to hack it today: if you look in there you'll see case $cpu, where all the CPUs are listed, and I'm just going to add a line for VAM, our architecture. So now the automake system will understand about VAM. Then inside the gcc subdirectory, GCC proper, there's config.gcc, and that's where you put all the GCC-specific configuration. Now, the full name of our compiler will probably be vam-unknown-elf-gcc, because we put the full triple in front: vam, then whatever you like, then elf will match that. So if you go and say, I want to configure for that target, what do I define? There's a whole load of variables you can set to describe your target. The thing is, you don't really need to put anything, because it will already know: if my target is vam, there must be a vam.cc, a vam.h, a vam.md, and maybe a vam.opt. I'm going to say I actually want one other file, because this is bare metal: I'll take the standard elfos.h file for bare-metal, ELF-based operating systems and add it to the list, and that's the tm_file list of files that make up the architecture. That is all I need to do for GCC to know about it. And now I can say go and configure GCC, a bit like Dave did, but this time my target is going to be vam-unknown-elf, and it will configure for that. When we've finished, it will get installed under the prefix /opt/vam. We'll do it without headers, just to keep it simple.
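Putting those configuration steps together, the sequence looks roughly like this. This is a sketch assuming the layout described in the talk, a gcc source checkout next to a separate build directory; the vam triple is the talk's fictional target, so these commands will not run against stock GCC:

```shell
# In config.sub (a hack for the demo; properly, patch the upstream
# config project): add "vam" to the list under "case $cpu".
#
# In gcc/config.gcc, a vam-*-elf* case adds the bare-metal ELF header
# to the tm_file list, roughly:
#   tm_file="elfos.h ${tm_file}"

mkdir build && cd build
../gcc/configure --target=vam-unknown-elf \
                 --prefix=/opt/vam \
                 --without-headers \
                 --enable-languages=c \
                 --disable-bootstrap
make all-gcc
```

At this point the build fails, as the talk goes on to describe, because the back-end files it expects for the vam target do not exist yet.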
We'll just do the C language and, as Dave said earlier, disable the bootstrap, so just stage one, which is a plain C compiler; there are loads more options, and we'll come back to that later. Then I can just say make all-gcc, and lots and lots happens, and then it complains: ah, I can't find vam.md, the machine description. Because I didn't actually create a machine description; I just told it, here's my machine. So we're going to have to do something about that. Let's start adding those files in, beginning with the header file. So let's create our configuration directory: we come out of our build directory into the source directory, create a subdirectory within gcc/config for VAM, our architecture, and I'm just going to create empty files: vam.cc, vam.h, vam.md, and vam.opt. Come back into our build directory, make all-gcc again, lots more happens, and then I get an error message. It says: somewhere deep inside the GCC world, I haven't found a definition of FIRST_PSEUDO_REGISTER, and maybe you meant FIRST_VIRTUAL_REGISTER? That's actually one of the variables I have to define in the .h. In the .h there's a whole load of macros I've got to define that it will need. Okay, so here's an example of what goes in vam.h. You define TARGET_CPU_CPP_BUILTINS for the built-ins you want to appear: you know that when you compile for a particular architecture in GCC, there are some predefined macros, including one that tells you what your architecture is. So we want __VAM__ in capitals and __vam__ in lower case defined, so if you're writing code you can put #ifdef __vam__ and put your VAM-specific code there. And there are a couple of asserts: I assert the CPU is vam and the machine is vam. Okay, so what goes in the header file? There's a whole section on this in the internals manual.
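The built-ins part of vam.h that the talk describes might look like this; this is reconstructed from the description, following the pattern other ports use, not the speaker's exact file:

```c
/* vam.h (sketch): predefined macros and assertions for the VAM target.  */
#define TARGET_CPU_CPP_BUILTINS()               \
  do                                            \
    {                                           \
      builtin_define ("__VAM__");               \
      builtin_define ("__vam__");               \
      builtin_assert ("cpu=vam");               \
      builtin_assert ("machine=vam");           \
    }                                           \
  while (0)
```

With this in place, user code can guard target-specific sections with #ifdef __vam__.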
You'll be here till 2057 if you try to put all of those in. The easy approach, what we all do, is to copy an existing architecture and hack it around. OpenRISC is a really good one: it's quite small, and Stafford Horne knows what he's doing, so it's a good starting point, and it's what I used. The associated implementation code goes in vam.cc, and it's things like data storage, data types, the register model, the ABI implementation, and all the constants that define those. So here we are, here's my storage layout: the number of bits in everything, what boundaries I'm aligning on, the sizes of all my data types, what the ABI looks like. I've got a comment to say what it does, and then I define FIRST_PSEUDO_REGISTER: I've got a total of 33 real registers, and anything beyond that is a pseudo register. I'm not going to go into pseudo registers. I've got my 32 real general registers, and I've got my status register. I don't have the program counter as a register, because it's not actually exposed in my architecture; nowhere am I treating it as a real register, it's just something behind the scenes. I've got names for all my registers, and some of them have fixed purposes: r0 is always tied to zero, r1 is the stack pointer. So I've got an array telling me which of them have predefined uses, and the last one, the status register, has a predefined use. And then, what are good registers to allocate? When GCC needs to use a register, what's a good one to choose, so I don't end up choosing one I then have to worry about saving and restoring? I can give it a priority order: what order do I want registers allocated in.
Then we talk about register classes. Now, this is very simple because we haven't got many registers. Normally you would separate your integer registers from your floating-point registers, and then you can tell GCC to do different things depending on whether you're doing floating-point or integer arithmetic. In our case it's an integer-only machine anyway, so we've just got GENERAL_REGS, and we've got one class for the status register, which is FLAG_REGS. You always have a NO_REGS class, which is no registers, and an ALL_REGS class, which is all registers, and you define LIM_REG_CLASSES as the last thing in that enum, because it tells you the size of the enum. From that we can define a macro called N_REG_CLASSES, and we define the names of the classes, which are just text strings. Lastly, for each of those classes we have to provide 33 bits saying which registers are in the class. For NO_REGS none of the bits are set; for the general registers all the bits are set except the last one (the low bits are on the left and the high bits on the right); the status register class has just that one bit set; and ALL_REGS has all the bits set. And you've got a macro that, given a register number, tells you which register class it is in. There's loads more in there; you can read through it and see what happens. So we say make all-gcc, and even more happens, and then it complains that it can't see SP_REGNUM. Now you think: ah, didn't I define a stack pointer? I did, but I decided something else, because the point is that this is not SP_REGNUM as known by a header; this is SP_REGNUM from the machine description. Some of these things are actually not defined in the header, they're defined in the machine description.
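In header terms, the register-class machinery just described might be sketched like this; this is reconstructed from the talk, not the speaker's exact file. With 33 hard registers the contents masks need a second word, shown in the two-word form GCC uses:

```c
/* vam.h (sketch): register classes for 32 general registers (0-31)
   plus the status register (32).  */
enum reg_class
{
  NO_REGS,
  GENERAL_REGS,
  FLAG_REGS,
  ALL_REGS,
  LIM_REG_CLASSES          /* last entry: gives the size of the enum */
};

#define N_REG_CLASSES ((int) LIM_REG_CLASSES)

#define REG_CLASS_NAMES \
  { "NO_REGS", "GENERAL_REGS", "FLAG_REGS", "ALL_REGS" }

/* One bit per hard register, low word first.  */
#define REG_CLASS_CONTENTS                          \
  {                                                 \
    { 0x00000000, 0x00000000 },  /* NO_REGS      */ \
    { 0xffffffff, 0x00000000 },  /* GENERAL_REGS */ \
    { 0x00000000, 0x00000001 },  /* FLAG_REGS    */ \
    { 0xffffffff, 0x00000001 },  /* ALL_REGS     */ \
  }
```

The companion REGNO_REG_CLASS macro then maps a register number back to the class it belongs to.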
So if we look at how code generation works in GCC: it's generic, it's a pattern-matching compiler. It looks for patterns and replaces them with new patterns. That's how it does code generation, and it's actually also how it does optimization. What we have to do is give it all these pattern templates to generate from, and that is what the machine description is; and when we come to optimization, replacing patterns by better patterns is exactly what you do. We heard from Dave about the different representations: you've got GENERIC, then GIMPLE, then RTL, and we're really worrying about how you get down to the RTL level. A side note here: GCC has its own names for type sizes, everything from quarter-integer, QI, at eight bits, up to double-integer, DI, and tetra-integer, TI, plus single and double float, and you can have unsigned variants of those. These names come up all the way through, so when you see them, they're just sizes of things. So how do you get GIMPLE down to RTL, which you can then generate code from? Well, there's a set of standard named patterns, and all you're going to do in the machine description is tell it: given addqi3, that's add quarter-integer with three operands, two source operands and a destination. They're mostly three-address code like that: add two quarter-integers, and so forth.
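Concretely, a standard named pattern in a machine description looks roughly like this; a generic three-address addsi3 sketch in the style of riscv.md, not the two-address VAM form that comes up later:

```lisp
;; Sketch of a standard named pattern (generic three-address form).
;; GCC knows that "addsi3" means: add the two SImode sources
;; (operands 1 and 2) and put the result in operand 0.
(define_insn "addsi3"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (match_operand:SI 1 "register_operand" "r")
                 (match_operand:SI 2 "register_operand" "r")))]
  ""
  "add\t%0, %1, %2")
```

The RTL template in the square brackets is what the pattern matcher works on; the final string is the assembly template, with %0, %1, %2 standing for the matched operands.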
There's a whole set of these to define, and once you define them all, GCC has all the patterns and will generate code for your machine. Quite a lot of them have to be defined, but some of them aren't needed: you don't need atomic or vector patterns if you haven't got atomic ops or you're not a vector machine. So, as I say, when we build the compiler, all that Scheme-like description of the patterns is parsed and turned into C, which is then compiled and put into your GCC compiler. There's a whole huge chapter on this in the internals manual, on machine descriptions, but we will do the same thing: copy an existing machine description and hack it, so we will take OR1K again. So let's have a look at a machine description. I'm taking these from riscv.md, just because I want to show a lot of ideas quickly, and they're richer in the RISC-V one than in my simple one. At the heart of it is define_insn, which gives the semantics of a pattern this architecture supports. The name can be anything, but obviously we care about the predefined ones, and addsi3 is one of them; that's how GCC can relate the RTL to that name. So the first thing you see is match_operand: that tells you how to match the first operand, then the second operand, and so on. Each match_operand carries a single integer, the number of that operand, so we've got 0, 1, and 2, and then a bit about what it is: register_operand says I can be any register. It's an allow-or-deny gating function, a predicate; you can write your own predicates as well, but there's a whole load of standard ones. And then we have constraints. The constraint here, =r,r, is saying I'm giving you two alternatives, and they both happen to be r in this case, but we'll explain why that is. The equals means I'm writing to it, so in both alternatives I'm writing to a register. Now, the reason that matters is that these alternatives go together: operand 1's constraint is register,register, and operand 2's is register or i for immediate, and you have to read them as though they were in columns. So we're looking at one alternative where the first operand is a writable register and the other two operands are registers, and a second alternative where the destination is a register, the first source operand is a register, and the second operand is an immediate. If you think of them in columns, that's how to think of them. The next line, which is just empty here, is often a global predicate, and that could be where you put one of your flags: you might have a predicate like, is this soft multiplication, in which case I can't generate a multiply. Just empty means true, always do this. Then there's the code-generation template; it can be a C fragment, and in this case it says: if it's a 64-bit architecture, generate the addw string, and if it's a 32-bit architecture, it's just the generic add instruction. The % elements there, %0, %1, %2, refer to operand 0, operand 1, and operand 2. And at the end you can add some attributes. We're not going to worry about attributes in VAM; attributes are useful because they tag the insns, and sometimes code-generation options and optimizations can take advantage of them. So let's look at what we did for VAM. First of all, you define some constants, and that's where SP_REGNUM and the numbers of the other key registers are defined. Then we've got a very simple insn: it's called nop, and it doesn't have anything to match, really, it's just (const_int 0), and the text string it generates for code generation is just nop. Here's a more complicated one, addsi3; you've seen
that bit before we've only got one sort of ad okay the first operand is destination register the second operand is a register and because VAM is a two-address machine okay so add a B means add a to B and put the result in B we actually have to say the destination you see I've constrained it to be zero that means it's got to be the same as opera and zero which is the destination okay and I've got the same for sub i and the template to generate the code okay so the standard names the standard MD patterns machine descriptions and output statements how you do the assembly language templates and you've got some useful files in there and I say the open-risk one is a good example that's pretty simple okay so what about the option file VAM.opt there's a whole spec on this and we're going to allow it to have hard division soft division hard mode whether or not you generate multiply and divide instructions and they have a fairly simple pattern of explaining what it is and a bit of descriptive text okay okay putting it all together so we do make all GCC and almost everything almost everything happens and away it goes and it blew up cannot stat 10 permit 10.cc you know I have no idea what this means it's in deep in the bowels it's journey mitt so what do we do about this I asked for help and so thank you to match a Rizzicky who came up and said there's a trick you can tell it to emit fewer partitions it might be a bug and so I tried with emitting five partitions and it all worked fine okay and actually I ended up with a GCC because X GCC is what the GCC within the build tree is called and it ran it and it ran itself test it said let's check if the compiler is any good and then I got an internal compiler error because I haven't actually finished doing my compiler so VAM.md is missing some patterns and it's essentially blown up because it couldn't work out how to find a pattern to get the code down there for one of the test cases but I do actually have a working compiler well I 
have a working compiler in the sense I've got a compiler I can run it will crash whenever it compiles things but that's actually that's actually quite an achievement so now I need to just debug it okay but I have actually got a GCC build so I'm Dave covered this how to dump stuff we are so you didn't know you just mentioned so you can dump all the different intermediate codes but what Dave did cover was the wrapper option and the wrapper option is your friend that's where you can go inside we've talked about the wrapper option and how it puts things here actually you can do the same sort of thing as you can do gdb args and then I just copied that error message I got with the internal compiler error and now I can run under there and I can run it and I can generate my internal compiler error under gdb but I now have the ability now to do to debug it okay self-test even better so there we are and this is why we work as a community because we are so make self-test type in gdb we'll do all this magic for you okay so there was a bit of smoke mirrors in there I created a minimal vam.cc guess what I copied it from there was a bug in vam.op.ul's I had to hand create that in the hack round that and that I think is a bug I had to create vam.com.cc and I'm not quite sure why I had to do that but everyone seems to do it except open risk and I had to make it and I just took the template one I used it I added vam to the documentation that's a good thing I also compiled with enable maintainer mode which is used to regenerate some files I'm not that was when I was trying to fix the url's problem I'm not sure I actually needed to do that okay but that's what I did to get there so what next and the reason this is rather rushed is it's part of our three-month graduate training course this stuff was put together by my colleague Max and Blinoff a few years ago it's a five-day part of the course um for eight hours a day with exercises and so I've compressed it into 25 minutes um but 
hopefully it gives you just a little bit of a touch on how you can get started and there's enough hooks in there that you'll get off the ground and if you get stuck ask for help we're a friendly bunch I have an ambition one day I'm going to create a full public tutorial on GCC that's probably my retirement project but in the meantime everything I've just shown you is on github thank you okay I've got I've got two minutes for questions yeah are there any ready-made um CPUs that are a bit weird like um big guitars that we can use and play around for fun yeah so the question is are there any ready-made ones there are loads I mean there are what 50 or 60 backends for GCC and some of them are really weird and some of them very normal I would look at open risk because it's relatively recently done it's well done it's quite small because great excellent so the comment was about working on power isa power power isa and adding the scalable vector um functionality into the back end please join in ask for help scalable vectors are the flavor of the month at the moment so you said that we have to add the architectural specific stuff in the machine description I was wondering if there is a minimum set of touring to complete that you say that you do the assignment the addition and this yes question for the audience then we start the question is yeah our time's up is what is the minimum set in the patterns I don't know but if someone could tell me I couldn't find that thank you thank you
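To make the addsi3 walkthrough above concrete, here is an illustrative define_insn in the style of riscv.md. It is reconstructed from memory for this transcript, not quoted from the talk's slides, so treat the predicates and the output template as approximate:

```lisp
(define_insn "addsi3"
  [(set (match_operand:SI 0 "register_operand" "=r,r")
        (plus:SI (match_operand:SI 1 "register_operand" " r,r")
                 (match_operand:SI 2 "arith_operand"    " r,I")))]
  ""                          ; empty condition means "always available"
  "@
   add\t%0,%1,%2
   addi\t%0,%1,%2")
```

Reading the constraints column-wise gives the two scenarios from the talk: register/register/register in the first column, and writable register, source register, immediate in the second.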
What can Compiler-Explorer do for GCC
In parallel, let's get started here. Up next we have Marc Poulhiès, if I pronounce it right. Yeah, that's mostly correct; even for French people it's complicated. He also worked on the GCC Rust front end for a bit, which I think is how he got involved in Compiler Explorer, and now he's telling us what Compiler Explorer can do for GCC developers. Thank you. So my name is Marc, I'm a compiler engineer at AdaCore, and today we'll talk about Compiler Explorer in the context of GCC. So, what's Compiler Explorer? For people who may not know it, it's a website where you can enter a program in some language, for example C on the left, pick one or more compilers, and get the corresponding assembly. That's the very basic usage. Compiler Explorer was created almost 10 years ago by Matt Godbolt; that's why you may know the website as Godbolt, because he was hosting it on his own domain and the name stuck, so people refer to it as Godbolt. We are now a team of eight people. We host around 2,000 compilers, support more than 60 languages, and handle around four million compilation jobs per week, and thanks to our sponsors and patrons we're able to pay the bill of around $2,000 every month. In the interest of time I'll only showcase a very small subset of what the website can do; you should go and explore it yourself, experiment, and see if there's something you find useful. At the end I'll answer questions and maybe collect feedback or ideas for the future. So, the basic use case; I'll try the live site if it works and isn't too slow. Okay. Say you have a C file; you can add a compiler like this, and by default it's TCC. You can see that the assembly is colour-coded to match the user source code on the left. You can also execute the code: for example, here you can see the printf output displayed at the bottom.
You can also ask it to stop after the assembler and instead get a view of the object file; you can see here that you still have the relocations in the file. Or you can ask for a full link of the program, and you can see that the relocations are gone, resolved. The last thing I wanted to show here is that you can share this by clicking Share: you get a link, and if you send that link to someone and they open it, they get the exact same setup and layout. It's very useful for sharing code, bugs and things like that. The next use case is when you need multiple files. That's the case, for example, in Ada, where you have to have different files for a package: the foo package is in the two files named foo.adb and foo.ads, and we have a main unit called example that uses the foo package, as you can see here. You should see I'm also using an input file called input, so you can put text files in as well if you need that. Then you add a compiler as before (it's not compiling because I need Ada 2022) and you get the same features as before: you can execute, get the object files, share the session; everything works as before. So those are the very basic use cases. We support many more features: you can build your program using CMake; we have GPU support, so you can execute code on actual GPUs and see both the target and the host view of the code; we have diff views for assembly, so you can compare the output of different compilers, or of the same compiler with different options; we support libraries and environments; there's documentation for some ISAs; and many more. Please try it yourself and experiment. Now, the first feature that can be useful for compiler development is the conformance view. Say you have a bug report, in this case from the GCC Bugzilla: an internal compiler error.
You can use the conformance view to find when it started regressing. You add a conformance view and from there you can add some compilers: GCC for x86, for example trunk. You can see this is red, so there's an error, and if you hover on the right you can see the backtrace, so it's an internal compiler error. From there you can just duplicate the entry and check with a different compiler, say GCC 13: still failing. You can do that for all the compilers; I won't do it now because we're short of time, so I'll use a local instance, which only has a subset of compilers but is fast. You can quickly see where the problem started, around the 13 release. The nice part is that if you modify the code to see whether it changes anything, the view updates itself, so you can play around and try out better ideas or things like that. And again, you can share the session and send it to anyone. Something I use during my day job, where I need to test against different compilers, targets or languages: I create empty templates, meaning I set up a conformance view with the compilers I'm interested in for a given target and language, and leave the code mostly empty. Whenever I need to test something against, say, C++ for x86 targets, I click the share link, it opens up, I copy-paste the code, and I directly have the result; I don't have to add the compilers by hand every time. So that's it for the conformance view. Very recently, Jeremy on the team added support for GIMPLE, which means you can now use GIMPLE as any other language in Compiler Explorer. Maybe that's useful for some of you; you can just copy-paste and use any GCC starting from the 9 release. We also have support for the dumps Dave and Jeremy talked about previously. So this is C; I can add the compiler, and then add a GCC Tree/RTL view.
From there, you have access to all the dumps that GCC emits, like this. If you need to, you can filter between the tree, IPA and RTL passes, and you have access to all the options you would have from the command line. And again, if you change something like the optimization level, the view should refresh itself; believe me, it should work. That covers the most used dumps, but if you have debug dumps from front ends (for example, I've added the one for Ada) we can also support you. You simply have to ask, and maybe we can guide you or we can do it ourselves; just ask and we'll be happy to help. Something else we offer is nightly compilers. For GCC, we build a subset of supported targets from the GCC master branch. We also build from different repositories, for example the COBOL one, or the Rust front end from GitHub. We can build topic branches if you have some you would like to see on the public website, or more complex stuff like rustc_codegen_gcc, where you need to take rustc, build GCC, package it all up and publish it on the website. So again, ask and maybe we can help. We provide an API giving access to the basic features, mostly compile and execute, so you can use it from a shell script to run tests, or embed it in an application, plugin or IDE. For example, this is a screenshot of a tool I've made for work: I can run against different compilers using filters from the command line, and I find it very useful, so maybe it could be of some help to you. The last thing I wanted to mention is how easy it is to create a local, private instance: it's mostly git clone and make, and it will do some npm magic for you. It binds to localhost, so that's fine if you use it yourself, but if you want to run it for a team, multi-user, please take extra care, because this is basically remote code execution as a service.
You are asking people to enter code in their web browser, click execute, and do anything they like. So for yourself, easy; for multi-user, not so easy. We also have ideas for new features we'd like to have in the context of GCC. For example, for Clang we have a nice view listing all the optimizer passes, where you can see how each pass modifies the IR, with a nice diff view. It would be nice to have the same thing for GCC, and maybe a better diff view where you can diff the RTL directly. Someone asked for more Windows compilers; maybe you have other ideas. So this is the end. Again, that's only a very small subset of the features, so go and experiment by yourself. We accept any kind of contribution: code, feature requests, anything. Thank you, and I'll be happy to answer questions. So, one question: how do you manage security? I don't. We have people working on this, mostly Matt, Partouf and Austin. They do very complex stuff that I don't understand, because it's really not my domain, but everything is sandboxed, and the nodes where you execute are mostly empty, so even if you escape the sandbox there's nothing to steal, and if you crash the machine we just boot a new one. That's as far as I can give any details, but you can contact them directly and they'll be happy to answer. Okay. Thank you. Thank you.
Can the mold linker be /usr/bin/ld?
So up next is Rui; I hope that's reasonably correct. Yeah, that sounds right. The linker person, let's just say that. Yeah. Now talking about whether the mold linker can actually be used as the system linker. Yes. So thank you for coming to this talk. My name is Rui Ueyama, and I'm the creator of the mold linker as well as the LLVM lld linker. So I wonder if you are using my linkers: raise your hand if you're using mold. And what about lld? Okay, maybe almost everyone is using my linkers, so it makes me very comfortable to be here. Anyway, the mold linker is my latest attempt to create the best linker for developers. And that really matters, because in most compilations and builds the linker dominates the build time, especially if you're doing a quick edit-compile-debug cycle: you edit a single file and rebuild, the compiler finishes quickly because it compiles just that one file, but the entire executable needs to be rebuilt from scratch, so the link time matters. I've been developing mold since September 2020, so a little under three and a half years; it's relatively new. It's available under the MIT license now; it had been under a different license because I was trying to commercialize it, but that didn't work out, so I decided to go with a plain permissive license. The main purpose is to offer the fastest linker to developers: it's an order of magnitude faster than GNU ld, and it's also faster than my previous linker, lld, as well as GNU gold. To give you a rough idea, on a decent multi-core machine mold can produce about one gigabyte of output per second, so if your executable is two gigabytes, linking takes two seconds on your machine, and that's pretty fast. Modern executables are gigantic, too: if you build LLVM with debug info, for example, the output is around one and a half gigabytes, but it can be linked in about one and a half seconds.
And mold supports almost all major targets except MIPS, and the reason is that the MIPS ABI has diverged too much from the other ABIs. The other ABIs have evolved since 2000, but the MIPS ABI has stagnated since the collapse of SGI, because SGI was the de facto player in that field setting the standard, and no one since then has made any effort to improve the ABI; so MIPS has diverged. At this point I'm not sure we want to continue working on MIPS support, because it seems no one is really making a serious effort to refresh the architecture. But anyway, it supports a lot of architectures, even including LoongArch, which is a newcomer in this field. And despite being pretty new, I think the linker is production-ready, and many people are actually using it for production; I'll talk later about how I tested it. So, from the developer's perspective, this slide explains what the mold linker is. It's written in C++, specifically with C++20 features, with Intel TBB as the threading library. One thing you would notice immediately if you take a look at the source code of mold is that almost all functions and data structures are templates rather than plain functions or structures, and the templates are specialized for each target. I also care about source code quality: ideally you have readable source code, and I put a lot of effort into making it readable. So this is an example of how you write target-specific code in mold: it uses if constexpr. If you're not familiar with C++20, this is a new feature, and the beauty of it is that the condition is evaluated at compile time rather than run time, so this if constexpr block compiles to nothing if the function is not being specialized for PowerPC64 ELFv1.
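The if constexpr pattern he describes can be sketched in a few lines. This is a standalone illustration, not mold's actual code; the target structs and the 0x18 adjustment are invented for the example:

```cpp
#include <cstdint>
#include <type_traits>

// Invented target descriptions: each target is a type whose members
// are compile-time constants.
struct X86_64  { static constexpr bool is_le = true;  };
struct PPC64V1 { static constexpr bool is_le = false; };

template <typename E>
uint64_t finalize_addr(uint64_t addr) {
  if constexpr (std::is_same_v<E, PPC64V1>) {
    // This branch exists only in the PPC64V1 instantiation; for every
    // other target it is discarded at compile time, so it cannot slow
    // down or break their code paths.
    return addr + 0x18;  // made-up ABI-specific adjustment
  } else {
    return addr;
  }
}
```

finalize_addr&lt;X86_64&gt; compiles down to a plain return, while only the PPC64V1 instantiation carries the extra logic.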
So as long as you guard your new code this way, it cannot do anything harmful to other targets, and it cannot slow them down either. This is another example of how we use a C++20 feature in mold. This is a data structure representing an ELF relocation, but there are many variants of relocations: we at least have big-endian and little-endian, 32-bit and 64-bit versions, so in combination we already have four different versions. And the beauty of C++20 is that you can use a requires clause after the template keyword to specify what kind of type parameters you want to specialize for. In this case, this data structure is specialized for little-endian targets using RELA relocations, which is very technical stuff, but the point is that we have two different versions of the relocation data structure, and below this definition we have other versions of the data structure with the same name. We even have a completely different version of the data structure specifically for SPARC64, because SPARC64 has a weird extra field that doesn't exist in any other architecture. But we can define that data structure only for SPARC64, and as long as you guard the code that accesses the field with if constexpr, the compiler will not complain that you're using a field missing from the other variants. So this is a very clean way to compile your code for a specific target. It's not loading... okay, so this is the machine description for a specific target, in this case x86-64. We have a bunch of constexpr static variables as parameters, and they define whether it's a little-endian or big-endian architecture, whether it's 32-bit or 64-bit, and so on. Basically, if you want to port mold to a new target, you define this kind of data structure: you basically copy and paste it and then make the modifications you need. And it's just as simple as that.
And since these fields are compile-time constants, the compiler knows their values at compile time and can optimize the code based on them instead of dispatching at run time. So this is a comparison of the number of lines of code you need to port a linker to a new target; on the one side we have gold. It's not a really precise comparison, because lines of code aren't a direct indicator of how easy or hard it is to port a linker to a new target, but it gives you enough of an idea about the scale of the work you'd have to do. Apparently for gold you have to write tens of thousands of lines of code for each target, but the reality is that most of the target-specific code in gold is just copy-paste. For example, if you wanted to port gold to SPARC or LoongArch or whatever, you would start by copying an entire existing file to loongarch.cc or whatever and then make modifications. So you have a lot of copied code, and that's not a really good way to bring a codebase to a new target. In mold, on the other hand, we have very little per-target code. There is some code outside these files for target-specific handling, but overall the amount is very, very small, only a few hundred lines of code. Now, testing. Testing is the most important and the most difficult part of writing a linker, because, as you know, writing a simple linker is not really hard: it's just a program that takes object files and combines them into a single executable or shared object file. But the thing is, there are so many edge cases, and because there are hundreds of thousands of programs that use the linker (essentially every program uses the linker), every corner case is exercised by some use case out there. So testing is very hard.
We have two kinds of tests for mold, to make sure I find bugs before you would notice them in production use. The first is a set of shell-script-based tests, which are very simple; I have a slide for this. This is just a very simple test case: we actually compile code, try to link the object file with mold, and then actually execute it on the machine. And as you can see, if you have a cross compiler and QEMU, you can run the same test for architectures different from the one you're running on; for example, you can test SPARC64 on an x86 machine. But apparently this kind of test is not enough for real use cases, right? So the other test I'm doing is to try to build all Gentoo packages with mold in a Docker container to find any bugs. And the beauty of using Gentoo is that you can use the exact same command to build any package, and it can also run the unit tests that come with each package, so it's a very easy way to test whether you can build a program and whether the built program works. So I did that, and it takes a few days on a 64-core machine, but it works. The thing is, it's sometimes extremely hard to debug when something goes wrong, but somehow I managed to fix all the bugs I found this way. Well, yeah, it was a fantastic experience to fix all these bugs. But my point is that it's very important to fix all the bugs before users notice them in the wild, because if mold doesn't work out of the box for your project, the next thing you'll do is switch back to your original linker, and you'll never try mold again, right? So, why is mold so fast? Well, we used multithreading, aggressive parallelization, from the beginning; that's essentially why mold is so fast.
But the other thing is that mold is simply faster than the other linkers even in the single-threaded case, because we use optimized data structures and code. Actually, the data structures are more important than the code; as Rob Pike once said, you should write your code around your data structures, not the other way around. So designing the right data structures is important for making a fast program. Here is, I think, a good visualization of how well mold uses all the cores available on the machine: on the left-hand side, lld fails to use all the cores, while mold finishes very quickly using all of them. But the question would be: why do we want another linker even though we have lld? My answer is, first of all, lld is not part of the GNU toolchain, and the other thing is that lld does not support GCC LTO. lld is actually tightly coupled to a specific version of LLVM: lld version 15, for example, can do LTO only for LLVM 15, so of course it cannot handle GCC LTO object files at all. So if you want to do LTO with a faster linker, mold is the only viable option. What about GNU gold? I think the problem with gold is the lack of clear ownership. It doesn't look well maintained anymore, and its original creator, which is Google, has lost interest in maintaining it because they have now switched to lld, so I think the future of gold is not clear. And gold is not as fast as my linker either. So, can we improve GNU ld so that it gets as fast as my linker? My answer is no: I think it's almost impossible to make it that fast unless you rewrite everything from scratch, and if you rewrite from scratch, that would be the same thing I did. And in my opinion, the source code of GNU ld is not very easy to read; it's source code that was written more than 30 years ago and has been maintained ever since.
But people still add new features to GNU ld first and then port them to the other linkers, even though what they're actually using is the other linkers. I think that situation is silly, because people don't really use GNU ld anymore for their real-world projects, so I think it needs changing. And my question is: do we want to stay with the current GNU ld forever? My answer would be: I don't think so, since we have a good replacement. If I can, I'm open to donating mold to the GNU project, so that we could call it GNU mold, if that accelerates adoption. It's not something I can decide alone, because it means a lot, but I'm open to that option if it makes sense. So, the last missing piece for using mold as the standard linker is support for kernels and embedded programming. Userland programs are mostly fine: if you install mold as the system linker you won't notice any difference other than speed. But kernels and embedded programs need more special care about memory layout, because the hardware, for example, forces you to put some data structure or code at a very specific location in memory, and if you're programming an MMU-less computer, you want to lay things out exactly as the hardware memory map demands. That kind of thing is usually handled by a linker script, as you know. But linker scripts, in my opinion, have many issues. The first is that there's no formal specification of the language, only the manual, so other linkers try to mimic the behavior of GNU ld, which of course causes compatibility issues. The other is that the linker script language predates the ELF file format, so not all linker script commands translate directly into ELF terminology, and that causes more confusion than necessary. And I think it's almost impossible to add linker script support without slowing down the linker. So I think we need something better.
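For readers who haven't met linker scripts, the kind of layout control being discussed looks like this in GNU ld's script language. A minimal, invented fragment for a small microcontroller (memory region names and addresses are placeholders):

```
MEMORY
{
  flash (rx)  : ORIGIN = 0x08000000, LENGTH = 256K
  ram   (rwx) : ORIGIN = 0x20000000, LENGTH = 64K
}

SECTIONS
{
  /* the hardware expects the vector table at the very start of flash */
  .text : { *(.vectors) *(.text*) } > flash

  /* initialized data lives in RAM at run time but is loaded from flash */
  .data : { *(.data*) } > ram AT > flash
}
```

Note that MEMORY and SECTIONS are ld-specific concepts with no exact one-to-one ELF counterpart, which is part of the terminology mismatch the talk points out.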
So this is my current approach to support embedded programming and counter support. So I added a very simple command line option which is called section order. And that specifies them how to layer the things. So, and I think that this option alone can satisfy like more than 90% of the usage but I'm pretty sure that that doesn't cover all the usage of linker script. So I need a help from you guys. So because especially in embedded programming world, their programs are not open source and they are not available on GitHub and they tend to be in house program. So I don't know what the real usage is for embedded programs. If you can tell me that I wanna do this with the mold linker, then I can implement that for you. So I would appreciate it if you give me a hint. All right. So this is the end of my slides. Thank you very much. So you mentioned that it's possible to do link time optimization, like as a feedback in the GCC, but in general, is it also possible, how easy is it to do link time optimization inside the linker, like is it possible for the linker to disassemble some instruction and try to put something else there? Okay, so the question is how easy it is to do something like link time optimization but not quite there. So I don't know if I correctly understand your question, but it's... It's basically optimizations during the linking. Yeah, of course, but the thing is... It's not by the compiler, it's all LTO, but it's not by the compiler. So the way how LTO works in the linker is compiler emits. So from the user's perspective, all you have to do is to add hyphen FLTO to the command line option to compiler and the linker, and everything works automatically. But behind the scenes, the compiler emits intermediate code instead of the actual machine code to the object file, and then the linker recognizes that intermediate code. 
And then it calls the compiler back end to compile all things once to the single object file, and then the link continues as if that gigantic single object file were passed to the linker. So in that sense, you can do anything with the intermediate file inside the compiler back end because the linker doesn't really care what is going on behind the scenes. So, well, does that answer your question? Yeah, so you said that you tested more against being all of the factors in gender Linux. How long did that take? How long does one count take? So how does it take to test all gender packages against more the linker? And it takes, if I remember correctly, three, four days on my 64-core machines, 64-core machine with 200 gigabytes, 256 gigabytes memory. And yeah, it's a very long time, but it's definitely doable on a beefy single machine. One target? Only for x86-64 because in order to cross-compile everything to different architectures and run-g test, you have to do that on QMU, which slows down like 100 times than the real performance on the computer. Yeah. Yeah, I can't. Yeah, sorry. What kind of mistakes did you make in LLD that you're fixing in mode? And are there any mistakes in mode that you think are interesting? So the question is what mistakes did I do in LLD that I fixed in LLD? And did I make any other mistakes in mode? That's a good question. The first thing is the relocation processing in LLD wasn't as good as mode. So it's complicated. It's hard to maintain, and it's slower than mode. So I fixed it. And the other thing is that LLD uses templates to support L6432, big-endian, little-endian, but it's just four instances. So it doesn't instantiate for each target. So you cannot use the technique that I used for Spark 6c4, that I showed you on the slide, for example. And did I make any mistake in mode? Maybe not. I am pretty satisfied with the quality of mode. I think that I really made... I'm personally enthusiastic about the code of the readability. 
So I tried to make the source code as readable as a book. And I don't know if I could achieve that goal, but the point is that, well, yeah, it's definitely readable. One last question. Are there any plans to ever support any other file formats? Oh, so the question is, can you support other file formats? Am I planning to ever do that? Well, I did for macOS, which is a Unix-like environment, but it uses a different file format, which is called Mach-O. Yeah, and I succeeded in creating a faster linker for macOS, which is much, much faster than Apple's linker. But the thing is, last year in September, they released Xcode 14 with their own new linker. So there was an ongoing effort within Apple that I wasn't aware of. And their new linker is as fast as mine. Maybe they read my source code as well, because it's available online. But it's AGPLv3, then? Oh, my linker is now available under the MIT license. So it's, yeah. So maybe you'd have to ask Apple. Well, Apple haven't released their source code yet. So, okay, we have to stop. So thank you again.
Build Distribution for Maintaining the Famous GCC 4.7
It's great to see so many people interested in GCC from more than 10 years ago. Okay, let's get started. So we are taking a step back in time by more than 10 years, I think. Yes, almost, yeah. Okay, so Oliver Reiche is going to talk about the maintenance of GCC 4.7, and the reason for that is that GCC 4.7 has a special property which I'm sure you will talk about quickly. Exactly. So, hello everybody, my name is Oliver Reiche. I'm working for the Huawei Research Center in Munich. And yeah, I would like to talk about a build distribution for maintaining the famous GCC 4.7. And I would like to start with dissecting the title a little bit. So, first of all, what is the famous GCC 4.7 and what is it famous for? And then also talk a little bit about what we mean by the term build distribution. Then I will show a little bit of the patches that we applied to that GCC version and show a little bit about the bootstrap process before I wrap up the talk. All right, so GCC 4.7. Well, there is a movement that's called bootstrappable builds, and this movement strives for building all software from source. And of course, you have to start somewhere. So, in practice, you usually start with a minimal set of binaries that you need to start the bootstrap process. And then at some point you bootstrap your C compiler, and at some point you want to bootstrap your C++ compiler, and then you might ask yourself, how do I build a C++ compiler without a C++ compiler? Because most modern C++ compilers are actually written in C++. So, this is exactly where GCC 4.7 comes into play. It plays a key role for the bootstrappable builds movement because it's the last GCC version that can be compiled with only a C compiler. So, if you want to enter the realm of C++ and everything that is beyond it in this bootstrapping process, you will need this version of GCC. And it's also about software preservation because, yeah, it's a quite old code base.
It does not build out of the box with modern compilers. It does not build out of the box on modern systems. Modern systems and modern compilers use by default usually the C11 standard, and this code base has some issues with that. And GCC 4.7 does not build reproducibly in all scenarios. I will come to that a little later. So, the next thing from the title is build distribution. I mean, this is a very fuzzy term that we invented. So, what do we mean by that? Well, we actually have a project that's called Bootstrappable Toolchain. There's a little bit of advertisement here on the right side. You can build this project using our very own open source build system that's called JustBuild. And if you use this project, you can bootstrap the latest compilers and latest build tools with it. And all you need, besides our build system, is a reduced binary seed: we need the coreutils installed, we need a POSIX-compliant shell, and some C compiler with a working C standard library. So even TinyCC will work. And what we do is: all of those toolchains here are actually built from source. So, we didn't reinvent the wheel. We used the existing build descriptions, Make for GCC or CMake for Clang, and our build system basically takes care of orchestrating the build and calling those foreign build systems. And yeah, you might have noticed, Make and CMake are not part of our initial binary seed. So, we have to bootstrap those first. This is also what our build system takes care of in this project. So, what we do basically is an on-demand bootstrap of all the necessary tools during this process, to make sure that we have everything that we need in the next steps to do the bootstrapping of the next toolchains. And by doing so, we basically unfold a minimal Linux distribution on the fly that is barely enough to just build the toolchains that we are actually interested in.
And yeah, this minimal Linux distribution is what we're referring to as the build distribution. All right, next I would like to talk a little bit about what patches we applied to patch up GCC 4.7. Well, most of them are actually maintenance patches and backports from newer GCC versions; in the square brackets you see the GCC versions we backported those commits from. In the PDF those are clickable links that bring you directly to GitHub. And yeah, just to mention a few: the largest commit was the general musl support. And this is just an excerpt here; of course, the commit is much longer. This introduced the entire macro infrastructure that is actually necessary for GCC to work with musl. Another interesting commit was the actual linker support for musl. It adds this magic string here, which is the hard-coded path where GCC expects the program interpreter to be located. But much more interesting is how we patched up reproducibility for GCC 4.7. Well, if you use our build system or any other modern build system as a build orchestrator, they usually build in isolation. So, all of the stuff that runs in the action (the make command, the make binary, everything that is needed to get the job done) is actually located in an isolated directory. It could be a temporary directory at a seemingly random path. It could also be located in the user's home directory. And there's a problem: for instance, those two binaries, CC1, the C compiler, and CC1PLUS, the C++ compiler, which you heard about today already, contain checksums. And those checksums are computed from many things, and part of that is the path of the linker that was actually used. And because we build in isolation, the linker is also located in this temporary isolated directory, and that path is seemingly random and finds its way into the final checksum.
And the other problem is that the relevant object files for linking those binaries are also hashed to compute this checksum. And well, the object files contain debug information and therefore also somehow contain the build directory. So, we needed to patch that as well in order to compute a reproducible checksum that is independent of the build directory. Which is actually fairly simple: we know the linker, we control the linker, so it's actually not necessary to hash the full path. So, we just strip the path and replace it with some constant string. And of course, we copy the objects that are relevant during the process to some temporary directory, strip them of any debug information (using strip for the target, of course), and then hash those to compute the final checksum. So, at the end, we still get a meaningful checksum that somehow represents how those binaries were built, while still being reproducible in the sense of being independent of the build directory. And all of those patches that I just showed will then be automatically applied during our bootstrap process. So, what is the process? What does it look like? We actually have multiple stages until we end up with the modern compilers that we actually want to build; because of time limitations I will only go into the details of the very first stage. So, we start off with just having coreutils, a shell and some C compiler. And the very first thing that we do is bootstrap certain parts of busybox, because it includes very important tools that the autotools and the autoconf scripts will need later. And we restrict ourselves to those very specific parts. So, grep, find, sed for instance, and of course we need patch for patching GCC later. And with those tools at hand, we can now bootstrap make. Make can be built with make, of course, but they also have a bootstrap path.
Luckily for us, there's a shell script, and with a little bit of magic, we end up getting the make binary, and now we have the make build system available. And then together with those tools and the make build system, we can bootstrap the archiver from the binutils sources, and then we also have an archiver available for producing static libraries. Okay, now we can do the first real build. So, with those at hand, we can now build the latest binutils the normal way it's meant to be built, configure and make, and then we can patch GCC and build GCC. If you're interested in running this on your machine, it should work on any x86 64-bit Linux system. You only have to install JustBuild, clone this project and run this command. It should give you a working GCC 4.7 installation. Okay, so let me wrap up the talk. We tested that on many systems. It should work on any x86 64-bit Linux system. We also tried to test it on very different systems like NixOS, where actually everything is located at some custom path. We also tried very reduced images that only contain a TinyCC and a musl libc. And with our project, together with our own build system, if you have a C++ project and use our build system, you can easily import this toolchain into your project. And then you can make the toolchain a committed dependency of your project, which has several advantages. Of course, it's easier to set up for the user. They don't need to have a certain C++ compiler installed. You can just clone your project, run the build, and then the first thing that happens is that the toolchain is built. And don't worry about compile times. Of course, bootstrapping the toolchain takes a while, but this only needs to be done once. So the next time you build, the toolchain is like a static part of your dependency chain that doesn't change, so it will come from cache. Of course, if the toolchain is committed to your project's history, git bisects are also easier.
And we can even show, if you do it right, that you can predict the binary hashes of the binaries that your project produces. Because you have a very confined toolchain, you know exactly what the output should be, if you use the musl libc, stripping, and static linking for everything. We have a demo application showcasing that. We can predict binary hashes for this project that should run on every x86 64-bit Linux system. All right. Last thing, I would like to encourage everyone who's interested in this to just install JustBuild and try those commands yourself. It will take about 30 minutes. If it doesn't work on your machine, please let us know, because this is super valuable information for us to make this process even more stable. All right. That's all. Thank you very much. Thank you. And we will allow for maybe three minutes of Q&A because we started late. And actually, I want to start with one question from the Matrix online channel, to give them a chance to have some of their questions answered. So Ismael Luceno asks if there is any collaboration with OpenBSD, because they have been maintaining their own fork of GCC 4.7 as well, I guess because of the C++. Okay. No, there's no collaboration. So the question was, is there any collaboration with OpenBSD? Yeah, very good. Okay. Because they maintain their own fork of GCC 4.7. No, there is not. This is actually a good question. I hadn't heard about that before. So this is already valuable input for us. Okay. Got a question? Is this partly motivated by things like bootstrapping for trusting trust? What have you tried to avoid the possibility of your compiler being subverted? I didn't recognize it. Trusting?
Trusting trust: a subverted compiler could insert a backdoor when it compiles source code, and when it recompiles itself it could reinsert the backdoor, so that the backdoor is not present in the source code. Okay. Okay. It's pretty hard to repeat that question, for me. Let me just paraphrase. Yeah. Okay. So the question was whether this is security related. Yes, to some extent it is security related. So one idea is to have the possibility, if you build reproducibly, to say: okay, this source code compiles to this binary and it will have this hash, pretty much independently of the system you're building on. Of course there are some restrictions, but that gives the opportunity to say, well, we can basically prove that this binary originates from that source code and that source code alone. That is actually also one of the motivations. Yes. It looks like time's up. One more question. One more. Do we have the next speaker in the room? So, yeah. I was surprised that it's machine dependent. I wonder why different architectures aren't easily done. So the question was why it is machine dependent and why different architectures weren't done. The reason is just that we were focusing on x86 64-bit Linux because it's the most widespread right now. And it's also quite a bit of work to patch GCC up to make that happen. So we basically just did not have the time to look into other architectures. But we already have it on our to-do list. We want to at least support ARM 64-bit. And then let's see where we go from there. All right. I guess we have to go. Yeah, then. So at the end of this process you get a C++ compiler, but it is an older C++ compiler. Yes. I was just wondering how many stepping stones there are to get to the latest. All right. So the question was that after the bootstrap process of stage zero, we just have GCC 4.7.
This is a quite old compiler, and what other steps are necessary to reach modern compilers? This is a very good question. Yes. So modern compilers usually need C++11 support. GCC 4.7 does not have that. And so the next stage, stage one, is actually bootstrapping GCC 10.2, which is to my knowledge the first one almost completely supporting C++11. 4.8? Oh, is that right? Okay. So current GCC can still be bootstrapped with GCC 4.8. Oh, okay. Okay. But not that clear. Okay, but we definitely need one more step. And yeah, we've got that covered: currently GCC 10.2 is stage one. And then from there we can go on. So you don't need more than one step after 10 years? Yeah, exactly. Yeah. And I guess the advantage of picking a later GCC version is that we don't have as much patching for new back ends and configurations and stuff like that, because that's then all shiny and new. And it's still maintained, GCC 10.2. Yeah. You're up next. Okay. I hope that'll be the end. I'm sorry. Okay. Thank you, Oliver. Thank you. Could you help me with this? Yeah. I'll cut it off. I'll see you there. Thanks. Thanks.
Sega Dreamcast Homebrew with GCC
Okay, cool, cool. Okay, so up next is Falco Girgis, telling us, I'm sure, an entertaining story about the Sega Dreamcast. How did you get this idea? I have an entertaining story about Sega Dreamcast homebrew with GCC. That's true. Not the standard thing you would do. You ready? Alright, so I'm talking today on behalf of the Sega Dreamcast community. I'm actually a developer on the independent homebrew SDK called KallistiOS. And we're talking about how basically... Yeah, no problem. We good? Okay, yeah. So basically this entire homebrew community is powered by GCC. And I'm just showing you the kind of stuff that being part of the GCC ecosystem is allowing us to do. So first of all, what is the Sega Dreamcast? Maybe some of you don't know, because it only had two years in the limelight. It was released in 1999 and it was only commercially viable until 2001. Despite that fact, it had a substantial effect on the gaming industry. It left a huge legacy, and it competed directly with the PlayStation 2; a little bit less with the GameCube and Xbox, because it didn't last that long. And then a little bit about it: it had a Hitachi SH4 CPU (SuperH is now owned by Renesas) and an Imagination PowerVR2 GPU, which was the predecessor to what eventually got used in the iPhone. So that same GPU technology actually went on to do quite a lot of fancy stuff. And then there's a little bit extra about it. But the key thing here is the Hitachi SH4 CPU. And that's what has made our destinies intertwined with GCC. Because GCC is the only compiler that supports the SuperH architecture. So why the Sega Dreamcast? What's the big deal? I think there are a lot of strong arguments for doing it. In an era where people are into Raspberry Pi programming and embedded systems, it offers a really good middle ground between high performance, because it's good at graphics and floating point operations, and embedded programming.
We have a lot of established tools that are really good. As you'll see, we have really modern compiler support. We have a lot of language support. Thanks to Matt Godbolt we have SH4 in Compiler Explorer. So you can actually look at what the disassembly of your Sega Dreamcast code looks like to make sure it's optimized. And as a beginner you can treat it like just a kind of weak PC using cross-platform APIs. Or as you mature and advance you can go down to the hardware level and optimize for it. There are also a lot of cool toys and peripherals. There are light guns, Samba de Amigo maracas. There's the visual memory unit. And the visual memory unit itself, the little VMU, has its own little homebrew scene. So as I was saying, we have a pretty decent community, and because our independent SDK uses no Sega code, we're actually able to release our homebrew commercially and sell it online and through retail stores and stuff like that. So this is how many we've released each year commercially, and there's just a collage of different commercial games. So as you can see, you're not going to get rich on Dreamcast, but you know, if you're making a PC game within that spec range, maybe you should check it out. So this is a little bit about KallistiOS before I get really deep into some code stuff. This is a little bit about the architecture. So KallistiOS is like a big SDK, but it's also like an operating system. We have a kernel. We integrate with Newlib 4.4.0, which as far as I know is the latest one that's out there. That's where we do file I/O, date/time, malloc. We have a really cool virtual file system which abstracts away the CD-ROM. You can stream from your PC. You can use the new SD card readers. Networking: we even have IPv6 on this thing. We have examples. We have add-ons and ports for OpenGL, OpenAL. For the toolchains, as you'll see, we have GCC 13.2, the latest Binutils, and GDB going on it.
We're trying to take this retro game console and let you use the latest and greatest versions of the languages of your choice on it. That's a little bit of what we're going to touch upon. This is a little bit about my Dreamcast. I'm not going to go into too much detail, but as you can see, it's like a car. You can totally spend all your money on it if you want and go to town on it. You don't need to do any of this, though, to develop for it. That's another big point: as long as you can burn a CD-ROM, 90% of the Dreamcasts out there can boot your homebrew game, as long as it's burned a certain way. That's part of why the homebrew scene became so big. The first thing we're going to look at is C23. We wanted C23 on the thing. What did it take to get there? It didn't take as much as we thought. One of the first things that we had to do was support atomics from C11, so that you can say atomic_int, atomic_bool and, since we have a preemptive multi-threading scheduler, have atomic variables that aren't interrupted and such. Unfortunately the SH4 is old, so there's no hardware support for atomics. But since it's single core, it's not a big deal. You just disable interrupts around it, you load or store your value, and then you enable interrupts afterwards. So this is actually offered by the compiler, the SH compiler, as the soft imask model. What it did not offer is 64-bit and generic atomics. So we had to implement that, and there's the C code for it; it's kind of an ugly C macro to do it, but you can basically see: we just disable the IRQ, we load or store a type, and then we enable it later. And that's the basis of our atomic model. So if the scheduler can't get interrupted when you're accessing an atomic, then it's atomic. Then we validated the atomics. You'll see a bunch of the output there is from my Dreamcast.
So we ran through a bunch of tasks, a bunch of different atomics, an atomic buffer, and yeah, the atomics work now on the Dreamcast. It's pretty nice. Something that was much harder was adding thread-local storage support. So in C and C++ there's a thread_local keyword, and there's a lot of stuff you have to do for that. It's a delicate interplay between the compiler and the operating system. On the operating system end, don't worry if this code is a little dense, that's the whole point, this was actually a pain, and that code is just there to show you what a pain it was. With every thread, you have to allocate an extra block for thread-local storage, with the .tdata and .tbss segments, and then every time you swap context, you have to swap the thread pointer to point to the new thread's chunk. So we did that, and this is some of the validation tests for it. What actually makes it hard is that you can align your TLS storage arbitrarily, so we had to compensate for arbitrary alignment; that was all the extra logic that was more than just a malloc with a fixed size, you have to also align those segments. So yeah, now TLS works on the Dreamcast. And then that was pretty much it, we got C23: we have nullptr, auto, typeof, all the cool stuff that C23 added. __VA_OPT__ is now in C23, alignas, static, constexpr, compound literals, one of my new favorite things to use right there. This is just me throwing a bunch of C23 at a breakpoint API. Oh, binary literals, a pretty nice C23 addition. This is a little video, uh-oh, was a little video, it's not working. Okay, well, cool. Maybe afterwards you can check out my Twitter, all the videos are on there in case they don't work. So C++20 and 23 is up next. What we got for free, we actually got a whole lot for free, it's kind of cool.
Concepts, constraints; modules are not fully supported by GCC yet, but hey, everything that was supported worked fine on SuperH, we were pretty shocked. Ranges: look at that crazy range stuff that we can do with C++23 on the Dreamcast. Pretty sweet. std::format, and this thing, a static, variadic, multi-dimensional, overloaded subscript operator. You can do that on your Dreamcast now, it works. That was pretty awesome. What we had to earn: std::async did not just work for us, because our kernel had a serious bug in it; nothing had exercised that code path with the ferocity that modern C++ did, and we found a race condition there. std::random_device took a little bit of work, I'm going to get into that. std::filesystem is not quite supported. Yeah, that's a sore point for me right now, we're working on that, that's our fault. We're not propagating errno properly with Newlib; working on that. Time zones: well, the Dreamcast doesn't really have a time zone, so there's not much we can do about that, although I will say we gracefully don't support it, so it's not a big deal. Stack trace is one where it doesn't look like there's much we can do. Yeah, C++23 std::stacktrace: I got the library compiling for it, but it looks like deep within the library, where it's trying to look up the binary path for reflecting over the ELF executable to unwind the stack and look up the symbols, there's just not really any way for us to tell it where to look over the network for a Dreamcast, so yeah, there's no stack trace right now. Maybe we can hack something up for that later. std::random_device actually works fine now, so you can do all this crazy random stuff.
This is the Newlib hook that we actually hooked into; we supplied the entropy from a bunch of uninitialized RAM, so that's where the entropy is coming from: uninitialized RAM, which goes to std::random_device. And then this is just a uniform distribution being generated on the Sega Dreamcast, showing, you know, it looks pretty uniform. Yeah, C++ concurrency meets the Dreamcast, this is pretty exciting. There's a bunch of interesting C++20 stuff there, so I made a huge test thing that we're running on the Dreamcast, which is generating a bunch of std::async threads, testing everything from semaphores, latches, shared locks, condition variables, barriers, everything. And at this point, I guess I can't show it because the video is not loading, but it would just be a big printf printout showing that all the tests are passing. So, yeah, as far as I know, including coroutines, everything from the C++ support in GCC up to C++23 is working fine on the Sega Dreamcast, because you definitely need that level of concurrency to work with this machine here. Alright, let's see. Yeah, I had another little video that's not... I don't know why they're not loading, but they're all on my Twitter. Alright, Objective-C. There's a little more to this, for a couple of reasons. GCC, it looks like, doesn't quite support the latest version of Objective-C, Objective-C 2.0; I guess that's because Apple didn't want to fund it anymore, I'm not sure. So, we had to make do with what we had. It looks like Objective-C might be a little broken right now for cross compilation. We had to patch a build script to get it to cross compile for bare metal. It was failing at a compilation stage and we just basically commented it out in one of the config files, and then it worked. We were able to build libobjc.
The problem with, you know, building plain Objective-C is that libobjc is a C library that lets you access all of the object-oriented features of Objective-C. It's not very pretty, it's not very idiomatic Objective-C, but that's the raw runtime. In order to do anything useful with Objective-C, or do anything that you normally associate with Objective-C, you need the Foundation standard library, which is typically associated with Apple. That's where your NSString, NSObject, all that comes from. Luckily, GNUstep has an open source implementation of that. So, we tried to port that to the Sega Dreamcast to give you, you know, this very big, nice Apple API that you definitely want for your Sega Dreamcast homebrew. And that went pretty well. So now you've got data structures, you've got autorelease pools, you've got NSString, NSLog, all that kind of stuff on the Sega Dreamcast. You know, that's just basically some Hello World stuff doing that from Objective-C, and that's the Dreamcast output. Now, what gets a little more interesting is the concurrency model for Objective-C, which is actually pretty cool. We support NSRunLoop, which has NSTimer, which lets you schedule periodic timers. They're used for things like GUI updating; you can use them for game engine logic. And then we're firing NSNotification events asynchronously from that event loop. And the video was really just showing a bunch of events firing asynchronously on a Dreamcast. I don't know why it's not working. But anyway, so you've got the Objective-C concurrency model as well. And for the record, if you need Objective-C++ with C++23 to get everything, that works too, if you want to mix both of them. Okay, so then we tried to get D on the Dreamcast. This was not done by me, this was done by someone who goes by Luna, the Luna Foxgirl on Twitter. Thankfully, she helped us because I didn't really know much of anything about D. She did a great job. What was involved with bringing D to the Dreamcast?
Well, we used the GDC front end for GCC. We cross-compiled it for sh-elf. She wrote a custom runtime to do some of the stuff that the D runtime does, which I'm a little sketchy on, but I believe it's stuff like lifetime management, allocation, deallocation, the entry point. She did not use the garbage collector, not because it won't work on the Dreamcast, because we run Lua and it's fine, but because she wanted manual lifetime management. And at this point, we did not try to do libphobos for the standard library. We actually are just binding to libc for that kind of stuff. And then that's kind of a folder view of what the project looks like. It's called DKOS, which is the D bindings for what we did. And as you can see, I was worried that a bunch of the low-level stuff we were doing in C and KallistiOS would have to change. Like, hey, can you bind to inline assembly? What are you going to do about the C macros? But actually, D is quite capable. And here's some of the crazy stuff that she either rewrote or bound to from D. So there's inline assembly. It can handle flexible array members, inline functions, macros, versioned enumerations. I started getting a little jealous there as a C and C++ programmer, actually. It's really good stuff. So yeah, D meets the Dreamcast. So here's some fairly idiomatic looking D; there was a video there, and all it was doing was basically animating the background color with the PowerVR on the Dreamcast, the frame buffer, and printing some stuff to standard out. And it worked great. And let's see. Here was one more video, which was a bunch of animated cubes showing 3D accelerated graphics with the D language. That's on her Twitter, actually. And then finally, everyone had been asking the entire time we were doing this on Twitter, hey, what about Rust? Hey, what about Rust? And we're just like, hey, man, I don't know what to tell you. LLVM doesn't support SH4; take it up with them.
And then GCCRS came along and happened. We weren't having any luck with rustc at the time. We couldn't get it cross-compiling properly for sh-elf. So we started playing with GCCRS, even though it's very, very new, in its infancy. I mean, we were seeing for loops being added almost in real time, you know, like we pulled down, oh, you can use a for loop now. It was pretty cool, you know? So this is not stuff that is necessarily ready to be played with, but we don't care. It's what we do here. There's no borrow checker yet, so you'll notice everything is just unsafe, but it's still fun and it's still Rust. So this is, oh man, the video's not there. It's a rotating cube that is driven predominantly by Rust. It's unsafe, as you'll see. The main control flow is Rust. The OpenGL API is calling into C for that. And then there's a mystery third language that you're about to see, in which we implemented miscellaneous support utility functions for things that GCCRS wasn't able to cope with just yet. So, all right, we're going to go into that demo here. All right, so on the left we have the Rust, which is calling into C. On the right we have the utility functions, which are Fortran. So we had C, Rust, and Fortran, all on the Dreamcast. And yeah, here was the rotating cube. So, yeah, I would say we inherited quite a good deal from the GCC ecosystem. And yeah, may your homebrew be powerful and good and fast, and yeah, that's it for us. I just want to say thank you to everyone who contributes to GCC, and to GCC in general for supporting us, for supporting the SH backend. If you're interested in looking into any of this stuff, that is a link to our Wiki page, which has everything on how to set this up. You can do it from Windows, Mac, Linux. It's mostly just running a script that works in any POSIX environment that sets up the cross compiler. And I wanted to say that we are just one community that's powered by GCC and is modern.
I'm friends with the guys who do the PSPSDK, the Sega Saturn stuff, libdragon for Nintendo 64, SGDK for Sega Genesis, and the Vita SDK. I can tell you right now we're all using GCC. So, yeah, there's a lot of people out there who owe you a lot. And if you like this kind of stuff and are interested in hearing more, you can follow me on X or Twitter or GitHub. And that's it. Any questions? Over there you actually have sitting one of the Fortran maintainers. Really? Oh, that's awesome. We have a couple more in the room, actually. Oh, yeah, but... No, no, no, that's... Oh, I'm sorry. Our application at the moment is targeting libronin, basically. Oh, yeah, yeah. Which is a good library, but why that over KallistiOS? Because our app has been targeting the Dreamcast for the last 12, 15 years. It's developed by Marx, I think, in North Marx. Oh, my gosh. Oh, okay. Yeah, I know. Yeah, yeah. I was wondering, is it fairly easy to install your GCC chain? Because trying to patch up GCC to suit libronin's quirks is a pain. Oh, you should totally use our toolchain. Yeah, our toolchain should definitely work. And our scripts, there are so many people in the Dreamcast community that by now they're pretty battle-tested. Like, people will want it for Mac, Ubuntu, every flavor of Linux, Windows with Cygwin versus Windows with WSL. You know, ours is pretty solid at this point. You should definitely check it out, actually. I'll definitely be trying to pull it. It's pretty nice, yeah, yeah. But, oh, that's really cool, though. Very nice. Anyone else? Yeah. Which version of OpenGL are you supporting? All right, so the latest you can get on the Sega Dreamcast is 1.1 because we don't have any shaders. It's all fixed function. But I will say, it's one of the most epic late-stage GPUs that's fixed function. We have a lot of the stuff that went into shaders in hardware. Like, we have hardware-accelerated bump mapping.
We have some things called modifier volumes, which are really cool, that you can use for cheap shadows and stuff like that. So there's a lot of cool stuff you can play with, despite it being OpenGL 1.1. You guys ever heard of Raylib? Yeah, we actually just got a port of Raylib that sits on top of GL 1.1. So it's really cool being in the Raylib community right now. And like, someone makes a game for PC and you're like, hey, check out your game on my Dreamcast. It looks pretty good. And they're like, what's a Dreamcast? But yeah. Anyone else? Yeah. Well, as you know, I'm the SuperH kernel maintainer. I do know that. My hero, man. There's actually... the SuperH backend in GCC is actually still in a questionable state. Oleg Endo is watching this. So yeah. Well, he hasn't been working on it so much recently. Yeah, yeah, yeah. There used to be two people working on it. So if I'm seeing now there are so many people working on SuperH, it would be nice if some of those people came to the Debian community, or there's also a Linux-SH IRC channel on Libera. Because doing this all alone, like what I'm doing in Debian, is quite a burden. So there are some people who would like to help also improve GCC. Absolutely. So the Linux kernel almost dropped the SuperH architecture and he saved its life. So yeah, we owe this man a great debt. And yeah, I meant to reach out. Definitely. Anyone else? Oh, yeah. I wanted to ask, as I know the Dreamcast had some sort of support for Windows CE for Dreamcast. Yes. And was there any plan or something about that? Because I remember that system ran on the Windows CE platform and, if I'm not 100% mistaken, GCC might have a Windows CE target. I'm not sure about that, because when the Dreamcast was released, there were two SDKs. You could use the one that was Windows CE, which a lot of games used and it was very impressive. It supported a lot of the Windows kernel. And there was one that was pure Sega.
But the thing is, we try to distance ourselves from that, because those are official proprietary SDKs. They're not independently developed, so you can't really sell your homebrew with that stuff. So I don't know too much about that, to be honest with you. Yeah, sorry. Anyone else? Yeah. We actually have, there's a giant chart on that wiki page that I linked to. There's a giant chart going back to like GCC 4, where we are running one of our polygon benchmarks, looking at performance versus binary size versus a few other variables, and it's kind of interesting how it's varied across versions of GCC. Definitely GCC 13 is not the best or the worst, and it's not a linear trend either. But yeah, you can definitely take a look at that, and that's a very good question too. Yeah, please. If anyone wants to port anything else, we are very interested. Okay, thank you. Thank you.
The secret life of a goroutine
It's time for our first actual talk of the day, which is by a very frequent speaker who I didn't have to look up the introduction of, because every time I look at his talks, it's like, wow, I learned something very deep about Go. So, small applause. Okay, just... Hello, everybody. Well, I'm going to talk about the secret life of a goroutine. This comes from my interest in how Go works internally, and I was investigating how the goroutine works internally. So, when I started investigating it, my idea of how goroutines were created and all that stuff was something like this. A caring mother with a baby in her arms, taking care of that beautiful, full-of-joy baby. It wasn't like that, okay? I started digging into the code and I realized that it's more like this. A necromancer raising the dead. I was like, why? There's a reason for that. But before that, I'm going to talk about something more general, which is the Go scheduler. To understand how the goroutine works, we need to understand how the scheduler works and how it is shaped. So, let's start with the different pieces of the Go scheduler. One of them is the P struct, which is the representation of a virtual CPU. Whenever you set GOMAXPROCS, what you are setting is the number of Ps that the scheduler has. And a processor, as I said, is a virtual representation of the CPU. It can have a status that can be idle, running, syscall, or gcstop. It has associated the current M. We are going to see what an M is in a moment. Then each processor has a queue of goroutines that need to be executed, and a list of free goroutines. We are going to see what free goroutines are later. And, of course, other metadata. This is a very shallow explanation of the scheduler. This is an oversimplification. Of course, it's more complex than that. But, well, there's a lot of other metadata inside the P struct. Let's talk about the M. The M is the representation of an operating system thread.
It's what is executing your code on the CPU. And it normally has associated the current goroutine that is running on this M, on this machine, and the current processor that is associated to this M, which can actually be nil. There are some cases where the M is not associated to a processor. But, in general, they are associated. And other metadata. Let's talk about the scheduler itself. On top of all these Ms and Ps, there's a struct called sched. It has a list of all the idle Ms, all the Ms that are not doing any work; all the idle Ps, processors that are not doing any work; a list of global runnable goroutines, a queue of work that is not associated to any specific processor for now; and a list of global free goroutines. Okay. And the star of our show, the goroutine. There's a struct called g. That struct represents a goroutine. And a goroutine is composed of a lot of stuff, but mainly you have a stack, which starts as a two-kilobyte chunk of memory. The program counter, which is similar to the program counter in a thread, pointing to the current instruction that is executing. The status of the goroutine, which can be running, waiting, runnable; there's a lot of different statuses. The current M that this goroutine is being executed on right now. And the wait reason. If the goroutine is waiting, it has to be waiting for something; there has to be a reason for waiting. And that's the wait reason. There's a lot of other metadata. But let's take a look at the whole picture. As I said, we have the scheduler at the top left with a list of free goroutines, a list of runnable goroutines, a list of idle processors, idle machines. And we have running processors with running goroutines associated with machines and all that stuff.
Also, another interesting thing is that at a global level in the runtime, as global variables, we have a list of all the Ms, a list of all the Ps, and a list of all the goroutines. Those really are three global variables in the runtime. Okay, but how are goroutines created? This is where the necromancer-raising-the-dead metaphor comes into place. Because whenever you create a goroutine with the go keyword, you imagine you spawn a new goroutine and start running things on it. But that's not what is happening. There are two ways of creating a goroutine. One option is to create it from scratch, and the other option is to reuse an old goroutine that is no longer working. So this is what is happening. Whenever a goroutine finishes, its state is changed to dead. So all those free goroutines, actually, they are dead goroutines. So whenever you need a new goroutine, you can reuse one of them. Or, the other option, if there's no free goroutine, no dead goroutine to reuse, you create a new goroutine full of life, you kill it, and then you raise it from the dead. So that's the process. And that is actually how it works in the source code. It was shocking for me and it was a funny way of representing this. So let's see an example of that. Imagine that I have this goroutine here that wants to create a new goroutine. What it's going to do is pick one of the free goroutines in the free list and raise it from the dead, convert it into a runnable, put it in the queue of runnable goroutines of the processor, and call the scheduler, and the scheduler is eventually going to execute that goroutine. Another option is this goroutine here wants to spawn a new goroutine, but there's nothing in the free list of the processor.
So it's going to go to the global free list of the scheduler and pick a chunk of them, move them to the processor, and then pick one of them, raise it from the dead, and add it to the queue. And finally you have the option where this one wants to create a new goroutine, but there's nothing in the global free list either. So what it's going to do is create a new goroutine, kill it, and then raise it from the dead and put it in the queue and all that stuff. So that's how goroutines are created. Let's see what the life of a goroutine looks like. A goroutine can go through a lot of different states: from runnable to running, from running to waiting, from waiting to runnable, from running to preempted, from preempted to waiting. There's a lot of stuff. Let's see all these transitions one by one. From runnable to running. That happens when, for example, a goroutine has finished its job or a goroutine starts waiting for something. So it's going to call the scheduler. The scheduler is going to try to find another goroutine to execute. The first thing it's going to do is try to find a goroutine in the local processor, in the runnable list of the local processor. If there's nothing, it's going to go to the global runnable queue, take some of that, move that work into the processor, and schedule one of those goroutines to be executed. Then, if there's nothing in the global queue, it's going to go to the netpoller. The netpoller is the system that allows Go to do I/O work in an efficient way. What it does is do the I/O work and, whenever it's finished, make the goroutine runnable again. But sometimes we need to find work to do. So we go to the netpoller and check if something is already done and start executing that. If there's nothing in the netpoller, we are going to steal work from other processors.
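The reuse-or-create flow for goroutine creation described above can be sketched as a simple free list in user-level Go. This is not the runtime's code, just an illustration of the pattern (the real logic lives around the free-list lookup and fresh-allocation paths in runtime/proc.go); the `G` and `P` types here are mine:

```go
package main

import "fmt"

// G mimics a runtime g: it is either dead (on a free list) or runnable.
type G struct{ status string }

// P holds a free list of dead goroutines, like the processor's free list.
type P struct{ free []*G }

// getg mirrors the reuse-or-create flow: reuse a dead g if one exists,
// otherwise allocate a fresh one (which starts dead), then "raise" it.
func (p *P) getg() *G {
	var g *G
	if n := len(p.free); n > 0 {
		g = p.free[n-1] // reuse a dead goroutine from the free list
		p.free = p.free[:n-1]
	} else {
		g = &G{status: "dead"} // newly created goroutines start dead
	}
	g.status = "runnable" // raised from the dead
	return g
}

// putg is what happens when a goroutine finishes: it dies and goes
// back on the free list for later reuse.
func (p *P) putg(g *G) {
	g.status = "dead"
	p.free = append(p.free, g)
}

func main() {
	p := &P{}
	g1 := p.getg() // nothing free: created dead, then raised
	p.putg(g1)     // finished: back on the free list as dead
	g2 := p.getg() // reused: the same object, raised again
	fmt.Println(g1 == g2, g2.status) // true runnable
}
```

The point of the pattern is allocation reuse: the 2 KB stack and the g struct of a finished goroutine are recycled instead of being garbage collected.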
And if not, we are going to help the garbage collector in the mark phase. Well, once we have found a goroutine through all that process, we are going to mark it as running, assign the machine, the operating system thread, to that goroutine, and start executing the code. Another transition is running to waiting. One of the interesting parts of this is that it exemplifies how goroutines are cooperative entities. They cooperate to give you the sensation of concurrency. So when the goroutine needs to wait for something, it is the goroutine itself that parks itself. Whenever I have to write to a channel, for example, if the channel is not buffered and I have to wait for something, what I'm going to do as a goroutine is park myself, stop myself, change my state to waiting, set the wait reason, detach myself from the operating system thread, and run the scheduler. It's the goroutine marking itself as waiting that calls the scheduler to schedule a new goroutine. So the scheduler is going to find another task and start running that. So what are the reasons why we can wait? If you go to the Go source code, and actually in the bottom right corner I usually put some references to the Go source code, but, well, if you go to that point in the Go source code, you are going to see the wait reasons, and that's the list of all the wait reasons. No more, no less. Those are all the wait reasons. Don't pay too much attention to that; I'm going to summarize it. If you want to take a look, you can go. But the summary is: you have GC reasons, garbage collector reasons, mutex reasons, semaphore reasons, channel reasons, sleep reasons, and other reasons. That's mainly why goroutines wait for something. Okay, from running to syscall, and back to running or runnable again.
Well, the syscall is an interesting part. A syscall is basically calling the operating system to do something, and that can be fast or slow. For some syscalls it's kind of obvious, but for some syscalls it's not so obvious. So whenever you try to execute a syscall, the goroutine is going to detach from the processor, and the runtime detects whether the syscall is slow or fast. If it's a fast syscall, it's going to finish the syscall and go back directly to running. But if the syscall is slow, it's going to just stay in the syscall state and keep the processor detached, so the processor can select another goroutine to execute, and eventually the syscall finishes; whenever it finishes, it's going to move the goroutine to runnable again and then queue it on a processor and all that stuff. The other thing that is interesting is the copystack status. Whenever a goroutine needs to grow the stack, because it needs more space for the function parameters or for the local variables of the function execution, it goes through this process: it's going to move from running to copystack, reserve double the current stack size in memory, copy over all the information from one place to the other and fix the pointers, and then move back from copystack to running again. From waiting to runnable: this is a very interesting case because, again, as I said, goroutines are cooperative. So normally a goroutine is changed from waiting to runnable whenever another goroutine calls goready, whenever another goroutine tells my goroutine that it's ready to keep executing. We are going to see examples of that later. So whenever goready is called, for example, if a goroutine is sending something to a channel and some other goroutine is waiting, it's going to wake up that goroutine, it's going to mark that goroutine as ready.
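The stack growth described above is easy to observe from user code: a goroutine starts with a small stack (about 2 KB), yet deep recursion just works, because the runtime transparently copies the stack to a larger block and fixes the pointers. A minimal sketch:

```go
package main

import "fmt"

// depth recurses n levels. The padding gives each frame some real size,
// so 100,000 levels need far more than the initial ~2 KB stack; the
// runtime grows (copies) the stack transparently as we recurse.
func depth(n int) int {
	if n == 0 {
		return 0
	}
	var pad [64]byte // local data so the frame takes real stack space
	_ = pad
	return 1 + depth(n-1)
}

func main() {
	fmt.Println(depth(100000)) // prints 100000; no stack overflow
}
```

In C, the same recursion on a thread with a fixed 2 KB stack would crash; Go's copystack mechanism is what makes cheap, small initial goroutine stacks practical.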
Then it's going to mark it as ready, add it to the queue of the processor, and try to get a processor to execute it. Another way is when you reactivate a whole list of goroutines. That happens, for example, with the garbage collector: goroutines are waiting for a garbage collector phase, for the mark phase, and when that's finished, it's going to wake up a list of goroutines. Another case is when it turns out there's no need to wait. Imagine that you say, hey, I'm going to wait for X, but X is already fulfilled, so I'm going to go back to runnable directly. Another one is when the scheduler is trying to find a goroutine to execute: it checks the netpoller, and the netpoller sometimes has goroutines that in theory are waiting, but the data is already there or the job is already done. So it just moves them from waiting to runnable. Okay, from running to preempted, to waiting or runnable. Go has a preemptive runtime, and what it does is, when a goroutine has been executing for too much time, the system monitor detects that and sends a signal to the operating system thread that is executing the goroutine. That signal marks the goroutine as preempted, so it's moved from running to preempted, and eventually the goroutine itself finds the time to move from preempted to waiting. And after the next garbage collector scan, it's going to move from waiting to runnable again. So again, this is the whole life cycle: runnable, running, syscall, waiting, preempted, copystack. Now all these states should be more obvious, more clear, to everybody. There are some other similar, parallel states related to the garbage collector. This is again a bit of a simplification, but this is in general the kind of state that you have in goroutines.
So let's see some examples. Imagine that you have a channel and you want to send data to that channel. The channel is not buffered, and there's nobody waiting on it. So I try to send the data and, because nobody's waiting, I'm going to need to wait. So the goroutine is going to park itself, add itself to a list of goroutines that lives inside the struct of the channel, and wait there. So it's there, it's waiting, and eventually another goroutine comes to read from the channel. What that one is going to do is go there, read the data directly from the memory of the other goroutine, and then, when it has the data, call goready on that goroutine, saying this goroutine is now ready to keep going. That ends in this state, and eventually the scheduler is going to select that goroutine to be run and everything keeps going. Yeah, this is the whole picture: trying to send the data, waiting inside the channel, getting the data from the other side, and the other goroutine is the one responsible for waking up the goroutine that was waiting in the channel. Let's see another example. Let's talk about wait groups. For example, I can create a wait group and add three, in this case. This is a very common pattern. And then I just spawn three goroutines that are going to do certain work in parallel. Then I'm going to wait at that point; maybe one goroutine is already running, maybe not, doesn't matter. So I call Wait, so I'm now waiting. The goroutines keep going; maybe some of them are executing, maybe some of them have finished already, doesn't matter. Some of them finish. And the last one is going to call Done, the last Done, and it's going to see that, hey, the wait group counter is already zero, so I'm going to call ready on the list of goroutines that are waiting for this wait group.
So that ends up with this situation, where there's a runnable goroutine that is eventually going to be scheduled by the scheduler, and that's it. Again, the whole picture here. Okay, let's talk about how goroutines die. A goroutine normally dies when it finishes its work. Basically, whenever there's nothing else to execute, it's going to change its state to dead, set most of the data to the zero value, disconnect the goroutine from the M, add the goroutine, the dead goroutine, to the free list of the processor, and call the scheduler to find anything else to execute. So, yeah, the whole life of the goroutine. Again, this is the scenario where the goroutines are doing things. If I did my job correctly, you should now understand this better. And this should sound familiar too. So let me finish with a couple of things. One of them is that I want to thank Laura Pareja, who did all the illustrations for this talk. All the illustrations are Creative Commons BY, and you can see the webpage of Laura Pareja, so you can reuse them and do whatever you want with all those images. Also, I have a gift from Mattermost, the company that I work for. I have some stickers. I'm going to leave the stickers out there. So feel free to pick as many as you want. I also have some pins too, but they are probably going to fly. Another thing is what is missing. I haven't talked about certain things because, for the sake of simplicity, I tried to avoid getting too much into the details. One of the things that I removed from the equation, and it has a lot to do with goroutines, is the garbage collector.
I ignored the garbage collector entirely, and it's a big chunk of how the scheduler interacts and how goroutines move from one state to another and all that stuff. The netpoller: I mentioned the netpoller, but I haven't gone into the details. There are very good talks about the garbage collector and the netpoller out there. Also cgo: cgo has certain implications for goroutines too, but I have ignored them. The mark assist phase, which is kind of important, is a relevant part of what a goroutine does, assisting the garbage collector in the mark phase. There's the system monitor that I have mentioned, but I haven't talked in detail about that. But again, there are talks about the system monitor out there. One of the main references is the Go source code. I totally recommend you go there and explore it. There's an illustrated tale of the Go runtime scheduler, a YouTube video. There's a series of posts from Ardan Labs about the Go scheduler. It's from 2018, so it's not super up to date, but the general patterns are still there. Well, I hope that after this talk you have a better understanding of how goroutines work, how goroutines change from one state to another and all that stuff. But, what is more important to me, I want to encourage you to go there and explore the Go source code, because it's a great source of information. There's a lot of super cool stuff there. And, depending on the combination of your passion for learning and your taste in movies, this can be more exciting than a zombie movie. So thank you. If you want to keep in touch with me, feel free to contact me. And the other thing: if you want to have a follow-up session, asking questions or whatever, feel free to join there. Thank you.
You're already running my code in production: My simple journey to becoming a Go contributor.
And I would now like to introduce our next speaker to you. I would say he needs no introduction, because you're already running his code. But he might need an introduction. This is a new... Sorry, could I have some silence in the room, please? Thank you. You're already running his code, and he's telling a story which I am, for some reason, after running the Go devroom for five years, still curious about, because I haven't contributed to the Go project yet. And he has. I'm jealous of him. So, round of applause for a Go contributor. Thank you. Can you hear me okay? Is the microphone in a good spot? Yep. So, quick show of hands. Who here is a Go contributor? Has contributed to the standard library, the compiler? I see one, two, three, four shows of hands, five. Who here would like to be, like Maartje, who would like to be a Go contributor? There's a lot more hands. Who of you who wants to be is afraid to become a Go contributor? Who thinks it's intimidating or complicated, or you just don't know enough about goroutine scheduling or something like that? Okay. This talk is for you folks who have your hands up right now. So, my goals for the talk... Oh, first my agenda. I'm going to talk about goals, who I am, and I'm going to tell my story of how I became a Go contributor and talk a little bit about how you can too. So that's my goal. My goals today: tell my story, and ultimately to encourage you to be less intimidated about becoming a Go contributor. My non-goals are to be exhaustive. I'm not going to do a deep dive into how proposals work or how Gerrit works or all the technical stuff. And I'm not going to show you a lot of code. There's a little bit of code, but you don't even have to be a Go developer to understand the code I'm going to show you. Who am I? I'm a Go contributor, technically. I'm a fractional Gopher. Fractional CTOs are all the rage these days. I'm not that. I'm a fractional Gopher. I work for different clients.
You can hire me if you want some help with your Go. I also do Go mentoring and career mentoring; hire me. I'm also the co-organizer of the Go Amsterdam meetup. And I'm a podcast host and YouTuber. I hate that word, but I put videos on YouTube, so I am one. So some of you may know me through the Cup o' Go podcast. Any listeners here in the room today? All right, a couple. I hope there's a lot more after this. I have stickers, by the way. They'll be over there. If you like Brewster, our little gopher mascot for the Cup o' Go podcast, get a sticker for your laptop a little bit later. So how did I become a contributor? Well, first I needed an idea. Long ago, I wrote this public open source library called Kivik. It's for CouchDB. It's sort of like database/sql, but for CouchDB, if you want to do document store stuff. And I had a request from a user of my library. They were trying to send a regular expression as JSON to CouchDB, because it's a JSON store, and it was just submitting an empty object rather than meaningful data. So they said, hey, could you make your library do this thing the right way and send a regular expression string? It's like, that's a really great request, but I don't feel like it's my library's responsibility to do that. That should go in the standard library. So I created a request, which we'll talk about. But first, here's the problem they were explaining. So here's the code. I think this is the only slide in the presentation with code. So imagine you have this regular expression, foo question mark, so it would match fo or foo, pretty simple. And you call json.Marshal on something that contains that. This is the output you would get: not very useful. This is the output the user of my library wanted, and what I thought made sense. So I created a proposal on the Go issue tracker on GitHub. Now, this is a great point to mention that there is a process, a proposal process. Some of you are probably familiar.
If you listen to the Go podcast I just mentioned, Cup o' Go, we talk about proposals fairly frequently: oh, this one's in the accept phase, or this one's been declined, or this one is possibly accepted, and so on. That all relates to this. Now, this is a very simple proposal, so it didn't need a design doc, which some do; generics had a design doc, actually multiple design docs in the end. So this is a very simple proposal. I mean, I just explained it to you. I don't need a design doc to explain what I just explained on the last slide. So this didn't need that. So I just created a little... you can see there, that's the entire issue there, right? That's what I wanted. I showed the code that I just showed you. I showed the current behavior, the expected behavior, and a little bit of conversation about my reasoning. And so that happened in 2021, May 13, if I can read that correctly. And then that kicked off this proposal process, or a truncated, miniature version of it anyway. So we had some discussion. One of the first comments came from Daniel Martí, who said, this would also be useful for this other thing, and tagged Joe Tsai, who was working on another issue that it would be relevant to. I don't know this person's name, I didn't look it up, but they said, losing the options feels like a deal-breaker. What that was referring to: there are actually two flags you can put on a regular expression in the Go library. You can say it's a POSIX regular expression, and you can say whether it's longest-match. So there are two Boolean flags you can set on a regular expression, and those are not expressed when you call the .String() method on the regular expression. So those flags would be lost. And so this person said that feels like a deal-breaker.
And there were some other comments too, but ultimately Russ Cox came in and said, on June 9, so almost two months later, that it looks like this is probably going to be declined, based on the fact that it would be a lossy expression of the regular expression. That was sad. Not really sad, because this isn't a feature I wanted; I just was kind of excited to see a feature I proposed, you know, get through the process. And then Roger Peppe, I think is his name, came in and said, I think it would be fine if we went ahead and did this. You know, just use the equivalent of String, it's already lossy, why don't we just go with that, and so on; he gave his reasoning. And so, just a month later now, we're into July 2021, Russ says: so this is the current idea, we're going to have marshal and unmarshal do exactly the same thing that String does, blah, blah, blah. And then it looks like it's likely accept now. So, cool. Happy about that. Fingers crossed, let's see if it really becomes accepted. A week later, no change in consensus, so it became accepted, yay. So who's going to do the work? Sadly, just having your proposal accepted in Go doesn't mean it's done; someone has to actually do the work. Now, this isn't a lot of work. In fact, Russ said, even before it was accepted, I'll do the implementation and see if I come up with anything surprising. I don't know if he ever did; if he did, he never mentioned it on the issue tracker. If I ever have the chance to interview him, I'm going to ask him: did you ever do that thing? So in January, six months after it was accepted, I said I'm interested in working on this, and nobody really responded, except somebody gave me a heart, and I felt good about that, but... And then three months later, four months later, Joe Tsai says, hey, are you going to do this, Russ? I could actually use it now. And crickets from Russ. He's a busy guy, no shame on him. So more waiting ensues.
So I decided I was going to go ahead and do it. I don't remember exactly when, we'll see the dates in a few moments, but I decided to go ahead and write the code. Now, this is a good time to talk about the contribution guide. This is probably the part that, at least I felt, was the scariest part of contributing to Go, so I'm not going to talk in detail about it, but the TL;DR is: you have to create a Google account. You probably already have one, unless you're intentional about not having one for security or ethical reasons or whatever. If you want to contribute to Go, you have to have one, I'm sorry to say. So if you're avoiding that bandwagon for ethical reasons, maybe Go contribution isn't for you. I understand your reasons, but you have to have a Google account. You have to set up a Gerrit account with that Google account. What's Gerrit? Who's used Gerrit, I'm curious? Who doesn't even know what the word means? All right. So think of GitHub, except an open source version of GitHub from 1992. That's what it looks like. But it's really powerful in ways that I can't really comprehend or explain, because I haven't used it that much. But it's not bad, so don't be afraid of it. They use Gerrit for that. Now, actually, I lied a little bit. They do use Gerrit for that, but you can do this through GitHub also. I've not done that process, but if you're really afraid of Gerrit and you can't read the documentation and follow the instructions, you can also create a GitHub pull request. So that's an option open to you if you're really afraid of this. But don't be, it's not that bad. So, 11 months later, I finally wrote the code. I created my Google account and all that stuff, and the Gerrit account, and I wrote the code. This is my change, this is what I added to the standard library, plus some tests and a couple of other metadata things. It's like 20 lines of code if you count the comments and the blank lines. That's not a big deal.
I was really hurt though that Marsha didn't mention this in the Go 1.21 changes, because I know it just barely flew under your radar. I actually got this yesterday evening, you're going to find it. Yes, yes, okay. And you knew I was going to talk about it, so why mention it twice? So really simple, I guess I lied, there's two slides of code, but it calls the String method and turns it into a byte slice, that's all it does to marshal your regular expression, and then to unmarshal it, it does the same thing in reverse with an extra error check. Super simple code. So I pushed that up, and then, this is a screenshot of Gerrit by the way, like I said, 1992 GitHub, that's what it looks like. And I got some code review. And then it was time for some humility. I kind of pride myself on writing tests, and writing good tests. I usually write them before my code, commit first, make sure the tests pass. I failed to, I mean, I tested my code, but I didn't run the entire test suite, which takes 10 minutes or something on my machine, and it was failing. The reason it was failing is because I failed to add some metadata about public API changes. It wasn't a big deal, it was easy to fix, but it made me feel a little bit silly for not running the test suite before I asked other people to waste their time reading my code. I hadn't learned the project style. This was my original commit message. I don't see anything particularly wrong with it, but it wasn't the style that they wanted. They wanted something much shorter; they didn't want a long paragraph explaining. They felt, I say they, Ian felt that "add these functions" was enough, I didn't need a paragraph of explanation, so I followed his style guide and ended up with something shorter. The tests: he wanted some changes in the tests. I called t.Fatal, but it was in a for loop, so if one test failed, the other tests wouldn't run, so he wanted me to do t.Error instead.
Cool, makes sense. And then godoc recently, I don't know how recently, recently in my mind because I used it before this, but they recently added these square brackets to do documentation hyperlinks and stuff, and I didn't do that, so I needed to add that. Yeah, little nitpicky things, plus I forgot to run the tests. That was kind of it. That was my thing. It got merged on March 27, so just over two years after the original, was that right? Just under two years after the original issue was opened, it got merged, and then it was in Go 1.21, yay! My name's not there. It's in Git somewhere, but whatever. It still felt good. So I think I just breezed through that. I have a lot of time here. We have time for questions. I mean, I have a few more slides, but this is the point of my talk, really. What does it take to become a Go contributor, and what does it not take? So the non-requirements: you don't need mad hacker skills. I mean, you saw the simplicity of that code I wrote. Now I've written much more complicated code, at least I like to think so, but not at the Go project. I've spoken to people who contribute to Go just by adding those square brackets to the Go doc. That's cool. That helps. I mean, that's valuable, right? It gives me hyperlinks when I go to the Go doc for that package. I can click on a hyperlink now. That's useful. So if that's what you want to do to contribute to Go, that's all you need to do. All you need to know is how to type square brackets. You don't need to know about zombie goroutines and whatnot. You don't need deep Go knowledge. What do you need to be a Go contributor? I think the main thing I learned from this process is that for me to be a Go contributor, I need patience. I mean, a lot of that wall clock time was me not doing anything. If I had been trying to push the process forward, I probably could have truncated that down to maybe three or four months.
But that's a long time to get 20 lines of code implemented, I think. I mean, relative to what I do at my day job anyway, where I do that 15 times a day or something. So it takes patience. But if you're willing to put in the time, you can become a Go contributor. It takes a little humility, especially when it comes to learning a new project's style. I mean, I don't know if you've contributed to other open source projects before. I have. Each one has its own flavor, its own style. You need to learn that. You need to be willing to learn that and, yeah, just put your ego aside. That's not the point. The point is to do something useful according to the community's guidelines, and to learn some new things. Yeah, I think I'll breeze through this. Those of you who raised your hand that you were intimidated earlier, do any of you feel less intimidated now? One, two, three. Okay, my talk was a success. That was my goal. If you're interested in learning other ways, one of my goals is to make Go less scary for people. That's part of the Cup of Go podcast idea, where we talk about the weekly Go news. It's part of my YouTube channel, Boldly Go, if you want to watch that. If you have questions, reach out. You can find me at boldlygo.tech. That's my Go-themed website. You can find all my socials and contact details there. Any questions? I don't know. We can do questions, right? We have enough time for questions. We have time, so yeah. I will hand you the microphone. If you're too far away, you'll have to shout and he has to repeat. Hi, thanks for your talk. I'm a Cup of Go listener. Wonderful, thanks. Shout out to the podcast. My question is, are there other ways to become a Go contributor, like, you know, good first issues and stuff on GitHub? Other ways, other than introducing a proposal? Yes, definitely. You can find one of the existing bug fixes or proposals. So this was the first code I wrote that made it into Go.
I had participated in the sense of filing bug reports and stuff like that previously that others then fixed. And many that had been just closed as invalid or something, that happens too. There's that humility part that comes in. But yes, there are a lot of open issues. There are some tagged as good first issues. You can find typo fixes. Typo, I actually have an open CL, that's the Gerrit terminology for a PR, open for a documentation fix in a package in the standard library. Things like that. There's a lot of things you can do. You don't need to file either a bug report or a feature request. You can find one that's already there. Hello, thank you for your talk. Yeah. I've tried several times during Hacktoberfest to do some contribution. And the big part of it was to find an easy issue to begin with. Do you have some tips for that? Not really. I mean, I believe there's a tag on GitHub on the issue tracker for good first issue or needs help. I know there's a needs-help tag. You could look at that. I think there's a good-first-issue tag too, but I might be confusing it with a different project. One thing that is understandable but frustrating to me about the Go project is it's not really designed for newcomers. That's one thing I hope to help change with this, to help at least lower the mental barrier that you might have individually to doing this. But I say it's understandable because they're trying to build a professional-quality, high-quality language and standard library. And that requires one set of skills and guardrails around the project. Being open to all new contributors is a different one and requires a very different type of open source management. So Go, I think, mostly intentionally has moved to that side of high barrier to entry, for reasonably good reasons. But that is frustrating for this question. How do you find something you can do to contribute? I don't really have a great answer except look through the issue tracker and find something. In front.
Become a FOSDEM organizer, get fitness for free. Yeah, hello. So you had this requirement at the beginning and this sparked the problem and the solution in the library. But what did you do in the meanwhile? Because this took three years, right? So what did I do about this in the two years between the issue being filed and the fix landing? I didn't do anything, honestly. The person using the library, I'm assuming they had their own workaround. I mean, there are workarounds for this sort of thing. Suppose that this already exists. Now you're using Go 1.22, but you want a different version of the regular expression to be presented. You have the same problem, right? So you probably would end up wrapping the regexp.Regexp type and putting your own custom marshaler on it, for example. That's probably what they were doing. I do that with time.Time or time.Duration fairly frequently, depending on the application needs. So that's probably what I would do. Are there any differences in the main Go code versus, like, the Go x modules? Yeah, that's a good question. I haven't contributed to the x stuff, so I don't have experience to go on from there. I think it's pretty much the same process though. I do think the requirements for inclusion in the x packages are lower. So if you want to add, say, something to x slices, you want to add, I don't know, some ridiculous thing there, there's a lower barrier to entry to get in there because it's considered experimental. So if you want to do it in the standard library, they have a high standard. Like, we want to make sure that we're never going to regret doing this. In the experimental packages they're like, yeah, we don't know if it's a good idea, but let's try it. So in that sense it's easier, a lower barrier to entry. Any last questions? Okay. I think this can mean one thing, but it was an amazing talk with not too many questions left. Round of applause everyone.
Efficient Integration Testing in Go: A Case Study on Dapr
Actually, an ex-coworker of mine, we worked together on cert-manager, if I recall correctly. We wrote a lot of tests there, not enough tests in my opinion, but there are never enough tests in the world. And I have to be honest, when I code and I'm not being paid for it, I do not write tests. But Josh does, and that's why he's going to talk to us about how to make your testing life way, way better. Is that possible, Josh? Thank you very much. Cheers, Marsha. Good. So, hi everyone. Yeah, hopefully I can change Marsha's opinion on that during this talk. So I'm Josh. I work on the project Dapr, which is an open source project. I'm going to talk about that in a second. And the talk is about efficient integration testing in Go, a case study on Dapr. I work on Dapr, I'm coming from a Dapr perspective, but the idea here is that the kind of learnings that we have made through Dapr, you can bring to your own project and make your project better, more efficient and correct and these kinds of things. So this is the agenda. Like I say, we'll talk about testing, we'll talk about Dapr a bit, the framework that I wrote for the integration testing in Dapr, and then some learnings and some gotchas and some things you can pick up for your own project. Cool. So testing. Why do we test software? Fundamentally, why do we test software? So the first thing is to prove the correctness of software. That's the main point, right? We write software, software is complex. Code is hardly readable by humans and we make mistakes, and the more software you write, the harder it gets to keep track of the state, and yeah, we all write bugs. But it's not necessarily the case that this is the only reason why we write tests. If it was, we would write our tests once and then, once they started passing, we would delete the test file. So writing tests just for correctness is not the only reason. Another reason is to put guardrails in place.
Implementation code changes over time, and so assertions you want to make about your code behaving in a certain way, you want to keep into the future. So yeah, that's why we don't want to delete our test files after we've written them. The next thing is ensuring compatibility with external APIs. So if you have external services, I come from a Kubernetes world and things like this. So the Kubernetes version changes, they break stuff all the time. You want to make sure that your code still behaves in the expected way when external things change. Verifying performance: performance testing, these kinds of things, making sure that not only is your code correct, but it also does things in a timely manner, or uses fewer resources than your limit, or things like this. And finally, and what we'll follow in this talk, is that hopefully if you write a testing framework which is usable by humans and is efficient and is easy to read and use, then that testing framework itself can be used as your kind of sandbox for how you can test or do experiments in your software and test features and things like this. So a really good testing framework is really important to improve your developer experience, and the final thing is increasing developer velocity, which is largely a big thing that we care about, right? We want to write features. So, test types. If you open a textbook on testing, you'll probably see this graph somewhere. It's a very classic visualization of the different types of testing. At the bottom you have unit tests; that tests your logic code, and it tests that a variable equals another variable, really exciting stuff. And then at the very top you have things like your performance testing and things like this. And in the middle section you have your end-to-end and integration testing.
The difference between these two things is semantic and depends what project you're talking about and who you're asking and things like this. Again, I'm coming from a Dapr perspective. End-to-end tests for us are deploying to Kubernetes and running it in a Kubernetes environment and invoking it there. Integration testing is running binaries locally, typically, and that's where the difference lies. Integration testing ideally runs quicker than your end-to-end testing. Kubernetes is slow software, so it's a pain in the ass to write loads of tests as end-to-end tests. So yeah, the talk's about integration testing. What are integration tests? Fundamentally, this is what an integration test is, and this is true for a lot of testing as well. But fundamentally, you're setting up your system to be in a particular state that you care about. You're then asserting a particular behavior, and then you are cleaning up that system state. That is it. That is fundamentally what you're doing. As an example, again, going back to Dapr, this might be executing one of the Dapr services, then doing a curl, in this case, to make sure that the health endpoint returns a 200 or something like this, and then finally killing that process at the end. That's it. That's what an integration test is. Keep talking about Dapr. That's interesting. That's not Dapr. Okay. Try that again. What is Dapr? Not that. Dapr is an open source project, all written in Go. The tagline, the marketing headline, is that it is a set of APIs and SDKs and frameworks to make a developer more productive in a cloud-native environment. What that means fundamentally is that the project will expose a bunch of APIs for you that you typically need to write some business logic that does something interesting. They have a list of APIs here, so it gives you some state management, PubSub, Actors, and then you can back those APIs by whatever implementation you want.
It might have different concerns, so the infra team might manage your Postgres, and then to you as a developer, you're just exposed to the state store API. That's fundamentally what Dapr is. What is important for this talk is that Dapr is a complex software system. We have multiple services running, and they're all doing different things and all talking to each other. Maybe sometimes it's mTLS, sometimes it's not. Sometimes gRPC, sometimes HTTP. We have a whole set of APIs. We have a bunch of backing services that we support, whether it be Postgres or some Google stuff, whatever it might be. The point here is that this is a very complex software system, which all software turns into over a longer period of time. When your software system becomes this complicated spaghetti mess, it becomes a house of cards. It will happen, and anyone who's worked on a larger project will have first-hand experience: you make a small change, and that will have unexpected consequences or behaviors in a completely, seemingly unrelated part of the system. Your software turns into a house of cards, you don't want to make changes, and again you slow the developer velocity that we were talking about. How do we resolve this? Tests. We use integration testing. When I joined the project, there weren't any integration tests, so it was kind of a blank slate. I could start from the very beginning with how I wanted our integration tests to look. I came in with this set of design decisions. First of all, I wanted Go as the sole dependency for these integration tests. I hate Makefiles. I think make is terrible, and I don't want that anywhere near having to invoke tests. The next thing was that to run the tests, I wanted to just run go test. It would be worse to need something like Python or, God forbid, having to run Docker or something like this. Just run my tests.
We want them to be as close to what developers are doing in their day-to-day, because remember it's a community project, we have lots of contributors. Having Go as the sole dependency was really important. They need to be quick. time.Sleep is banned; we'll talk about that later. Tests need to be portable. We basically get that for free with Go, because Go is very good in that it can be compiled to different architectures and operating systems and things like this, and it's designed from a portability perspective from the start, so we get that for free. It needs to be extensible. We have lots of contributors. People need to be able to write code for the integration tests as they contribute to the project, and it needs to be readable, for similar reasons. That was the design philosophy, the design decisions I came into the project with, or into the integration tests with. Next was actually writing the framework itself. If we go back to our original diagram of "fundamentally, this is what an integration test is", the first thing we can do is turn this into Go stuff. We create what I call the process, which is the thing that is managing the setup and also the cleanup, and then we have the test case, which is doing the assertions that we want on that particular test scenario. We can then put in some kind of wrapper stuff, so this is actually executable, and there's an entry point into this kind of test case. And then we're in Go, so it probably makes sense to make these interfaces. So this is what a test case is fundamentally. If it can do a setup and it can do a run, it will be executable in the integration test suite. This is what an integration test looks like in Dapr. It's a single self-contained file, we do some registration of the test, and we'll talk about that in a second, and then we do a setup and then we do a run.
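The two interfaces described above might look like this (the names and signatures here are illustrative; the real Dapr framework passes a *testing.T around):

```go
package main

import "fmt"

// Process owns dependency setup and cleanup (e.g. exec'ing a binary).
type Process interface {
	Run()
	Cleanup()
}

// Case owns the assertions for one test scenario.
type Case interface {
	Setup() Process
	Run()
}

// noop is a trivial Process implementation to show the flow.
type noop struct{}

func (noop) Run()     { fmt.Println("process: run") }
func (noop) Cleanup() { fmt.Println("process: cleanup") }

// base is a self-contained test case.
type base struct{ proc Process }

func (b *base) Setup() Process { b.proc = noop{}; return b.proc }
func (b *base) Run()           { fmt.Println("case: assert") }

func main() {
	// This is the order a suite runner would drive things in.
	c := &base{}
	p := c.Setup()
	p.Run()
	c.Run()
	p.Cleanup()
}
```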
You can see here in my setup that I'm creating a process, which is going to do the setup and the cleanup, and then the run bit is where I'm going to do the actual assertions. Talking about the process part, the bit that's responsible for the dependency creation and cleanup: again, similar story, it's an interface, it does a run, and it does a cleanup. Really simple, and that's the point, it needs to be simple. We'll talk in a second about why this is a great thing. This is what a process would look like. This is kind of a no-op example, not super important to read the whole thing. The whole idea is it's, again, a self-contained package. We have the New, which creates the thing with a bunch of options, using the functional options style here, which isn't necessarily people's favorite, but it made sense in this particular case. The struct style versus the functional style is a bit of a hot topic. Yeah, it has a run and then it has a cleanup further down. I know, very abstract, but it's obviously very important to get your interfaces correct because you're going to live with these forever. Cool. We have a framework run. The thing that I wanted to point out here is we do a process run here, and then you can see that we're using the Go test Cleanup function, which is amazing because it puts things on a stack. When you create your dependencies, whether these be binaries or whatever else we're using in our processes, it will clean them up in reverse order. You have that stack, which is the natural order for things to be executed and then cleaned up in. Cool. We have all our test cases defined. They're running various processes. Again, they might be executing binaries, writing to files, things like this. We do our assertions and then we do our cleanups. These get put into test cases and then we have some kind of suite runner that executes these tests. That's what it looks like.
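t.Cleanup's stack behavior is what gives that reverse-order teardown; a minimal illustration of the same LIFO idea, using a plain slice instead of testing.T:

```go
package main

import "fmt"

// cleanups mimics t.Cleanup: functions are pushed onto a stack and run
// in reverse (LIFO) order, so the last dependency created is the first
// one torn down.
type cleanups struct{ fns []func() }

func (c *cleanups) add(fn func()) { c.fns = append(c.fns, fn) }

func (c *cleanups) run() {
	for i := len(c.fns) - 1; i >= 0; i-- {
		c.fns[i]()
	}
}

func main() {
	var c cleanups
	fmt.Println("start sentry")
	c.add(func() { fmt.Println("stop sentry") })
	fmt.Println("start daprd")
	c.add(func() { fmt.Println("stop daprd") })
	// Teardown happens in the opposite order of creation:
	// daprd stops before the sentry it depends on.
	c.run()
}
```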
It's a for loop over a set of tests and it executes them. Simple stuff. The next thing is: how does the integration suite runner know about these tests? What we need is a case registry, which is just a very fancy way of saying that we have a global variable that has a slice of test cases. What is important here, and I wanted to point out, was that it was a design decision that our test cases, as I mentioned before, should be self-isolated in single files. I think as a developer, when you're reading test cases and things like this and you're having to go backwards and forwards into various places to even follow what the test is doing, that's not good practice and it's confusing. Again, you can run into these problems. In order to eliminate that, we went for the style of having an init function, which does the registration to that global variable, and then using the blank import style to import our init functions up into the top-level registry. The next thing is naming, which is always hard. I think there's a thing where developers generally don't respect testing code as much as they should. They care a lot about their implementation code and making it look pretty and performant and things like this, but they don't necessarily respect their testing code as much. This leads on to the kind of mess that people don't want to add to because it's difficult to read. Having respect for your test code is really important. Similarly, naming is generally really important. Go has good standards on how you should name things, i.e. meaning should be derived through context. If you have an HTTP package, don't call your thing HTTPServer, call it Server. It should be hierarchical. Similarly, derive meaning through context and the package path; describe your thing. Less is more. Go is not an IDE language. It's a good language. You don't need to have really long names. Just be very specific. No underscores, things like this.
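The init-function registration described above condenses into one file like this (in Dapr's layout the case would live in its own package, pulled in via a blank import; the names here are made up):

```go
package main

import "fmt"

// Case is the shape of a registered test case.
type Case interface{ Name() string }

// registry is the global slice the suite runner iterates over.
var registry []Case

func register(c Case) { registry = append(registry, c) }

// base is a self-contained test case. In a multi-package layout it
// would be registered by a blank import such as
//   _ "example.com/tests/integration/suite/daprd"
type base struct{}

func (base) Name() string { return "base" }

// init runs at import time, which is why blank imports are enough to
// populate the registry.
func init() { register(base{}) }

func main() {
	for _, c := range registry {
		fmt.Println(c.Name()) // base
	}
}
```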
The benefit of then treating our test cases as this package hierarchy, with very meaningful, purposeful names, is that we can do some reflect magic that gets us a lot of benefits. So when I showed before that we're doing this kind of suite test case registration, when we are registering a test or when we're pulling out all the tests, you don't need to read the code, but basically what we're doing is using reflect to name the test after its package path plus that struct name. So before, our thing was called base, so it pulls out the package path of where that base test file is, plus the struct name itself. So in this particular case, this test would be named something like tests/integration/daprd/foo, plus base. Why is this a cool thing to do? Because it means we can start doing regex searches over our tests. So you can imagine, for example, if I'm writing a feature for Dapr or trying to fix a bug, if I'm working on maybe the actors subsystem or something like this, or placement, I can, in another terminal, have my integration tests running and I can just do a regex search on all the tests that are in the project for related things. So yeah, being very specific about your naming means that you can search through them and run all the relevant tests. Again: being quick, developer focus, good UX. Yeah, that's how you do regexes in Go: a for loop, and then you filter out all the test names that don't match the regex. Here's another example: I'm working on Sentry-related things or mTLS-related things, I want to run all the Sentry tests, I can just give it a query. The next is processes. So these are the two bits down here, the dependency setup and the cleanup. We've been talking a lot about the different services in Dapr, so these are obviously using exec; we're exec'ing processes on the computer, using the exec package. What we've decided to do is follow the UNIX philosophy of running these processes, as in: do one thing and do one thing really well.
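The regex filtering over test names described above is only a few lines; the path-style names below are made up for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// filter keeps only the test names matching the given regex pattern.
func filter(names []string, pattern string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, n := range names {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{
		"tests/integration/daprd/foo.base",
		"tests/integration/sentry/mtls.rotate",
		"tests/integration/placement/quorum.base",
	}
	// Run only the sentry/mTLS-related tests.
	fmt.Println(filter(names, "sentry|mtls"))
}
```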
So the exec process is really good at exec'ing a binary on the computer. You can then wrap that process in another, more meaningful one, again being intentional about naming, which has a bit more context about how that binary should be run. So for example, this Sentry process has all the context: it knows the CLI flags and things like this, gives it sane defaults, exposes the options in a human-readable way in order to run that binary. And then, as I mentioned before, Dapr has lots of different services, it's a complex software system, but following this UNIX philosophy you can do this wrapping in your processes to make more meaningful, higher-level naming and interfaces for your developer. So I can talk about a Kubernetes process and it's very easy as a developer in my test suite to say "run Kubernetes", whatever that might mean; under the hood that's actually a mocked Kubernetes API server, which is actually an HTTP server, yada yada yada. So yeah, having this kind of wrapped process is an elegant way to handle that. Here's an example of another one, so there's an operator service, we're doing some log line stuff in here, some daprd stuff, but these are very high-order concepts of dependencies that we're creating, and these are all wrapped going down. Process binaries: so I mentioned before that we want Go as the sole dependency, and Go is a good language and it's got a very good build caching system. What that means is that in our integration testing itself, we're building the binaries in the test, so one of the first things it's going to do is build all the binaries that are in the project; that's the code that's doing that. It's then going to write them to a deterministic static file location, and what that means is that every time I invoke the test it's going to run that go build, but because of Go's build cache magic it's not going to take any time at all, so I can completely retry my go test and it will just be quick.
The other nice thing about this is that if I change my implementation code and just write go test in my integration tests, it's going to pull in all the changes that I've just made to the code, right, because it is building from source every time. So that's a neat thing with Go. Piping. So software writes things to logs, and these can typically be very noisy if you're running lots and lots and lots of tests, and this is going to take up a lot of disk space potentially, it's going to write a lot of things to the screen, and it makes it impossible to read the test output. If you've got oodles, like a gigabyte of test logs, and you're trying to find one test failure and read the logs of what happened, it becomes impossible. So write these things to in-memory buffers, and then you can do things like only write the in-memory log buffer to the screen if the test actually fails, which is the only time you actually care about what the log line is. Then obviously, because it's in memory, you've got a reference to it, you've got a pointer to it, you can then do some assertions on what was in the log lines and test log lines that way. Go is quite good for this; you can create pipes and things like this. All very idiomatic Go stuff that you're familiar with. Asserting eventually. So all software is eventually consistent, fundamentally. Computers aren't infinitely quick; the speed of light is as fast as they could possibly go, and they're not even that fast. Fundamentally, for a computer to do a thing takes some time. And so we have to wait a period of time to observe some behavior when we put the system into a particular state. Just fundamentally, we have to do that. However, you should never use time.Sleep to do this, which I think is very, it's always there and it's very easy to just be like, time.Sleep three seconds or something like this, but you should never do it. time.Sleep is the nuclear option.
So to illustrate this: if a single test sleeps for five seconds, and Dapr CI, for example, runs four times a day, not counting PRs or anything like this, just standardly runs four times a day, this equates to about two hours of idle CPU time a year. If we scale that up, Dapr currently has 133 integration tests; if just 10% of those tests sleep for five seconds, then that equates to more than an entire day a year of idle CPU. Which is crazy, right? This is bad for the polar bears, bad for the environment, and it's bad for our developers too. If your tests take ages to run, no one will want to run them and no one will want to add to them. So being very intentional about the speed of your tests is very important. The way to do this is to do polling, basically. So in Go there's the testify package, which is really, really good and I highly recommend using it, and it has this Eventually function. All of the functions in this package are super sane and I highly recommend using them. And yeah, computers are faster than you think they are. Stuff does not take as long as you think it does; HTTP calls over localhost take milliseconds. So even though I've got a polling interval of 100 milliseconds here, maybe that is even too slow itself. So yeah, computers are faster than you think they are. Be more aggressive with your assertions and your polling. Cleanup. Tests should never leak. Having data leaking from one test case to another will invalidate your assertions, just fundamentally. So it's very important that you clean up state in between test case runs. And it's also the case that if you're not cleaning up the state in your project in between runs, then you're going to reduce the resources available to each test case and it's going to slow down your tests.
So I'm thinking, you know, if you've got database tests or something like this, you're writing a bunch of stuff to disk. What if you fill up the disk? You're not running any more tests, right? So cleanup is important. To list through some of the things that could be interesting for you to use: use temporary directories, using the testing package; that's really good. t.Cleanup, we spoke about that earlier; that's doing the stack thing, so it does things in reverse order. Use port zero: the kernel will give you a free port if you ask for port zero. Use in-memory stuff. Don't use the internet. Don't pass stop channels into functions. And use context. Context is one of the best things in Go; always use context. Very quickly, to talk about operating systems: operating systems are very weird. Use build tags where you need different files and things like this depending on the operating system. Work through the pain. Use if statements. Yeah, and then finally, being productive. So building a culture of integration tests in a distributed team is always a work in progress. No one necessarily really likes writing tests; however, if you write a really good test framework, that's going to encourage people to add to them. And if they're quick, they're easy to use, then yeah. A good testing framework should be usable as a development sandbox. What I mean by that is, if you're writing a new feature, your testing framework should be your first port of call for wanting to use that new feature. Tests are great because they're in code, which means they're reproducible, and I can execute them and I can make changes over time. And it's very clear what's going on. Just running binaries in your terminal and things like this is fine, but having it in test code makes the reproducibility better. And then again, the higher-order your processes are, the more productive your team will be.
So your developers shouldn't be describing things like "exec this binary". They should always be describing things at a higher order. Again, it decreases the amount of code that you have to write in your test cases and makes them more approachable for contributors. And that's me. Thank you, everyone. APPLAUSE I saved some time for you, but I don't know if you want some questions or to leave it there. I can fit in one quick question; otherwise you can just grab them in the hallway. Ah, no question there. Let me run over one second. Keep holding your hand up. So, quickly, why did you make your own sort of test filtering system instead of using Go's test filtering system? And secondly, why didn't you use an event hub instead of polling? Say the first one again, sorry. Why didn't you...
Effortless Bug Hunting with Differential Fuzzing
Our next speaker is Maché, and he's going to talk to us about hunting bugs. And how do we hunt bugs? We do that by sending a bunch of random input into our programs, or, more scientifically, fuzzing. Round of applause. All right, welcome. So in the spirit of testing, let's talk about fuzzing. I'm Maché, I'm an offensive security engineer; in the past I've been a platform engineer and a software engineer, and I sail, climb and play board games. So what we'll talk about: we'll talk about fuzzing, about differential fuzzing and how it differs from plain fuzzing, and we'll talk about bugs in the standard library and how you can actually find and fix those bugs using fuzzing. And at the end we'll talk about fuzzing in continuous integration pipelines. What we'll not talk about is how fuzzing works under the hood. There are excellent resources out there about fuzzing engines and so on; I'll link to them at the end, but this talk is not about that. Why should you care? So there's the OSS-Fuzz project; who's familiar with this? Cool. It's a platform that gives open source projects compute resources to run fuzz tests continuously. There are about 1,000 projects in there, and within six or seven years it has found 10,000 vulnerabilities and 36,000 bugs. If you do the simple math, that's 10 vulnerabilities per project and 36 bugs per project. So this seems like an effort that's worth investing in. So let's assume we have a simple function: it accepts a string, mutates it, and gives you a transformed string back. It transforms each letter of the alphabet into the character thirteen positions later, so you get n for a, o for b, p for c, and so on and so forth. In your regular testing, you come up with some inputs, you put those inputs into the function, and then you make assertions on whether the output is correct.
You're probably all familiar with this; you can run it using the standard Go CLI. With fuzzing, the situation changes a little bit. Instead of the inputs you devised, you have random input; you put it into the function and make some assertions. It looks very similar, it's supported in Go since Go 1.18, and you can also run it using the CLI. You see some boilerplate around the test, but in the middle you basically have the unit test that you had before. I intentionally left the assertion blank, because how do you write the assertion if you don't know the input, right? If you run the fuzz test, you'll see that it tries hundreds of thousands of inputs per second in this instance, and it runs indefinitely, so you can run it as long as you want. As you've seen, it's easy to create fuzz tests if you have unit tests in place, so there is really no reason not to do it. One thing we haven't talked about is that it's not all magic. You still have to instruct the fuzzing engine so it can come up with inputs that make sense for your test. You can actually reuse the inputs from your unit tests and add them to what's called the corpus, and that tells the fuzzing engine to come up with things that are similar, but quite random as well. So add the inputs from your unit tests; that helps a lot. I've mentioned that assertions might be pretty tricky to come up with if you don't really know what the input is. What you commonly see in fuzz tests is that they don't make any assertions at all; the engine just checks whether the function crashed, which is still very useful, because it tells you about things like out-of-bounds accesses. But you should, and can, assert invariants, things that don't change. In our case, for instance, the ROT13 function has the property that you can call it twice and get the input back.
And this holds true for anything that has an inverse. So if you have an inverse function, you can make a simple assertion like this: you call ROT13 of ROT13 and expect the input back. If they don't agree, the test fails. Some examples that are commonly used are encoders and decoders, marshallers and unmarshallers: you just decode the encoded thing and you should get the input back. There's other stuff too; for instance, if you compute a SHA-256 sum, you always expect it to return 32 bytes. But there is another technique. What if you had two implementations of ROT13, right? Something that you wrote, and then something else. That's called differential fuzzing: basically, you take a random input, put it through two implementations, and see if they disagree. So think for a moment: where can we get those second implementations from? The first source is refactoring. Let's say you have your function, but it's unreadable, or maybe it's not performant enough, so you're refactoring the code for whatever reason. You can keep your old implementation to the side and use it as a reference while you refactor the code. The second source is performance. You might maintain two implementations in the first place: the first one follows the specification very closely but might be inefficient, while the second one is heavily optimized but might not be quite as readable; maybe it has some shared buffers or whatever. The third option, which is really interesting, is that there may be a C library that does a similar thing, and you can use cgo to call it. That's what we'll explore further.
So back in January last year, I saw an interesting bug report in a Go newsletter, where there was an issue with the HTML tokenizer, the part of the x/net library that does HTML tokenization. The thing was that it was incorrectly interpreting comments, and this led to an XSS attack. So what does an HTML tokenizer do? It takes HTML input and gives you HTML tokens. In this example, you have a paragraph with text inside and an anchor afterwards; you'll get a start tag of p, a text token with the text inside, an end tag of p, and then a start tag of a. This is a very well-defined process and there is an HTML specification for it. It's very detailed and easy to follow, and it's a state machine, which will become important later. If you look at the Go implementation, though, it's not a state machine, and it's not quite easy to follow, at least for me. So I thought: if there was one bug report for it, there might be other bugs lurking around. So let's use that tokenizer a bit and write a function that gives you a list of tokens, because the API works in a streaming way: we'll just call the tokenizer, collect all the tokens, and return the tokens it generates. If we start with plain fuzzing, we supply some HTML input to the corpus and call the tokenize function without making any assertions. And there are no results; it doesn't crash, which is to be expected from a standard library, or the experimental part of it. So let's try differential fuzzing, right? We'll have the tokenizer function that we wrote and some alternative implementation of it, and if they don't agree, we fail. And as you can imagine, because the C ecosystem is very mature, there probably is a C library that does the same thing.
So in this case I found Lexbor, which is a web browser engine built as a software library. It has no external dependencies and a permissive license. It sounds about perfect for what we want to achieve. Don't look at this slide too closely; it's basically implementing the tokenize function that we wrote with the x/net/html tokenizer, but using Lexbor. It's actually a lot more complicated than that, but it'll be good enough for our tests. So we call our tokenize and the Lexbor tokenize, do some equality checks, and if they differ, we fail the test. And it found something. There is some weird-looking, malformed HTML, and Lexbor says there's an a tag in there, but the x/net/html library says there's nothing in there. So let's transform this a bit and see what the browser thinks. So we have this disagreement: could it be a security issue? What if we made trust decisions based on the tokenizer? Imagine you accept HTML user input on your website and you decide whether the stuff people input is safe to display or not. And by the way, you really shouldn't do this, but we'll have an isSafe function that returns a boolean for whether it's safe or not, and we'll just look at the tokens we get and only allow strong tags and text tokens, nothing else. The isSafe method thinks that the thing we got from the fuzzing is safe, because it thinks there's nothing in it. But the browser says otherwise. When you look at the documentation, though, there is a security considerations section for the HTML tokenizer, and it says that care should be taken, especially with regard to untrusted inputs, and that if your use case requires well-formed HTML, the parser should be used rather than the tokenizer. So let's implement this using the parser, right? I won't go into detail, but we use the parser here.
That's also in the same library. The thing is, the parser also thinks this is safe, and the reason is that it uses the tokenizer underneath, so it doesn't really differentiate between the two. So we still get the XSS. So we have two things: the first is that the documentation could be improved, because it's unclear and steers you in the wrong direction; and second, there is a bug in the tokenizer. So I thought, right, if there was a vulnerability report in the VRP program for the original finding, I'll do the same thing. So I submitted a VRP report. There was some back and forth; they closed my ticket, I told them to reopen it, they reopened it. And the result was a documentation update, which is cool. It now says that in security contexts, if trust decisions are made, the input must be re-serialized, for instance using Render or Token.String. What they are saying is that instead of writing an isSafe function that returns a boolean, you should actually transform the input and reconstruct it in a way that basically sanitizes it: transform the string. And there are two ways to do this: one is to use the Token.String function, where you loop over the tokens and reconstruct the input, and the other is Render, when you use the parser. A few months pass, and there is a commit to the library, and they fix the actual bug: handle equals signs before attributes, quoting the spec. So now if you call the isSafe function, it returns false. That's pretty cool. But let's run the fuzzer again. And you get something that is very similar and behaves the same way. So I thought, all right, I have this fuzzer. It's not pretty, and it's in no state to join the standard test suite, but we can use it to learn the code base and iterate: fix the problem, run the fuzzer again, and so on.
So I prepared the patch, and you've seen Gerrit on screen today already. It has passed code review, but as Jonathan mentioned, you need a lot of patience; it's been stuck in "ready to submit" for about three months, I think. So it still hasn't reached master, but it's close. But when you run the fuzzer again, there are no more findings. So the takeaway is that fuzzing is very effective, and differential fuzzing helps you write correct code. Let's talk about what makes a good fuzzing candidate. We've used it on parsers, which are pretty complex code. You can use it on encoders and decoders, marshallers, and any complex code that can be unit tested, basically. But running those tests in CI is kind of problematic, at least in my experience, because the tooling is not really mature enough yet. The go test fuzzing invocation can only run a single fuzz test, so people have been doing a lot of hacks, like grepping the fuzz code trying to find the fuzz targets, sleeping, some pretty hacky bash scripts, for instance. There is also a very cool project called ClusterFuzzLite; it's a subset of OSS-Fuzz that you can run in your CI. But we found some problems with it. First, it has problems with extracting the failing inputs: if you have a byte array, for instance, it doesn't translate one-to-one to what the actual input is, because you have to apply some of your own transformations over it. And it's inconvenient to run locally. So we built Go-CIFuzz. It's a lightweight wrapper around go test's fuzzing that supports multiple fuzz targets and lets you extract failing inputs. If you want to give it a try, there is a link here. And it's basically plug-and-play.
You can use it to run fuzz tests as part of your pull request workflow, or you can run it on a schedule, during the night or whenever you want. All right, so that's basically it. If you want to say hello, there is my email address and my handle. I also wrote a blog post that goes into more detail about this actual finding. And there are some references: if you want to start fuzzing, there is an excellent introduction in the Go documentation; there's also a good article on Wikipedia about how it works under the hood; and there are links to ClusterFuzzLite, Go-CIFuzz, and the blog post. A pretty interesting entry in this list is the second one: there was a recent paper from Google where they use AI to generate the fuzz tests, so maybe you won't really need to write them, and AI will be able to do it for you. All right, if there are any questions, happy to answer. All right. Any questions? We still have some time. In the front. That's nice. Okay. "How many minutes do you run the fuzzer in CI? Because this is important, right? It costs money." That's true. So it depends on the workflow. For a pull request, you really don't want people waiting, so we run it for five to six minutes. In our experience that's enough time to catch the edge-case bugs that are quite common. But you can run it indefinitely during the night; it depends on how much money you want to spend on your CI runs. All right. Any other questions? Can you keep your hands up so I can get to the right row and pass the mic along. "Have you tried fuzzing with only random strings, or also with a combination of valid tokens in a different order?" Could you repeat, please? "From what I got from the slide, if I'm not wrong, you were inputting the data...
You were putting in random strings, right?" Okay, so how it really works is that you provide a starting corpus, think of your unit test inputs, and then the fuzzing engine takes those inputs and applies transformations to them. So every time you'll get a slightly different input. It won't be completely different, just slightly mutated. If you look at the findings here, for instance, it outputs valid, or almost valid, HTML. It reached this conclusion based on coverage data: it also looks at test coverage, so when it runs the fuzz tests, it captures which branches of code have been covered and tries to reach the ones that haven't been covered yet. So it's an iterative process where it applies transformations to the inputs. Right, there's another one. "How does the engine know which part of the corpus it may change and which not, so that it doesn't just input random strings like I could get from the random package?" Could you repeat the beginning of the question? "Sure. You give the fuzzing engine a set of example strings. How does it know which part of those it may change, so that it doesn't just put in random things?" Okay, I don't know the exact details, but I think it makes a change and then looks at the coverage data: it discovers which branches the change reached, notes the interesting inputs, and then tries those. So if the coverage increases, it will try more transformations similar to the one it made. Yeah, one more. "What kind of coverage metric is it?" The question is what kind of coverage metric it is. I'm not so sure, but I think it's branch-coverage based. If you run the fuzz tests with some verbose flags, you will see that there are coverage bits, and I think it tells you how much coverage there is for a particular input. All right.
There's one more. One second. I can probably just speak up. So the question is: when you run fuzz tests, there is a Go cache folder that captures the inputs already tried, and whether the tool will or can support this. The answer is that it doesn't right now, but it's planned. For those who are unaware, when you run a fuzz test, there is a cache directory that captures all the inputs it has tried, or at least the interesting ones, and when you run it again, it starts from that point. That's really handy, because you don't redo the same or similar work every time; you can start from where you left off. Yeah. Thank you. Yeah, there is one more. "The question is slightly tangential, but you said we provide a starting corpus and then there are transformations on it, which are run against whatever we're testing. Is there a way to optimize the starting corpus to increase the kinds of test cases that are actually generated by the fuzzer? Can the starting corpus be designed to cover as many edge cases as possible?" Okay, so there are several angles to this. There are corpora you can find online, on GitHub for instance, that you can employ in your fuzz tests. Also, when there's a finding, when you run the fuzz test and it finds a failing input, it will add it to the corpus you have in your repo: a directory called testdata will be created in your repository, and the failing input will be captured inside that testdata folder. You should actually commit that folder to your repo, so that every next time you run the fuzz test, it will check for regressions. So yeah, I hope this answers your question. Any more? Thank you. "Are there ways to customize the kinds of transformations that are applied by the fuzzer?" Not in Go native fuzz tests. There are other tools that were in use before Go introduced native fuzzing.
There is libFuzzer, for instance, which is very commonly used by OSS-Fuzz, and I believe if you use that, you can customize it. But the way native Go fuzz tests work, even though they build on the same kind of machinery, they're not very configurable. They're designed to be good developer-experience-wise and to cover most of the needs you have, but I don't think you can drive the transformations from them. I'm going to end the questions here.
How we almost secured our projects by writing more tests
The careful eye might have noticed something in my schedule: I put a lot of similar subjects together, and because Philip was actually replaced by this speaker, this would have been three hours filled with only tests. But let's continue with this test thing, because tests are important, and many people love them and many people hate them. So Alessio is going to take us away with security by testing. All right, applause. Hello, everybody. Welcome to my talk. Let me give you a little introduction about myself. Who am I? My name is Alessio Griggi. I'm a software engineer at Armo Security, the company behind Kubescape. My full-time job, actually, is to be a cat food opener for my furry friend. But jokes apart, I'm passionate about reading and taking long walks. You can find me on GitHub and Twitter with this account and the following avatar. But let's start the talk. I will give you some introduction, some easy concepts that can help you understand the whole talk better. The first question is: what is code coverage? Code coverage is a metric, a percentage actually, that we can use to understand how much of our source code is covered by tests. Mostly it is used when we write unit tests, but not only for that kind of test. Let's go a bit more in depth on code coverage in Go. It was first introduced in Go 1.2, more or less ten years ago, I guess 2013 if I remember well, with support for unit tests. But the story continued: after more or less ten years, one year ago, the community introduced in Go 1.20 a new kind of coverage support, this time for integration tests. So since last year we can significantly increase the measured coverage percentage in our projects, of course, if we were already doing integration tests.
And yeah, basically in these ten years a lot of things changed. They also implemented some nice tooling to inspect the coverage, rendering the profiles as an HTML page that you can check in your browser. It's really nice to use, really helpful. But let's see another concept that is important for this talk: what is a seccomp profile? First of all, seccomp is a kernel feature, and it helps you block certain syscalls during the execution of a program. You can define a seccomp profile as a kind of rule set: you list all the syscalls that you want to allow or block during the execution of your program. And what else? It is extensively used in the Kubernetes ecosystem. In Docker too, you can attach a seccomp profile when you run a specific pod or container, and the runtime will use this seccomp profile to check whether each syscall is allowed to run. Another important thing is that in Kubernetes, if you enable the SeccompDefault feature flag, you can use the default profile, which blocks a list of deprecated and, let's say, really dangerous syscalls that you should not use during execution. So with the default profile you are more or less quite safe, but it is better if you create your own seccomp profile for the project you are implementing. So the main idea I had was to generate a seccomp profile during the test pipeline, since, if we write a lot of tests, the test environment is probably the best candidate for exercising all the syscalls included in your project, and therefore for extracting the list of syscalls your project is going to execute.
So the idea was to generate the seccomp profile and, in case your project is based on Kubernetes, to create an init container that injects the seccomp profile onto the node, and then use a securityContext with a seccomp profile of type Localhost to attach the profile you just injected. Here is one example: the init container downloads the seccomp profile; in this case it was just a test, but you could provide it as an artifact on GitHub or wherever you want. And the application container can then use the seccomp profile type Localhost, referring to that profile. Okay, this was the first part of the talk. But now let's see how I tried to achieve this goal, that is, how I tried to extract the syscalls from the tests, in this case integration tests and unit tests. Here you can see a kind of execution path of your project: if you run the project, you are going to have this kind of tree, and with code coverage you can understand which part of this tree has been executed. So you can use coverage as a metric for your seccomp profile, to understand which part is missing, since it gives you a percentage. First: extracting the syscalls from the integration tests. Let's say this was the easiest part. With integration tests, you build a binary and provide some script that checks for the expected results, and when you run the binary you built, you can use a tracing tool, for example strace or perf or whatever you prefer, to extract the syscalls executed during the test. This was the first part, but let's look at the other one: extracting this information from the unit tests. First of all, it was a bit more complicated, and I'm going to explain why.
So the reason is that go test compiles and runs the tests all at once. So you cannot just strace go test, because you would catch all the syscalls that are not related to the function you want to trace. Remember that we are talking about unit tests: we are testing only a specific unit, only specific functions, and we want to extract the syscalls executed during those functions. And even if we build the test binary, we cannot simply strace ./test-binary either, because the test binary could include some noise. For example, suppose you have a data file that you want to run against your function: you open this file, take the data, and feed it into your function. When you do that open, strace will also catch it, so it's not really suitable. So my personal solution, let's go one step further, was more or less to split the process into steps. First, we can compile the test binary without running it: you do go test -c followed by the package you want to build. Then, from this binary, you can extract the function name using objdump --syms, so you get the full symbol of the function you want to trace. At this point, let's see my personal solution. I don't know if it's the best one, but it's a solution. The project is called harpoon; you can find it on my GitHub, and it makes use of eBPF. I want to clarify that I'm not an eBPF expert, but after understanding the technology, I tried to use it to solve this problem. The main idea was to define a tracepoint with eBPF that starts tracing the function when a uprobe, previously attached to the function, emits an event.
So the uprobe informs you that the function started executing, and another probe, the uretprobe, emits another event when the function finishes. Another important thing to know is that this project is a PoC; it's not a production-grade project. It's based on gobpf, which is part of the iovisor BCC project. So how does it work? You place the uprobe and the uretprobe inside your ELF binary at the symbol of the function. In this case we have main.doSomething, which is our example function, and the uprobe and the uretprobe will tell you when the function starts and finishes executing. In the meantime, the tracepoint knows when to trace the function: it hooks the sys_enter event, so it traces all the syscalls that are executed during that window. Here's an example: on the right side there's a function that does some simple things, and on the left side you have the result: you have the write, the openat and the other syscalls, and at the end you can also see the read. Okay, so all these things are really nice, and I was really happy to have achieved this result. But at some point I realized that it was not really working, I mean, not every time. After a while I discovered why. But first, let's understand how the uretprobe works, because the problem in this case is with the uretprobe. A uretprobe basically overrides the return address of the probed function with the address of a trampoline, and the trampoline jumps into another function, in this case our eBPF program. But since the Go stack changes dynamically at runtime, due to the garbage collector, when the trampoline function tries to return to the stack it is not able to, at least not all the time, because the stack has moved and the previous address is no longer valid.
So, a possible solution: luckily for us, uprobes can be attached at a specific offset in the ELF binary. So we can simulate a uretprobe that informs us when the function has finished by adding a list of uprobes on the ret instructions of the function. If the function returns in three places, we place a uprobe on those three ret instructions, and we basically simulate the uretprobe instead of using a real one. As for future improvements: when I realized this solution could work, I checked the iovisor gobpf library, but it was impossible to attach uprobes at a given offset. That was my fault, actually, because this library is deprecated. So a future improvement is to move to another library first: for example the eBPF library from Cilium, or the one from Aqua Security, and so on. With those we will be able to attach the uprobes at specific offsets, and so place them on the ret instructions of the function. Here are some references I found on the internet that helped me understand the problem better and how to solve it, and some special thanks to people who really helped me during this experiment. So thank you for your attention. While I have your attention, or you're sleeping, depending, I have two announcements. One: read the whiteboard, I'm not repeating this again: lightning talks, we still have available slots. And the second one: this room is not possible without volunteers. This is a 110% volunteer conference; I get no money, I even have to pay for my own dinner tonight. Oh no, that's sponsored now, thank you. But I want to make a special shout-out to my dear co-organizer Eva; I'm proud of her. Eva is a student in computer science, more specifically in application development. If you have internship positions at your company, you can hire her for free.
Dependency injection: a different way to structure a project
I'm going to talk about using Go. What is important when you use Go is dependency management. You cannot write a program these days without depending on something. Dylan is a co-worker of mine, we work on Cilium together. He's going to talk about everything to do with dependency management. So a round of applause. Hey everyone, thanks for coming. So, dependency injection. Before we start, a little introduction — already got one, technically. My name is Dylan Reimerink. I work at Isovalent on the foundations and loader team. So we're responsible for basically doing a lot of the changes that I'm going to talk about within the Cilium project. You can find my GitHub there, in case you find anything interesting. I don't know, you never know. So before we dive into dependency injection — why, how it works, what it is for those who don't know — a little journey about why I'm here, why I'm talking about this and how I got here. So what is Cilium? Cilium is a CNI. So, long talk short: we use eBPF to do networking, we secure it and we make sure that you can see what's going on. And that actually involves a lot of components. So this is our nice visual about a lot of the different features, and we actually have way more that wouldn't even fit on the slide. You can imagine that with a lot of components we get quite a large application. I checked, and we are currently the third most active project in the CNCF. We have, I think — again, last time I checked, this was like a month ago — 650,000 lines of code that are not the vendor directory. So we have a big code base, a lot of things that happen, which also means that we have a lot of dependencies. So to illustrate that, I picked one of the features that I personally worked a lot on, which is called the L2 announcer. And it's a little feature in Cilium that basically makes sure that certain IP addresses are reachable on the local network via ARP. 
So both gratuitous ARP and responding. So we have the big L2 announcer block there, which contains most of the business logic, but all of the other things are dependencies. At the very top, in white, are our external dependencies. So we create ports, we get environments, configuration, standard output, et cetera. Those are connected to our infrastructure layer. So our infrastructure layer does all of the things that are really common in the application: logging, metrics, configuration, da, da, da, da. And then we get to the orange layer, which is our control plane. And that's where our abstract business logic happens. So this business logic gets Go objects, and it also writes Go objects. It's all pure Go world, and it mostly doesn't have to care about all the other things. And then we go down to our data path, where the translation happens from this perfect abstract world into the real world, which in turn often means, in our case, that we talk to the kernel via netlink, eBPF maps, raw sockets, et cetera. So for this one, for my big component to be able to work, I basically need all of this to exist, at least in production. So I went back to 1.11, which is before we started working on dependency injection in Cilium, and looked at what initialization looked like at that point. So we have our main program. We call into Cobra — this is common, hopefully. We go into our run function. It starts up three components. It initializes the environment, where we already have 50 components. Then we call something called runDaemon, which has 50 components spread both before and after the new daemon. And then in our newDaemon constructor, we actually create at least 150 components — I started counting, or stopped counting, sorry. So we have a lot of components, but they all have to somehow wire into each other. 
And at some point, the development team decided that we were going for a sort of hub-and-spoke model, because we had so many components. We had this big daemon, which was our hub, and it had pointers to almost all components. And then it's easy: you only have to give the daemon to everything, and then via the daemon you can find every other component. But that becomes a real mess, because when is this pointer nil, when is it not, et cetera. So I started looking into this newDaemon function, like, what is this about? And then you'll see a pattern. You don't have to read everything. So: we initialize this before creating that. We must close this before we open that. This must be done before we start identity allocation. IP cache must be done after the initialization below. This must be read only after this happened. We could discuss these for a while. So at this slide, I'm at sort of the first snippets, at about line 350. And then I basically stopped — I just scrolled down at that point; my point was made. The last reference I found, something like "do this before, do this after", was at line 718. But what is perhaps interesting to note is that this top snippet is basically a sort of defer. It talks about cleanup instead of initialization, which is also a really big thing that we have. So, to summarize the problem that we were facing at this point in development: we have a lot of dependencies, but this is just inherent to the product that we're making — nothing to do about that. What we can do something about, and what is a lot of the source of the pain, are these implicit dependencies. So we have dependencies on global variables, these very big objects, or system state, which requires us to use comments to tell our fellow developers how our dependencies work. So our dependencies are all implicit in this state, which makes things really hard to modify. 
Like when I started and I created a component, it broke CI, it broke everything, and I couldn't figure out why. And it turned out that I had to move it up a few hundred lines in the initialization — or down, in some cases — to make sure that everything I implicitly depended on was there. So it's really hard, and it really destroys confidence. It's also hard to shut this application down, at least correctly. You can kill the application, sure, but then open files are not saved. And if you are running end-to-end tests or anything like that, then you need to make sure that all your resources are cleaned up, so the next time you start you are not blocking other things. So this was really hard, and it made things really hard to test, because if I wanted to test my L2 announcer, I had to recreate all of this additional infrastructure a lot of the time, even if I had interfaces, because some dependencies were still problematic or whatever. So we started looking into solutions, and this led us to dependency injection, for a few reasons. Before I go deeper, for the people that don't know: dependency injection is basically a way where, instead of explicitly initializing your project — so basically having a very big main file — you define your components and you explicitly define what their dependencies are. And then you can have some component — in this case I call it a graph builder, but it's basically the name of the framework that you use — actually initialize that, and you hand off the job of correctly initializing your application to some piece of software. And we know software never has problems or bugs. But in all honesty, this is actually quite a popular pattern in other languages like Java, C#, PHP, but we don't see it that often in Go projects. So the only thing that is required for this to work — or at least work correctly — is that you specify your dependencies explicitly, as arguments to a constructor function. 
So what I would like to introduce to you is the Uber FX library, made and maintained by Uber. Originally developed by Glib, who is now actually a colleague of mine, which is how we got into this library. It's really well battle-tested, and I'm going to show you how it works and what it looks like. But what's important to know is that it is a dependency injection library, and dependency injection libraries might not all work for your use case — this one didn't fully for us. So actually, if you were to look at Cilium today, we use our own custom-flavored framework, built on dig, which is basically the underlying library beneath FX. But if you want to go ahead and try something first, then FX is your starting point. And it was made to solve a lot of the problems we had, not only this initialization issue, but also because we have a lot of binaries in a big monorepo, so it also allows for really good reuse, which is, as far as I understand it, where Uber first started. So to explain this, I first created a very, very small application — normally you wouldn't use dependency injection on such a small application. We just have a simple web server. And this is how I, for example, might write this without dependency injection. So in main, we construct everything, link everything together, call server.Serve, and we're done. So this is nice and short. When we do dependency injection, we have to be a bit more formal. So I defined a new listener, a new logger, and a new server. My listener and logger at this moment don't have any dependencies — I could give them configuration or something else, but that wouldn't fit on the slides. And the server takes both of these and constructs itself. 
So we defined everything and what everything needs, and then at the top left, in our main, we say we create a new FX application, and we provide the listener and the logger, and we invoke the server — because if you recall, the Serve function was the thing that we were interested in, that we called. In practice, the invokes are basically your entry points, and the library will look for all dependencies of that entry point. So you could, for example, create a very big graph and have multiple entry points, or call different entry points depending on, for example, commands in your binary. And then it will only construct and start the dependencies that you need. So it also does a little bit of dead code elimination implicitly. And then you call Run, which actually wouldn't do anything in this example — I'm sorry — because Serve is not called. So this would start and it would construct everything, but nothing extra would actually happen. For that, FX has something called lifecycles, which are really useful. So the last slide talked about construction time — when we construct our graph — and then when we run it, the lifecycle gets invoked. So what we can do is we say, okay, the server is now dependent on a lifecycle. And within the constructor, we tell the lifecycle: I have something that, while I'm alive, I want to do. So I have an on-start and an on-stop hook. And when I start, I want to start a goroutine and serve out whatever I do. And when I stop, I want to shut down — which is something that my initial program didn't even do, a proper shutdown of the HTTP server. So it's a little bit hard to show that in the original example, so I threw together a very small sample that still fit on the slide, which is important here. 
So I have A, B and C, and they basically all depend on each other, so it's a very deep dependency chain. And then I have this print function, which you can decipher later, but basically I call it in every constructor: it both prints at construction time and it prints in the lifecycle hooks, so you can see what happens. And were I to run this program, the output would be something like this. So it says A is constructed, B is constructed, C is constructed, because that's the order in which they're constructed — we have all the dependencies there when we construct. Then the start hooks are called in the exact same order as we constructed them. So if you have dependencies — for example, A opened a file and we need that file to be open because B will start calling things in its lifecycle — we know that the start hook of A is always called before any of its dependents get time to run. And then when we stop the application — we Ctrl-C or something else happens — we shut down. But the nice thing is, we automatically shut down in the exact opposite order, just like you would with defers, but at the application level. And this allows you to do a proper shutdown: write your files away, do everything else. And you also know that, because you depend on everything else, you get the first chance to shut down properly, and no one will call into you after that in their shutdown functions, because they don't have references to you. There's also a nice feature called groups. There are actually quite a few features — I couldn't touch on everything because of time constraints — but this one is nice for a small class of problems. And it's called a group. What you can do is — so I actually use two features here, the fx.In and fx.Out features. 
And it basically allows you to return multiple dependencies from a constructor, or take in multiple dependencies, in a nice way. So I can, for example, have a parameter structure that takes in 20 different dependencies and not have to spell them all out separately in my arguments. And I can also return multiple things. Crucially, in my case, I can specify group names to basically route outputs from one place to another. And in this case, I created a mux. And this mux collects all of the mux handler objects that are there. And I have a foo and a bar, and they both emit their own thing, and they are collected by this mux, which we could give to a server. And the cool thing about this is that you have this once, and you can then add a lot of additional parts to your whole application, and it all collects as an array into this group. There are some caveats — I'll come to that in a bit. So under the hood, how this works, very simplified, is: we have our definitions, and FX and dig use reflection to look at the parameters, and then based on the types it creates a directed acyclic graph. And that graph can then be walked to get the correct ordering. So there is a small bit of magic there, and it's called reflection, but it's not much — it's quite understandable if you actually go dive into how something like this works. And then again, the constructors, the starts and stops, are called in that determined order by the DAG. It also means that you can't have cyclic dependencies — that's a no-no. So it's a good reason to remove those from your code as well. So I would like to share with you, in case you want to try dependency injection, some tips, tricks and lessons we learned, because there is a good way to do this and there are definitely also bad ways to do this. 
So: inject, but in moderation. Not everything has to be a component. For example, math libraries are stateless; there's no reason why you would make one a dependency in this system, because you can just use them — they are pure functions, et cetera. So my rule of thumb is: if it has state, make it a dependency, because then you benefit from all of the state-specific things. But if you have libraries that don't use state, please don't make it harder than it has to be. Also a note on "inject, but in moderation": we saw that doing dependency injection adds a lot of boilerplate, which is worth it in very, very big applications — or even moderate applications, I would say — but it's likely not for your small CLI tool or whatever. So this is really a technique for medium to larger projects. When you do this, pick logical boundaries. We, for example, started out and made 20 cells within the same package, and then no one outside the package actually ended up using those cells, which added massive amounts of complexity and overhead that just weren't necessary. In my experience, using packages as the logical boundaries for these components is the best thing to do, because you can also leverage which types you export: you can provide something and not export that type, for example, and then only export an interface that matches it, or whatever. So that's a really powerful combination. And the last thing to note is that one of the other features that I wasn't able to show you because of time constraints is FX options. FX options are really cool because they allow you to basically take multiple of these components and bundle them under a single variable. So while global variables are big no-nos when doing this, you can still use a global variable in your package to export these constructors. And the nice thing there is you can make a sort of hierarchy. 
So if you have a package hierarchy that's three layers deep, you can basically reflect that, so in your main application you don't have to list 200 constructors all separately. That also really helps with readability — seeing where what is provided, and so on. Provide targeted interfaces. Go idioms still apply: the smaller your interface is, the more powerful it is, and the better you can swap it out. So when I depend on the smallest interfaces I can, it's really easy for me to mock things out in my tests: create a new FX app, only provide the direct dependencies — which are interfaces that I can then mock out — and it makes everything really nice. This is general advice, not specific to dependency injection, but it goes hand in hand: if you have dependency injection and don't do this, then it takes away a lot of the benefits you would otherwise get. It also makes it easy for external components to not rely on internal implementation. So when I provide a component, I always try to provide it as an interface as well. And the last thing, which is more of a trick: if you for example have a struct, that struct can implement multiple interfaces — so instead of having one interface with three methods, I can provide it as three separate interfaces with one method each. And that way, on both the receiving and the sending side of your dependency, you have the smallest possible interface — again, to help with mocking out, but also so that if you don't use certain methods, you don't have to write fake methods that panic if you were to call them, et cetera. I mentioned groups, and they are really powerful, but go easy on them. Groups are really only ever useful if you have multiple parties that are interested in the same list of objects. 
So for example, we have metrics — we have a Prometheus metrics registry which collects all of the metrics to actually use them. But we also have tooling that automatically generates documentation about these metrics, and I can write a very small CLI tool with basically just one component that depends on all metrics that we have defined in our application, and I collect all of them automatically — and for everyone who registers a new metric, it automatically appears in this metrics tool. So it's really great, and the same goes for our configuration and HTTP handlers, where sometimes CLI tools want to interact with the same things. The alternative to using groups is to just use a registry pattern, where you say: I provide a registry, and everyone else — so if I have 20 other components — can depend on that and register themselves during construction time. And the upside of doing that is that, if you use any decent editor, you can follow those traces back: you can always use find-references to see who actually uses what. With groups it's all magic — everything goes into this group and it comes out, but it's not clear; you can't trace that back in the code itself without difficulty. Stay with a static graph when possible. With this FX application you can, in theory, provide or not provide components depending on configuration. We have opted in Cilium to never do this, because it makes it completely impossible to verify that you never have missing dependencies or other problems, like circular references, in certain combinations. The graphs are verified at runtime, so you have to have a good CI to run everything and make sure that it works. 
What you can do instead is use the lifecycle: you always have the objects, but you can choose whether they do or do not subscribe to the lifecycle, and that way you can enable or disable certain logic if you don't want to run it at that time — but always provide it. And that was it. Thank you very much. Thank you. I have time for one question. I see a hand there; I'll quickly come over and hand you the microphone. If you are exiting already, do it quietly please. What made you choose dig and FX instead of, for example, Google Wire, which is more popular? So, like I mentioned, a colleague of mine, Glib, authored it, so we were very quick to jump on it when he suggested using the library. So it's purely advertisement. Thank you. Any questions?
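The registry pattern mentioned above as the traceable alternative to value groups might be sketched like this — names and metric strings are illustrative, not Cilium's actual code:

```go
package main

import "fmt"

type Metric struct{ Name string }

// Registry is provided once; producers call Register during construction.
// Unlike value groups, "find references" on Register shows every producer.
type Registry struct{ metrics []Metric }

func (r *Registry) Register(m Metric) { r.metrics = append(r.metrics, m) }

// Each constructor takes the registry as an explicit dependency and
// registers itself at construction time.
func newHTTPMetrics(r *Registry) { r.Register(Metric{Name: "http_requests_total"}) }
func newBPFMetrics(r *Registry)  { r.Register(Metric{Name: "bpf_map_ops_total"}) }

func main() {
	r := &Registry{}
	newHTTPMetrics(r)
	newBPFMetrics(r)
	// A consumer (Prometheus exporter, docs generator, ...) walks the list.
	for _, m := range r.metrics {
		fmt.Println(m.Name)
	}
}
```

The trade-off is explicitness: every producer names the registry in its signature, which is exactly what makes the dependency traceable in an editor.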
REST in Peace: using generics to remove REST boilerplate
Well, this is a depressing title. RIP. Rest in peace — and I hope REST means, like, RESTful, and not that this is the end of Go, because I kind of like Go. Anyway, this is going to be a very interesting talk. Rodolfo Plaza, thank you. Oh, actually I don't have my notes, so I'm going to wing it, because I had to clone. Anyway, hello everybody, thanks for coming. So I'm going to present a project of mine that I created a few years back now. It's called REST in Peace, and it's about making REST... in peace, oh ho. So in 2021, Ian Lance Taylor and Robert Griesemer gave a talk about how to use generics, once they actually implemented generics in Go. And they were like — basically, I don't know if I have audio. We'll see. No. Anyway, so basically Ian Lance Taylor's final words were: please use generics wisely. So of course, when a figure of authority asks you to use something wisely, what do you do? The total opposite. And people from CrowdStrike, the security company, decided — I don't know if you can read it in the back, but basically they created a channel on their Discord: a creative-usage-of-generics contest. Submit your worst implementation of generics in Go — basically everything that Ian told us not to do. So my contribution to making the world a worse place was, I think, an async/await in Go, because who needs goroutines and channels anyway. Some people even did some try/catch, if you're missing those good things from other languages. I got a plush because, you know, when you do something, the world gives you something back — it's called karma. And just for the record, I listed everything that was attempted: we had monads and stuff like that. Anyway, but out of all of this, I created something that I thought was actually a useful use of generics — maybe not the intended one, because the current implementation is not optimized for this use case. 
But I thought it was nice anyway. So, about me: I'm Tanguy, I'm from France. I've worked 17 years in IT, and I'm also CEO of HTMX. Okay, one person knows about HTMX. Anyway, as you will see from this video — oh, we have sound now — I'm ready to do anything for money. So I'm a freelancer, specialized in Go since 2015; I worked in normal consulting before that. I've worked mostly on classic RESTful APIs, backends, and I've done some blockchains. And I stopped freelancing for about a year to work at Dagger — you should check them out, I think what they're doing is pretty cool. It's CI/CD as code. And I'm also very interested in pushing Go into more areas than just microservices and web backends — GUIs, game engines and stuff like that. And the next talk will be about GUIs, so I'll advise you to check it out. So now, can anyone recognize this? Basically — yeah, thank you — it's the HTTP handler code that everybody writes. You might even have a validation step if you're fancy. We do all of this code where, in the end, we decode the JSON and we encode the JSON for the response. But all of this is just for this one line right here. For just this line, we do all of these gymnastics. So that's a lot of code. And let's say we had another handler to deal with another type in our API. Now we basically copy-paste all of that previous code and change the code here and there, whatever. Anyway, so again, I see a lot and a lot and a lot of duplication. And for me, duplication is just something we should try to avoid as much as possible. There are a few rules about this, like the rule of two or three, which I think is good — but when you create a big API, you have more than two or three copies of that code. So you can have a solution: abstract the handler and create a very unsafe type to basically accept whatever you want in it. 
You can make it work, but then you call your backend, and you have to deal with a lot in the backend: the backend will deal with a lot of type casting everywhere, and it can fail in many places, so you need to do a lot of error handling. And here I put basically what we would do to convert from one type to another and make sure it works — and it's all of this. All of this is — I don't know if I have the thing — yeah. So basically all of this is for two types, for a structure with two fields, A and B. And for all of that, we have all of this boilerplate that basically takes the dynamic code and transforms it into something type-safe that you can actually use in your backend. That's again a lot of code. And then the real backend can be easy once you have the right types, but we had to go all the way through this to be able to just call a simple backend. And by the way, if your backend is just that and you can make money out of it, go for it. So, as I said: a lot of runtime reflection boilerplate to get back to types, and potential reuse of the handler — not so sure about it. So finally we had a solution: thanks to Go 1.18, we have generics. And that's when this idea popped up. So, the pros of generics: we have better type safety, we have better performance than the empty interface — and as a wise person said, the empty interface says nothing. Although for this use case — sorry, mistyped — for my use case, we don't have better performance; there's an article from Vincent Marti that talks about this in depth. But somebody told me it may have been improved since, so maybe that's outdated, I don't know. And in general, it's more readable code for the users, and it allows more don't-repeat-yourself all over your code base. So for example, without generics, we have this: I just want to take the minimum of X and Y. Can anyone tell me what it prints? Okay, not very interactive. Okay. Well, actually it doesn't print. 
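The example the speaker is building up to — math.Min forcing float64 casts, versus a generic Min — might look like this sketch. The Ordered constraint is written out by hand here so the snippet has no external dependency:

```go
package main

import "fmt"

// Ordered mirrors the constraints.Ordered constraint from
// golang.org/x/exp, spelled out to keep this self-contained.
type Ordered interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 | ~uintptr |
		~float32 | ~float64 | ~string
}

// Min works for any ordered type — no float64 casts at call sites.
func Min[T Ordered](x, y T) T {
	if x < y {
		return x
	}
	return y
}

func main() {
	x, y := 3, 5
	// math.Min(x, y) would not even compile here: it takes float64 only,
	// so without generics you'd write int(math.Min(float64(x), float64(y))).
	fmt.Println(Min(x, y))
	fmt.Println(Min("ab", "aa")) // strings are ordered too
}
```

The constraint declaration is the part that "is not that great to read", but it lives in the library once; the call sites stay clean.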
It doesn't even compile, because math.Min accepts only float64. So you just have to cast — and I did it, but I hate it. I do it disgruntledly, whatever — not my native language, sorry. And with generics, we have this function, which is way better. It doesn't look like it, but it's way better. So the library code is not that great to read, that's for sure, but you can get used to it. And if you compare the previous one and this — yeah, one doesn't read better. But the user code — the user code is really way better. You don't have to cast everywhere, so it makes for a better code base. So, what about REST in Peace? Okay, I'm checking. Okay. So, REST in Peace. The idea is to basically use generics to avoid all this HTTP boilerplate that I presented. For example, here we have some user code. We just wrap strings.ToUpper in a function which is, like, an input-output func — I don't even remember the name — but basically it takes a context, takes a type, and returns another type and an error. And as long as your function respects this interface, you're good. You can just wrap it, send it to rip.Handle, indicate the method, and it calls this function. And then you have route options. And then you can just call curl on /uppercase, and it will just put your input into uppercase. Magically, you don't have to handle any HTTP for that. The library code is less readable, I will admit. So we have the input-output func type, which is the function signature that needs to be respected, then the Handle, which takes an input, takes an output, and the method for this route, and then you can just pass it like that. So this was fun to do. It was my first experiment, but I really wanted to go a little bit further, because I do a lot of REST backends, so there are a lot of routes to deal with for resources: we need to create, delete, update them, et cetera. So I wanted to automate that as well. So, REST in Peace. 
So the key concept of REST services is the notion of resource. It's accessible via a URI, you act on the resource URI via HTTP methods — I mean, this is one implementation of REST; normally it doesn't have to be HTTP, but anyway — and the current state is sent back through the same channel, which in this case would be HTTP. So in the user code, I create a user provider — an entity provider. I pass it here, I decide this is the path I want, and here I just take the default route options. And we'll see what that gives us. So this user provider needs to implement this interface: create, get, update, delete, list all. I will update that, because list-all is a little bit too much — I need to handle pagination and stuff like that; it's not there yet. But once you implement that in your code, you can just use your code; you don't have to deal with any HTTP whatsoever. You pass it to this function, and then you have a whole /users route with all the bells and whistles, which gives you all of this: you can create the entity, get the entity, update the entity, delete, list. And I recently added fields. Because you can use PATCH, basically, to update just part of your entity, but the protocol is not defined, so you have to define your own way of doing PATCH, and it's a little bit quirky. So I found somebody talking about a pattern which I liked, which is basically: you take the whole path to the field, and then you can just PUT to and GET from it. And so that's how you update part of your resource. And so you have your entity and the entity provider, and thanks to type inference improvements, you don't even have to pass the types — you can just pass directly, you don't have to put the square brackets with those types. So you pass the URL, the entity provider, and the route options, and here you go. You're good to go. 
What you get: creation of CRUD HTTP endpoints, content negotiation for many encodings — right now we have JSON, XML, Protobuf, MessagePack, HTML, HTML forms. We have automated resource web pages that can edit the resource — right now it's a very basic UI; you'll see this is not my specialty. And a harmonious way of handling common scenarios, because I've worked on many projects, and with duplication, you do it once, and then you forget you need to update all those boilerplates you've written, and then the behavior between all your endpoints is not really coherent. So I think that makes this a good thing. For example, this is the whole implementation of adding a new encoding to this platform — that's the whole code to add the JSON encoding to REST in Peace. I have a facility like the wrap codec; I use json.NewEncoder from the standard library, json.NewDecoder, then I define the MIME types. That's it, you're good to go. Most of the implementations are like that. So: RIP is to HTTP what an ORM is to SQL, to me. But I know how many of you hate ORMs — just to see if... okay, you might hate me as well. Anyway. But seriously, I hope it will help you create services more easily, because I feel the pain of repeating all this code all the time. So here's the QR code — like, subscribe, click the bell icon, something like that. Anyway, here's a demo. Last time I did live coding it was awful, so now I have a video. Amazing. So I just run the server. All the logs that print in yellow are from the server: there is one from the logging handler — the logging middleware — and the other one is from the backend code that we just log for ourselves. So here I just get the list of users. There is only one, called Jean. So here we see the backend. Whoops. No. No. Sorry. Anyway, we'll check on the next one. So now we're going to create a new one named Cam. Are you stopping, please? Thank you. I'm sorry. All right. Okay. Mm-mm. 
Did I check this video? Maybe not. Okay. So maybe it will take a little longer, I'm sorry about that. Yeah, that's karma. Are you serious? Okay. Okay. So, oh yeah, there was also no output: we just saved this new user. So here is the log from the backend, and this is the logging middleware, which is just Apache log style. Then I list again to confirm that we have a new user in our list. Then we get just one user, and we decide to display it as XML, because why not live in the past? On each endpoint you can have multiple encodings, because I do content negotiation: if the browser or whatever client asks for an encoding and I have it, I will give it to you. That's your problem if you want XML anyway. So here we're going to modify the first user and call him Philip instead of Jean. So here he is; check, it's now Philip. Yeah, good. All right. Now I just want a field: I just want the name of this entity, and it just returns it as a JSON string, in green. Now I'm going to... I don't know what I'm doing. Oh yeah, the email address got thrown in the trash, because I did a full PUT on the entity and didn't specify the email address. So I'm going to modify just this field to fix the address: now I do a PUT just on the field, on the email address field. And now I'm going to check again that it's correct, and here now we have a correct email address and a correct name. Then I delete, and then I check that we rightfully deleted, and there is only one user left in the thing. So yeah, that's what you get. Sorry, it switched this as well. Okay. So that's what you get with just this one line of handle entities, plus the whole backend implementation, of course. But yeah, I think it's pretty cool. About route options: that's something I added since I gave this talk at GoLab. Now each route can have its own set of encodings and middlewares.
So that's pretty nice, because before it was like a global state. Not good. Okay. To implement that, we need the entity, which is just the user struct, for example, and we implement those two methods: IDString, which returns the ID as a string, because our ID is an int, and the other way around, to convert from a string back to an int. So if you have a better design, come talk to me, because I'm not very satisfied with this, but it is what it is and it works pretty well. The implementation is quite simple. And then the entity provider: in this example it's just a map, an in-memory map, and I just show you the update here. So this is the only backend you have to write. I put logging in because why not: I get my user from memory, then I just update it, and that's it. That's the code you have to create. With the in-memory map it's about 100 lines of code for all those methods; I did it in SQL, it was 110, something like that. So you really reduce it to just that. So, for the future... oh, I have time, yeah, maybe I will have time for another thing, but let me just finish this. For the future, I would like to do nested resources, but I've heard even Django REST framework doesn't do nested resources, so maybe not. I want to add pagination, and OpenAPI auto-generation, so you could generate a client for your system in whatever language directly. I would love even more HATEOAS, I don't know how to pronounce that, to have links and stuff, so the API is self-discoverable, even more than with OpenAPI. And I would like to improve the API overall. Since last time I added the route options, the fields, I added Protobuf on my way back from Italy, and I would love to use log/slog, plus better handling and customization of the HTML templates, because you will see. And I would love to also generate GUI apps for it directly, so you don't have to bother with that either, of course.
Simple GUI apps. So let me check if I can do this. Yes, yes, no, yes, yes. Okay, so yeah, this is my beautiful HTML GUI skill. We have the user Jean; for example, we decide to rename him, I don't know, Jean-Marc. All right. And we can add a new person, see, a very well designed form from the 90s, as vintage as me. Let's add Marc here. All right. Okay, we get back there, then we have our full list and we can just delete. All this is thanks to htmx, which you should check out, it's pretty cool. Anyway, so this is the thing. I wish we could style those however you want, actually. And then the last demo: I play with my daughters a game called GoCraft, which is a simple implementation of Minecraft in Go. And to bother them, I thought, how about I use my thing, see how usable it is, and just create blocks in the middle of their construction to annoy them. Or I can just delete them, and yeah. So for this, I'm just going to show the code for the last few minutes. I created a block type where the ID is the coordinates, X, Y, Z. So my ID is... maybe, yeah, I guess I can still see. Yeah. Okay. So I split by X... the ID is basically, if I show this, like that. I'm sorry. I just use the coordinates X, Y, Z as the ID, and then I just have to marshal and unmarshal it. And then in the game, I didn't implement get, I just implemented create: I get the coordinates, I create the block in the right format for the game, and then I update the block, a dirt block. So this is really just code about the game; I'm not doing any HTTP in there. And the delete is the same, just code about this specific Go game. That's it. And it works. And yeah. So if you're excited to use it, or want to talk about it, come talk to me; I have a bow tie, you should recognize me. I would love to talk about it. If you have design ideas and stuff, I'm really up for it, because I think we could improve it.
And discuss it, contribute, anything. All right. And I want to thank the Go team for the generics; without that, it wouldn't have been possible. The Go Strasbourg meetup, because they had to suffer through my first iterations of crappy slides. The logo is from a fellow Strasbourg gopher. You, for coming here, and you online, I don't know where the camera is, I guess there. Anyway. And the FOSDEM Go devroom organizers, because you're really, really top. And htmx for the meme, of course. One of us.
Low-code graphical apps with Go top to bottom!
We're going to continue with more creative uses of Go. Most people use Go for microservices, Kubernetes stuff, servers, whatever, but usually not user interfaces. Every year there are a few crazy people who come to talk about some crazy new front-end thingy built in Go, and I personally always like it. So I also invited Andrew this year to talk about low-code graphical interfaces in our favorite language... Python. Go! Thank you very much. So yeah, I'm going to talk to you about low-code graphical applications. On two levels there's not going to be much code on screen; I think I actually have less code than John's description earlier about how to get involved in contributing. However, there are pretty pictures, so hopefully I can keep you engaged that way. So yeah, hi, my name is Andrew. I am a software engineer, I've worked at various startups, I've written a couple of books, and occasionally appeared on podcasts and interviews talking about graphical app development with Go. It is exciting to be here on stage at FOSDEM; I've been coming for decades, having been an open source contributor for years as part of the Enlightenment project, Maven, all sorts of things that potentially predate and certainly stand outside of the Go ecosystem. More recently I started the Fyne project; perhaps a few of you might have heard about it. It's a way to build native graphical applications using Go that are going to work on any device. If you've never heard of it, I'll do a quick recap; if you have, just hold on a second and I'll move on to some new stuff. I've been a Go developer for about two weeks less than I've been working on the Fyne project, because we had an ambition of what we were going to do, and then we figured out what language was going to deliver on those ambitions. Hopefully everybody agrees Go is just a fantastic choice. How did all of that come together, and what are we building on top of it?
My day job is at Fyne Labs, where we're working on products and services that help businesses get more out of the type of technology I'm presenting today. Like I said, the Fyne project started in 2018, and over that time it has aimed to be the simplest way to build native graphical applications. They should look good, they should be easy to build, and they should work absolutely everywhere. Of course, easy to build is relative. We've had great feedback from people who have never coded Go or never built an app before, but there are plenty of people out there who feel that's a little bit overwhelming to learn; they don't want to be a coder, they just want to build stuff. That's why I'm talking about something a little different today: building with absolutely no code at all. But before I do, here's the recap for anybody who's not familiar with Fyne. It's been running now for six years, I can't believe it's been that long, but hey, it's come a long way. We're currently ranked around sixth of all cross-platform graphical user interface toolkits by OSS Insight. That puts us up amongst some other names you might have heard, like Flutter or React Native, and of course I should probably shout out to Wails as well; they're very popular. There are lots of different ways to build in modern toolkits, and actually in Go, so there's variety out there. Last week I was really excited to realise that we have entered the top 1000 GitHub repositories of all time, out of, I don't know, 350 million or something; a long tail perhaps, but it's a little bit of a milestone, very exciting to be celebrating that. We have about eight core contributors; they come and go. This year has seen a lot of new contributors coming in, and as part of the Go community it feels like a really welcoming, inclusive space. We have some channels on the Gophers server, we have a Discord for people to chat, and there are about 2000 people across the different platforms we're discussing on.
But that is enough about the technical side and the project; if you're interested to hear more about it, there is a talk in the graphics devroom tomorrow afternoon at a similar time. I wanted to talk today about not using code to build applications. So I'm going to introduce you to a tool called Fysion. The spelling is just as peculiar as the Fyne project's, but why not? This basic screenshot is not going to reveal too much; I'm going to step through a little bit of what it is capable of, but more how we pulled it together and how it has really been enabled by Go's built-in functionality and what we have been able to build on top of that. This is the screen you might be greeted with if you load the app for the first time; it is going to help you get started building a project. So what did this set out to achieve? There's so much that we could do, and probably I should have thought twice about getting into the space; I think there are 130 to 150 no-code, low-code platforms out there. But if you've ever tried them, they're mostly building websites or web technologies, and if they do produce apps, they might be bundling web apps into native installers, they might be targeting specific platforms, or they might be reliant on technologies that I might refer to as legacy, or certainly without the same awesome modern tooling the Go community has. So we wanted to do something new, something that was truly building native apps. Like I said before, Fyne applications are going to compile and run on any device with a screen; at least that's the ambition, we're about 95% of the way there.
So we're wanting to build native apps, but we also want to make this really easy to get started with, easy to build stuff, so as much graphical editing as possible, as you would expect from a low-code platform. We started with the UI and the theming capabilities, so although the application has got a long way to go, as you might see, there's something to get started with right away. It should always be building on a real code base: if you don't like the front end, or you want to work with a team of developers who just love Go at the low level, you should be able to collaborate with them through the Git repository, for example. The applications should compile for all platforms, but the tool should also run on all platforms. We're making use of our own technology: if you want to build an app for an iPhone, but you want to do it from an Android tablet, that's cool. If you want to use Windows as your development environment but target only mobile devices, that's just grand as well. A little tweak on the bus, because you know the boss was expecting something before you get in in the morning. Of course, being at FOSDEM, everything that I'm showing you today is open source. It's going to remain open at the core, but some day companies will want business add-ons, plug-ins, so we're going to be running this as open core; but like I said, nothing I'm showing today is proprietary or held back. The repositories are evolving and some of them have not landed in the right place yet, but I'll point you in the right direction at the end of the talk. Like I showed you at the beginning, we're going to give a UI that allows people to get started with templates to get their application running really quickly, but you could also build an application completely from scratch if you want, with the building blocks we've provided, on top of a Git repository for managing the source control. So there's plenty to get started with when building your first project.
I kind of don't want to say this. When I started, Go was super easy: you opened a text file, wrote a couple of lines in there, and then you just ran it. I mean, it felt a little bit like a script, but really good solid code. I've opened a few issues upstream with the project team about why modules have made things more difficult to get into. Workspaces are amazing, but it's more metadata. Well, we're going to manage that for you; that's exactly what's going to happen. You tell us what your application is going to be called, and we'll generate all of the metadata and set up the modules for you. The metadata about the UI, about the themes, everything you're editing, is going to be stored in source control as well. So if you decide you want to work, like I said, with somebody who's not using this UI, they can pick up that code and work with it. But we also want people to be able to pick the project back up in the graphical tool after working on the code directly for a period of time. So it's not like those projects where you can really quickly pull together a user interface for an application and then export it: amazing, you've got a React Native app out the other end, nobody can read it, and if you want to start working graphically on it again, you're possibly going to be starting from scratch. Anyway, everything is synchronized through source control onto the file system, so we are working on a real Go project. I did promise something a little bit graphical, so here you have the first slightly better looking screenshot, I think. We're going to be working just now on the theme. We have a pretty crude mock-up of a smartphone device here, a generic one; the cutout is somewhere between a magical island and a place where, let's face it, cameras exist and we don't need marketing about it, but it's there. The UI is going to allow you to see how these applications work on a mobile device, smartphone, tablet, or a standard window, inset inside the application.
It's going to handle the scaling and the alterations you would expect for these different types of devices. But we also need to present in light and dark mode, so you can see a toggle at the top of the color picker on the right hand side. All of this lovely information is just saved directly to JSON. We've used the standard encoding package that Go provides to save it to the wonderfully comprehensive file you see illustrated on the right hand side. That wasn't easy... Go made it super easy, completely built in. But then we needed to load that data into the application you're building. We didn't want to do any weird generation of things, stuff that could get in the way of working on the code like you would in a real code base. So we just store the file there and embed it into the application using go:embed. I hadn't realized how easy it was to work with this; I'm going to call it new functionality, because I work a few iterations behind the cutting edge, since we're trying to support as many devices as possible. To be able to stream this most effectively into your application: a Fyne app can have its settings set to a certain theme, you just call SetTheme. But it doesn't expect a JSON file, it expects some Go code, a struct. So we provided this FromJSON functionality in the theme package. You can see here illustrated how we can provide both light and dark alternative colors for applications. Less well illustrated here is that you can work with fonts, icons, and the different sizes, everything that makes an application feel the way it does, get a certain look. You can imagine how that file might have your brand identity or something stored in it, and you can port that across multiple different applications. Widget editing is the other thing that I feel is actually quite an enabler in a UI like this.
If you're thinking about building out your first graphical app, and you're looking at Fyne and you want to use Go but you're not quite sure how to get started, something like this, just this one screen, could provide you with the graphical editing that helps you understand how things are put together. The functionality in the user interface here maps to the APIs that are available if you're looking at this as a developer. Actually, let me just go back a little bit; I'll show you more later. You can see basically there's a section highlighted on the user interface; we've selected it, and down on the right hand side it gives you the different settings that are available, plus the option to insert more things into your user interface. I feel like I've said a little bit too much about JSON already. The fact is, it's really super helpful. I don't like to read it; I don't know if it's the winner, but I'll agree with the folk who perhaps suggested that XML was a little bit cumbersome in comparison. So we use it again. Actually, it is great that Go not only supports serializing something like a map to JSON, but, because we have a stateful widget toolkit, we're able to serialize the entire state of your application, the way the widgets are positioned, the containers around them and their metadata, streamed directly to a JSON file. Again, illustrated over there. There is also a little blank field on line four for name: a chance to put an identifier on your widget so that you can hook it into code later, because this is a low-code solution. We know we haven't solved all of the problems, and you might want to write a little bit of Go, so you can hook into that through the name, which is going to be exported as a field on the application, which I can show in a little more detail.
As part of the Fyne project we've created a library which started out as a project a little bit like this, but has now shifted focus to helping more applications load and save graphical state. It will also let you find out which widgets are available, so you can iterate through them, and you can, at runtime, create new instances of an object based off some textual representation, or just the ID of the object type you're looking to work with, which, as you can imagine, is pretty helpful if you're trying to generate at runtime a user interface that's normally assembled at compile time. One thing that I find really quite surprising, in fact I don't know how many people have realized this, is that your objects and types in Go in memory can be written out as Go code that reconstructs them, as though they were source code. That's pretty cool. It's like Stringer, but it's GoStringer. Has anybody heard of GoStringer? I'm really curious about that. Right, cool. So hey, that's really interesting: pretty much anything you have in memory can be serialized as the Go code that generates it. You may need to write a little bit of code to make that fully functional yourself, but we built on top of that. That means that every time you save your user interface state, it's not just saving JSON, it's also spitting out the Go code that will generate the application source, so you can be working with developers, but also so you can actually compile and run it. Which moves us on to compiling applications. Now, Go is amazing at cross-platform compilation, portability, building applications for anything, but there are certain extra requirements when it comes to building native graphical applications. Partly they want metadata around them, but partly, people who own certain platforms put licensing restrictions in place and require that you run on their hardware or with certain toolkits present, so there's a little complexity here.
The project I've presented and will illustrate uses local developer tools, so you're never beholden to anything at all. If you've got the tools installed, you can build the application you have created, have it run on the local system, and install it into your applications directory, or the start menu, or whatever the equivalent would be on Windows. For the local system that's really quite straightforward, the tools are there. For cross-compiling, we've had some really great contributions to the Fyne project called fyne-cross, from Luca and Jacob and Cedric as well, so that you can, with that level of simplicity, build for any platform. It pulls down images with all the developer tools installed that you would need. But even then, you still need it running on a Mac to do iOS development, or on a Windows box to ship off to the store. So, I'm not going to say this is proprietary, but if your business is interested in something that just works in the cloud, there's going to be an option here that, good timing, allows you to spin up basically a pipeline in the cloud. It takes the latest version of your code and comes back to you with the native app bundles for store deployment or ad-hoc distribution, and also included in that we have support for over-the-air self-update of application software as well. This little diagram here is something I created a while ago to try and explain to people why platform-agnostic development, building with a toolchain that works on any platform, makes a really big difference. If you think this would help to convince people to use Go more in your organization, there are a couple of postcards over there next to the stickers, and on the other side there are a couple of really sweet doodles which show how coding nirvana can be achieved. Lastly, before I actually show it in action, there's a project out there called FyshOS.
Again, you might get the theme here with the "Fy" starting the name. It's a Linux desktop operating system that is built, from the basic graphical level all the way up, entirely with Fyne and Fyne applications. We're moving to all of the applications being created or editable with Fysion, not just with source code, so that you could very well be running your desktop software and go, I actually think this could be tweaked, something's wrong here, I can improve on it; and you could go and edit the software that you're running, load it in the UI, make some modifications, and then install it right back over the top of the software you were editing before. If that sounds really interesting, well, we're working in that direction; you can head over to fyshos.com to see where things are. There is a beta ISO, stick it in a virtual machine, nothing more, and some of this functionality is not in the version that's there yet, but keep an eye out, because this is all coming very soon to a platform near you. With that, I thought I might just try and show you that this works, and bring up the UI editing an application on my system here. I have the bar of icons on my installed system here, and I see this calculator app. It's nothing special, it's a calculator, it's going to calculate some things. Clearly there are some things in this that could be improved; for some reason I think that's true. Let's actually go ahead and look at how this works. We can edit the calculator application, and it's going to load it in the editor that I showed you. I was demoing this for somebody else immediately before, so it's defaulting to smartphone, apologies for that. Of course we're really working on desktop software, so this is the more familiar button size, text size, that kind of thing. This C button doesn't seem to be quite right; it's very vague, I feel it should be a robust red warning. It is a warning... it should be a danger button.
Let's really indicate that there's a problem likely to happen if you press this. You might be one of those people who thinks that "clear" isn't quite substantial enough, so "all clear", or AC, might be more familiar to you. We could also look at the layout of our application, expanding down here into the containers. If I tap this, this tap here, this is a two-item grid, I think it's this container here. I could do something a little bit bizarre and make those rows... wow. That did make sense from a mathematical point of view, because this is an evenly spaced grid and I just asked it to do something a little bit daft; but actually the columns were just fine, so we can go back in there. This application, obviously, it's just a quick edit inline. I want my app, I want to test it, I want to run this piece of software, so I'll press run. It's going to go and... okay, I forgot to save the file. Sorry, I should have asked if anybody realized what I'd done wrong there and offered a prize, but it's happened to me once before and it will happen to me again, I'm sure. So there you go: it has compiled the application and built it natively on our system. That is a live native application compiled for the current system, but it is just a binary, a single binary as any good Go application would be, running off our hard drive. But actually what I wanted to do was commit my really significant improvement to this calculator app to my operating system, and so I'm going to use this other developer button called install, and it is just going to improve every day of my life now. When I go back to my calculator app over here, I now have a new version of this little piece of software, and I just feel like this has been a big improvement for me. Hopefully you can imagine a lot more possibilities, and you see the next project that you're going to build; I would love to hear about that.
Anyway, let me just... oh, by the way, if you really like building applications, you like markdown and think it's the future of all good things: this slideshow application is a Fyne application called Slydes, with a Y of course, and it's just markdown. Anyhow, sorry, I'm pressing the wrong button, aren't I? There we go. If you would like to learn more about what I have showed you, please do check it out; any feedback you have is welcome. It's early days, but we're looking for people to get involved, beta testing. The app homepage links to everything we're doing: some surveys to let us know what you think is going to be useful, and a sign-up for the beta when it's available. The second link there is actually not connected to that website: we recently completed some user interviews and got some really great feedback about where the opportunities might exist in this area. If you're intrigued, we're running a questionnaire-based follow-up, so the second link there would be a really interesting way to give your feedback. Like I said, this is all open core, and everything so far is fully open sourced under the BSD license; actually, it's dual licensed with GPL as well for the licensing of business add-ons later, but it is all out there with a compatible license. If you would like to see the source code, which I didn't tell you about but honestly is fully available and pretty straightforward, you can go to our YouTube channel. There is a video series called Creating an App Builder, I think; we used to do them weekly and then moved to monthly. There are 11 videos there that take you through almost all of what you've seen demonstrated, and the source code is currently in the tutorials repository, because we're just working on neatening up the first iteration of the actual product I just demonstrated. But the majority of the code, as I said, shown through the videos, is available in the tutorials repository.
Hopefully that's been really interesting. I'd love to take questions now, but also, like I said, there are these little weird things out there; if you're interested in building the future, pick up one of these stickers, slap it onto a laptop, and tell the next person how Go is going to be the best, brightest future for graphical application development. Thank you very much. Did you all just realise we just saw an operating system user interface completely built in Go? Yeah. Wow. I'm shocked.
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
Hello everyone, thanks for coming. I'm Danylo, I've been working on the Turnip driver for three years at Igalia, and I want to give a status update: what we have achieved so far, and what's coming next. Let's start with the new hardware we support. We now support a lot of hardware, and recently we started supporting the 700 series Adreno GPUs. We already merged Adreno 730 and 740 support, and the merge request for the most recent Adreno GPU, the 750, is in review. There are a lot of changes between Adreno generations, mostly for performance reasons: registers changed, and there are many new performance features. For the 700 series we also currently implement only direct rendering and not tile-based rendering. Adreno GPUs are a bit weird because they support two modes: tiled rendering, and direct rendering, which is the same mode desktop GPUs use. Tile-based rendering there is still a work in progress. We also support almost all 600 series GPUs, though there are some variants out there we don't support. There are five sub-generations of the 600 series, and we support all of them, so to add a new variant of a GPU we usually just need to change some registers. As for features and extensions, we now support Vulkan 1.3 and a lot of extensions beyond it. The most interesting one for us was dynamic rendering. It's rather simple for desktop GPUs, because they mostly don't care about render pass boundaries, but for tiled rendering on mobile GPUs it's a big deal: we have to stitch the render passes back together, sometimes even at submission time. It can be really nasty; the code for it is barely readable. And we have all extensions implemented that DXVK, vkd3d-proton and Zink need, so it's great. While we do not claim Vulkan 1.3 conformance, we do regularly run Vulkan CTS. We test a lot of game traces, we test games, but with games it's a bit of a vacuum right now, because there are not a lot of real users out there.
And we don't have a proper CI with game traces, like RADV does. Other big changes we've done are in pipelines. Our GPU has a somewhat unique way of dealing with pipelines, and with all the new pipeline-related extensions we have to rewrite them every time in some way. But thanks to Connor Abbott, our pipelines are healthy. We've done a lot of optimizations in IR3, which is our backend compiler; they add up a lot as time passes. And we've done a lot of work on debug tooling, because we have to reverse engineer the GPU: we deal a lot with unknown registers and unknown instructions, so we have to be able to quickly understand what's going on there. So I want to spend some time on the debug tools we've implemented so far; I gave a more in-depth talk at last XDC, you can find it at this link. So what are our debug tools? We have GPU breadcrumbs, like in Google's Graphics Flight Recorder. We have the ability to replay command streams, and to edit command streams. We can print from GPU memory, we can print from shader assembly in these command streams, and we can debug the reading of undefined state from registers. I'll describe each of these features a bit more in the following slides. Why do we even need our own GPU breadcrumbs? There is already a solution for this at the Vulkan API level: it's called Graphics Flight Recorder, from Google. It can already tell you at which command a hang occurs, but there are two issues with it. It's too coarse, because, for example, the start of a render pass could translate into tens of blits in the worst case, and each of them may hang; so API-level tooling is not great at this. And what really prompted me to implement breadcrumbs in our driver is debugging of unrecoverable hangs: when your computer or board just completely hangs, you cannot do anything, writes to disk don't come through. Graphics Flight Recorder doesn't work with that.
And to make it work, you'd need some new Vulkan extension and so on. It was much easier to deal with in the driver itself, by doing all the writes synchronously. And it worked rather well. But this tool is currently not used that much, due to the tooling I will talk about now. Okay, let's say you cannot even reproduce the bug. Some bugs are random hangs occurring in different parts of a game and so on. So the easy way to reproduce them is just to record all commands submitted to the GPU and then replay them. For most hangs and issues this works great for reproducing them. There are a few caveats: it's necessary to record all buffer objects submitted, and there can be a lot of them for some triple-A game, so it works mostly for one or two frames. And not all issues are reproducible this way; there are some that are too finicky for this. But most of them are reproducible, so it's good enough. Still, it's not enough to just be able to replay the trace and see the hang; you have to have a way to narrow it down. So what we implemented is a simple way to edit the command stream. We can decompile a submission to the GPU into very trivial packets (the packet names are only in comments there, beside some of them). It's really easy to do for probably any GPU, and even in this form it's very powerful, because you can bisect the trace and find the exact command which hangs, even when it's impossible to determine that in any other way. Then you can edit some part of a packet and see if it helps; if it solves the hang, you can deal with it as with ordinary code. What if the issue is inside the shader itself? We can already compile shaders from assembly. So with this replay tool we added the ability to just print some registers from the shader. And the most trivial print is good enough: our print takes temporary registers for the address and so on, and the registers to print.
And prints them. It increments a global counter and writes to global storage, and the replay tool just reads from that and prints the registers. It's trivial, and it was incredibly useful in reverse engineering the hardware: you get a trace from the proprietary driver, you decompile it, you edit the shader to print something, and you see the values and what's going on. It's incredibly useful. And the last tool in our toolbox is a way to debug undefined, stale registers. A lot of issues are due to reading garbage values from registers: some state is not emitted; even games have issues of not emitting some state, and so on. A simple solution, at least for us, was writing garbage values to all the registers and seeing what breaks. And it mostly works. It's not entirely trivial, because there are registers which are written at the start of command buffers and never touched again, and there are registers written in each render pass, like the register sets written by pipelines. So we divided the registers into two categories: the ones that are set at the start of the command buffer, and the ones that should be stomped before each blit and render pass. Again, there are some other caveats, but it helped us quite a lot in debugging various issues when implementing new features, modulo some weird registers we have to leave alone. Okay. Who are the real users of our driver at the moment, where can you actually see it used? At the moment it's emulators on Android. Why? Because proprietary drivers are terrible on Android. Not due to their code, but due to the update policy for proprietary drivers there: they are not updated at all. So users are stuck with terrible, many-years-outdated drivers, with many issues and without necessary extensions. It's bad, it's really bad. And emulators need new features, they need the drivers to work, they push drivers to the limit.
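The trace bisection described a moment ago is an ordinary binary search over command-stream prefixes. The sketch below is illustrative: a replay oracle function stands in for resubmitting a truncated command stream to the GPU and checking whether it hangs, and it assumes a single culprit command.

```python
# Illustrative sketch of bisecting a recorded command stream to find the
# hanging command. `replay_hangs(prefix)` stands in for replaying a
# truncated stream on real hardware and reporting whether it hung.

def bisect_hang(commands, replay_hangs):
    """Return the index of the single command that causes the hang."""
    lo, hi = 0, len(commands)  # invariant: culprit index is in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if replay_hangs(commands[:mid]):
            hi = mid           # culprit is inside this prefix
        else:
            lo = mid           # culprit comes after this prefix
    return lo
```

Each probe is one GPU replay, so finding the bad command among thousands of packets takes only a handful of replays, after which the packet can be edited in place to confirm the diagnosis.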
So, for example, Yuzu is now able to load our driver, Turnip, and use it instead of the proprietary driver, and it works rather well for them. And I believe some other emulators use the same technique to deal with issues in the proprietary driver. Let's see an example. Here is a Zelda game running on Android on an Adreno 650 with our driver. It's running rather well, even though this is a previous generation of Adreno: the FPS is nice, it renders correctly, it's great. The proprietary driver is a bit weird here, to say the least. Maybe it works with the most recent one, but it's hard to tell; the drivers are not updated, and it's hard for users to update them. So there are lots of issues, and they probably don't test with these games. Okay, fair enough, we also don't really test these games. But the developers of at least Yuzu are willing to implement some debug tooling, like recording game traces, to make them easier for us to debug. Because it's not that easy to launch a game without having the Switch itself; it's not legal. Okay, earlier I said that Turnip implements all the features needed by DXVK and VKD3D-Proton. So can we run desktop games? Yes, we can. Here you see an X13s laptop running Cyberpunk. It runs via a lot of layers: you need the FEX emulator to translate x86-64 code into ARM64, you need Wine for Windows compatibility, you need VKD3D-Proton and so on. There are lots of layers. So we mostly test game traces, not the games themselves. We do test games, but mostly traces, because they are easier to deal with. But we will test games more soon. So what is the future for us? We need to support tile-based rendering on the 700 series, because even if it maybe wouldn't give a big performance boost for desktop games, it would lower power consumption and probably help games on Android. Mark Collins, my teammate, is working on it, and I hope we will see it merged soon. It would be great.
And then we need to squeeze out even more performance. There are lots of performance features we still need to implement. So even if we don't reach the proprietary driver's performance, we expect to be somewhere near it. At least we hope so. And in the more distant future we want to implement ray tracing, because the 740 should at least be able to support ray queries, and the 750 could probably support ray tracing pipelines. I hope we implement this someday. And maybe we will be able to implement mesh shaders. That would be cool. Okay, another exciting development, not from us; it's not an Igalia project, but an easy way to run desktop games on Android. There is a work-in-progress project called Cassia. It's worked on by one of my teammates, again Mark Collins, and some other people. It's an amalgamation of Wine, DXVK, VKD3D and FEX on Android, and I hope Turnip will have first-party support there, so it would all be bundled together and work as one. Now, you may say that people are already running desktop games on Android. Here you see some person running Assassin's Creed on their device, and it runs. Yes, that's true. There are probably several projects for this; this one is done with Termux. I'm not sure exactly what it is, but it's an even more unholy amalgamation of projects. It runs, it's really cool, but there are some performance issues, some issues with how all these moving pieces are stuck together. Still, people are running desktop games on Android, and that's super cool. Okay, that's all from me for today, so if you have questions or suggestions... So you said you could use this on Android to replace the proprietary drivers? Yes, you could. So does that need a rooted device or a custom kernel? There are two cases. If you want to replace the proprietary driver for the whole system, you need root.
You cannot change system libraries without root. But if you want to use Turnip with an emulator, and the emulator supports this, it can just load the shared library packaged for it. And Google Play allows emulators to use custom drivers; they asked for it, and Google Play allowed it for this case. And the loaded driver talks to the proprietary kernel driver? Yes, there is a proprietary kernel driver, KGSL, a downstream driver; we have backends for several kernel interfaces. That's right. Anyone else? Could you repeat the question, sorry? How does your implementation interact with the upstream kernel driver for the 7xx series? We develop Mesa for the 700 series on MSM, on the upstream kernel. Not exactly on upstream MSM, because we have some custom changes to make it work; not all of them are upstreamed yet, at least for the 750 GPU. But it will all be upstream, we need it upstreamed, it will be there. The kernel side is not done by us, though, so we don't have much control; other people are working on it. Okay, I guess that's all. Thank you.
Graphics stack updates for Raspberry Pi devices
So, as I said, thank you for attending this talk about the status update of the graphics stack on the Raspberry Pi devices. Thanks also to the organizers of FOSDEM, and especially to the people organizing this devroom, which is great. Let me introduce ourselves: my name is Juan, and with me is my colleague Chema Casanova. We work at Igalia in the graphics team, and we work on the Raspberry Pi graphics stack. So, what is this talk about? It basically covers the changes that happened in the graphics stack since the release of the Raspberry Pi OS Bullseye edition, which was in November 2021, up to the latest version, Bookworm, released several months ago in October 2023. For people who are not familiar with Mesa: we have five Raspberry Pi devices (well, there are more, but they are variations of those). The Raspberry Pi 1, 2 and 3 use a GPU from Broadcom called VideoCore 4, and the corresponding Mesa driver is called VC4. The Raspberry Pi 4 and 5 use the VideoCore 6 and 7, and the driver name changes: V3D for OpenGL ES, and in this case there is also support for Vulkan, with a driver called V3DV. So, what happened? Probably the most exciting thing is the release of the Raspberry Pi 5. Its GPU is an evolution of the Raspberry Pi 4's: the same architecture, but with improvements. It has a higher clock rate, so it's faster; it supports up to eight render targets; it has better support for subgroup operations, which is interesting for Vulkan; and it brings a lot of changes at the instruction level, which allow more compact shaders that run faster. The drawback is that it has a bit fewer registers, so it suffers a bit more register pressure. Support for it is integrated in the V3D and V3DV drivers, and it was submitted for review almost the same day the Raspberry Pi 5 was announced.
It was released in Mesa 23.3, and the kernel side is in the current 6.8, which is required. As I said, this GPU is more or less an evolution of the one in the Raspberry Pi 4, so nowadays the features are more or less the same in terms of driver implementation: it supports OpenGL ES 3.1 and Vulkan 1.2, plus a non-conformant version of desktop OpenGL 3.1; we will get to that in a moment. From the point of view of the Mesa drivers, for the OpenGL driver one of the important things was that we promoted it from OpenGL 2.1 to OpenGL 3.1, with some caveats I'll explain later. I think this is quite important because, in the end, the Raspberry Pi is intended to be used as a desktop PC in most cases, so targeting OpenGL desktop apps is quite interesting. There are some applications that require OpenGL 3.0-something, and now they run on the Raspberry Pi. The upgrade from Bullseye to Bookworm allowed us to expose 35 new extensions across OpenGL and OpenGL ES. As I was saying, the driver is not fully compliant with 3.1 because there are some missing features in the hardware. For instance, this version requires eight render targets; this is fixed in the Raspberry Pi 5, but the Raspberry Pi 4 only supports four. Also, the hardware always does seamless cubemap filtering, while the OpenGL spec requires non-seamless filtering too. And there are some other formats that are not supported. But all in all, these are probably not the most used features, and we support everything else, so from a practical point of view probably any application that uses OpenGL 3.1 will work on the Raspberry Pi. Then, in the Vulkan driver, we moved from Vulkan 1.0 through 1.1 to 1.2, which meant exposing around 80 new extensions if we compare both versions of the driver, from Bullseye to Bookworm. So there are a lot of new extensions; I mentioned, for instance, the extensions dealing with subgroups, which are very interesting for Vulkan.
Extensions dealing with geometry shaders, too. But probably the most important work done was improving performance. When Vulkan 1.0 was released, the target was just having a conformant driver, so we didn't spend any time on making it fast. During this period we worked a lot on making it more performant, specifically in the shader compiler: improving the liveness analysis and adding strategies to make the shaders smaller and faster. The good part is that the shader compiler for the Vulkan driver is actually shared with OpenGL; both drivers use the same compiler, so all the improvements in the shader compiler benefit both Vulkan and OpenGL. Another relevant thing to mention is that Zink, the driver that implements OpenGL on top of Vulkan, now works with the V3DV driver, which means you can use Zink to run OpenGL applications. And related to that, Roman Stratiienko was working on Android support, so now you can run Android on the Raspberry Pi 4 with the Vulkan driver. And now my colleague will continue with the work in the kernel. Okay. Well, continuing with our work on Vulkan on the Raspberry Pi: we needed to implement several features that were not available in the hardware. We had to create what we call CPU jobs, which implement the parts of the behavior for which there is no hardware support in the GPU. So we implemented that in the Vulkan driver, in user space. This mainly affected some queries about performance counters, timestamp queries, and indirect compute shader dispatch. And this caused issues, because when we were submitting the different command buffers to the GPU, we needed to stall the GPU submissions, do the work in user space, and then continue after having the result.
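The stall pattern just described can be illustrated with a tiny model. Everything here is invented for the example (the job labels and the counting), but it captures the cost: in the old user-space scheme, every CPU job has to drain all previously queued GPU work before it can run, while with kernel CPU jobs the whole mixed sequence can be queued at once.

```python
# Illustrative model of why user-space CPU jobs stalled submissions.
# jobs is a sequence of "gpu" / "cpu" items; in the old scheme each
# "cpu" item forces user space to wait for the GPU queue to drain.

def count_stalls(jobs):
    """Count how many times user space must wait on the GPU (old scheme)."""
    stalls = 0
    pending_gpu = 0
    for job in jobs:
        if job == "gpu":
            pending_gpu += 1   # queued asynchronously, no wait
        else:                  # CPU job: must drain the GPU queue first
            if pending_gpu:
                stalls += 1
                pending_gpu = 0
    return stalls
```

With the CPU jobs moved into the kernel, the equivalent count is zero: the kernel interleaves CPU and GPU work on the scheduler side, so user space never blocks mid-submission.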
So one of the improvements that we have just recently landed upstream in the kernel is these kernel CPU jobs. We moved those operations into kernel space, so when the Mesa driver creates a submission, the kernel handles them and we don't stop the submission of GPU jobs. That was quite an interesting improvement in terms of performance, because before this there were a lot of stalls in the submission. Another feature that was quite interesting for users was knowing whether they are really using the GPU when running their applications; it happens to a lot of developers: "I don't know if this is really using the GPU." So we implemented GPU stats. We expose usage stats per process using the standard way of doing it in DRM, and we also expose global stats. This way, if an application just wants to know the global usage of the GPU, it can check that percentage value directly; otherwise you would need to go to every process, check how much GPU each one has used, and sum it all up. Because we use the standard interfaces, you can run applications like gputop, which is really nice because it works across several drivers. For the global stats, as there is no commonly defined interface to expose them, we are currently using sysfs. The hardware lacks some features to provide the stats the way other drivers do (as is the case, for instance, with Intel), so we went with a simple approach: in the DRM scheduler, when we submit a job to the GPU we take a timestamp, and when the job ends we take the finish time. As we are only processing one job on each queue at a time, that gives us the information about how much the GPU was used. So we can show, for example, on the top right of the screen, a graph widget that users can check with the GPU usage.
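The per-job timing approach just described reduces to summing job intervals over a sampling window. The sketch below is illustrative (the function and its window arguments are invented), but the arithmetic is the same: since only one job runs per queue at a time, the busy fraction is the clamped sum of (finish - start) over the window length.

```python
# Sketch of computing GPU utilization from per-job timestamps, as recorded
# by the DRM scheduler (start on submit, finish on completion). Jobs are
# (start, end) pairs in seconds; only one job runs at a time per queue.

def gpu_utilization(jobs, window_start, window_end):
    """Return the busy fraction of the GPU over [window_start, window_end]."""
    busy = 0.0
    for start, end in jobs:
        s = max(start, window_start)   # clamp the job to the window
        e = min(end, window_end)
        if e > s:
            busy += e - s
    return busy / (window_end - window_start)
```

A widget or tool like gputop would sample this periodically and plot the resulting percentage.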
And in the task manager we also have the information about GPU usage. For example, in this screen the main user of the GPU is Chromium, and the second one is the compositor, in our case Wayfire, because it's compositing all the different windows and surfaces we have there. So, those are the highlights of the modifications in the kernel. One of the most important changes we made from Bullseye to Bookworm in Raspberry Pi OS was the change of the default desktop. Previously, in Bullseye, we were running Mutter with the X server on the Raspberry Pi 4 devices, and it was OK. For the previous generations of hardware, Raspberry Pi 1, 2 and 3, we were running the previous desktop we had, which was Openbox with the X server; Mutter was too heavy for that generation of hardware. With the release of Bookworm, the Raspberry Pis that we ship with the official images get a Wayland desktop using Wayfire. For the Raspberry Pi 5 it is the default one; for previous generations we still maintain Openbox with the X server, but I will comment on this in the last part of the talk. So, Wayfire uses OpenGL for doing the composition. It's based on the wlroots backend, and it is quite tied to OpenGL: all the plugins are implemented using the OpenGL API. One of the most important goals in this transition from Bullseye to Bookworm was that the user experience shouldn't change a lot. Simon Long from Raspberry Pi put a lot of effort here: if you don't see the change of the background, it's difficult to figure out the differences between the previous version and the new one; that is Bullseye, and this is Bookworm running. Everything has been rewritten, the panel, the theme, because they are different compositors. And now let's go to the desktop on the previous generations of hardware.
Well, we are still using the X server with Openbox. It's the fallback we have, and it has been the same since Bullseye; we didn't try to switch these devices to Mutter. The main reason for still using this is that we need to use software composition. We use the CPU to render the desktop because of hardware limitations: these devices have a GPU memory limit, 256 megabytes by default, and we don't have control over when the GPU memory, which uses CMA, the Contiguous Memory Allocator, runs out. So the moment you open a new Chromium browser tab that uses CMA memory and we run out, the next application that does an allocation could be the X server, and it can crash, or the compositor. So the solution that has been there all this time is that on these devices we use CPU software composition: Glamor has been off all the time, and there is no hardware acceleration by default. You can enable Glamor and get hardware acceleration, but then you should expect your desktop to crash at any moment; there are a lot of memory-hungry applications, like the browser, that can kill it: open six tabs and the desktop is completely frozen. So during the previous development cycle, on Bullseye, we wanted to make it possible to enable hardware acceleration for individual applications: if you launch glxgears, glxgears should not use llvmpipe but the driver for the hardware. And we managed to do that: we enabled hardware acceleration for applications while still doing software composition for the rest of the desktop. So in case you run out of memory, what is going to crash is just the application; you won't see the X server or the compositor crashing, because they are not prepared for a memory allocation failing; we assume those always succeed.
This was implemented by modifying the modesetting driver in the X server. We implemented support for DRI3 in this case, but without the need for Glamor; as currently written, it's just Glamor that enables DRI3. So on Raspberry Pi devices we can use DRI3 even though we don't have OpenGL for the composition of the desktop. There is a merge request for the X server, but there is not too much interest in integrating it; we understand, because X server development has basically stopped at this point. But we have been using it downstream for almost a year, and it was a huge improvement for the users. With these changes we avoid the problems of the GPU memory subsystem. When we were about to release Bookworm, with the idea of transitioning to Wayfire as the stock compositor, the question was: what can we do for the older generation devices? We needed to rethink how we had solved these problems with the X server, now with Wayfire: do the software composition using the CPU, while still allowing hardware-accelerated applications. The problem with using Wayfire for software composition is that Wayfire is quite tied to OpenGL. It uses the wlroots backend, and as you have seen, parts of the code, mainly the plugins implementing the different effects, make calls to the OpenGL API directly, and we don't want that. The first thing is that wlroots already has a working Pixman backend, so you can just transition the parts of wlroots that Wayfire uses, with small changes, to the Pixman backend, and it works. The next part was reimplementing, with Pixman rendering logic, all the parts that were tied to OpenGL in the different plugins that we are going to use in the distribution. There are some that are quite complex that we didn't need, so we didn't convert them. This way we managed to get all the rendering done by Wayfire using CPU rendering.
The problem is that if you do that and you start doing blending operations, on this architecture they become really slow. Reading back from the framebuffer while doing the blending, with coherent memory where all the changes are flushed immediately, is terrible. So we experimented with using non-coherent memory for the buffers we use as render buffers. That means that if you write from the CPU and then put the buffer on the display, there may be no coherency, so you need to start flushing in some places; you need to handle that. Some funny things happen, because 32-bit ARM is different from 64-bit: things that work in one place, like just flushing the memory before putting it on the display, work in 32 bits but not in 64 bits, where in the end the flush is not doing anything on that architecture. So we need to handle the synchronization in the compositor itself. The effect of that change is that everything runs fast enough. The problem is that when you enable non-coherent buffers, you only want to do that for the compositor, not for the rest of the applications. That is complex, because some applications don't work with non-coherent buffers, and we are still deciding how to deal with that: maybe enabling it with a parameter, creating a new ioctl for getting non-coherent buffers, or something else. The other part, getting hardware-accelerated applications, was quite fast, because we already had the knowledge from doing this on the X server; in the end we needed to handle, in wlroots, in the Pixman backend, passing modifiers with the buffers, and it was already working. So I'm going to show the current work in progress. This is our Raspberry Pi 3 running the desktop, using the non-coherent buffers; otherwise you would see how slowly the windows move. The performance is quite good.
Some of the more complex and expensive things are the shadow calculations; you cannot imagine doing this on the CPU every time you scale a window, it's complex. You can see that glxgears here is using hardware acceleration; it's not the best we can do, because there is the possibility of using a different display plane so the compositor doesn't have to blend it, but for now we are blending it in the compositor. We have enough time, I think, so we are going to see several plugins working. As a conclusion: we are at the point of maybe thinking about shipping this to users, but it's still not ready. One of the things Raspberry Pi tries to do is maintain all the generations of hardware: you can run the latest Raspberry Pi OS on a Raspberry Pi 1 and it will work. We have already tested this on the Raspberry Pi 1, which has slower memory; Juan was testing that, and it was good enough, comparable with the results we get with the X server. Here we see Chromium running with hardware acceleration. The good thing is that, as we are not spending CPU memory on composition, we can run more applications. You can still crash Chromium by opening, I think, eight tabs, but in that case it only crashes Chromium, only that window, not the whole desktop. This is Zoom working, all with software composition. And I think that's all from us. This is the window switcher that Wayfire has by default, which we reimplemented with Pixman; we tried a simpler option, but this one was already working fine in the end. The most complex part we maintain is doing the transparency, using the alpha channels in software. So, questions? I think we are on time. Which features do you need the CPU jobs to actually do? And are they used a lot by applications; will that impact performance? Well, our colleague Maíra Canal has done a lot of the work on this, and the patches are already out.
The question is which features in particular need to be done on the CPU and cannot be done on the GPU. I think I already mentioned them: there are some things related to performance counters, mainly resetting the counters while the GPU is running commands; that needs to be done from the CPU, you need to write a particular register. Another one is getting timestamps, because there is no support for getting a timestamp from the GPU. And the other one is indirect compute shader dispatch: when you are dispatching several instances of a compute shader indirectly, you need to send them from the CPU one by one, because there is no support for it in the GPU. So now you just submit the buffer to the kernel and the kernel handles that; previously, in user space, you would send one, wait, and send them one by one. So, well, time's up. Thank you very much for your attention.
Delegated compositing utilizing Wayland protocols for Chromium on ChromeOS
So, hello everyone. My name is Maxim. I'm a browser engineer at Igalia, and today we are going to talk about delegated compositing utilizing Wayland protocols for Chromium on ChromeOS. Here's today's agenda. First we'll talk about the goals and the motivation of the project, why we have Wayland on ChromeOS and why it's in Chromium. Then I will talk a little bit about what Lacros is. I will also need to cover the Chromium display compositor a little, to give you an idea of how it works and why we actually needed delegated compositing there. Then the delegated compositing itself, the Wayland protocols, and the big picture of what we actually have. So, Chromium and Wayland on ChromeOS. There are quite a few vendors shipping ChromeOS on their devices, and as the devices age, they stop receiving updates, which leaves them with an old browser and so on. To improve that, and to improve the maintainability of the devices, it was decided to split the Chrome browser from the ChromeOS system itself, because they were tied together. That would also make it possible for these devices to keep receiving browser updates. But how is it possible to do that? The idea was to decouple the browser, as I said, from the operating system itself; that was called the Lacros project. ChromeOS itself has a system UI and window manager called Ash, and Chrome was tied to that operating system. At this point there was also a Wayland implementation already in ChromeOS, and it was decided to use Wayland. Basically, in 2015, if I'm not mistaken, ChromeOS got its own Wayland implementation, called Exo (Exosphere). It's currently used by ARC to run Android apps on ChromeOS, and also by Crostini to run Linux apps. And around 2016 we started to port Chromium to Wayland; on Linux, you can use Chromium with the headless, X11 and Wayland backends.
So it was kind of a natural choice to employ that implementation and have the browser running on it. Basically, Wayland is used for graphics and window handling, with the stable protocols employed plus some custom extensions. For high-level features like file picking, crosapi is used; it's basically Google's implementation on top of Mojo IPC, similar in role to Win32 and Cocoa. But what is Lacros? Lacros is a project to decouple the Chrome browser from the ChromeOS window manager, Ash, and from the system UI. In this diagram, the green box is the ChromeOS operating system, and the yellow box is the Lacros browser, which uses the Wayland backend through the Ozone layer. The Ozone layer is basically an abstraction in Chromium which allows you to implement your own backends; as I said, on Linux those are X11, headless and Wayland, and it's switchable at runtime. ChromeOS itself runs on DRM, but you can also use X11 and run the ChromeOS emulator on your Linux device. So Lacros uses Wayland to communicate with Exo, which is built into ChromeOS, and which forwards the input devices and handles some of the graphics communication. But there was a problem: this split resulted in performance and resource costs. Why, and how to mitigate that? To understand why it was causing a problem, we need to look at the Chromium display compositor and understand a little bit how Chromium actually draws frames. As you may know, Chromium has a multi-process architecture: we have a GPU process, or Viz service process, and we have clients, which are the renderer processes and the browser process. There is also the video client, which sends the video frames. Basically, we call them the frame sinks.
And the way it works, if we are talking about GPU acceleration and GPU rasterization, is that, for example, the renderer process prepares paint operations for the compositor frame. Then, when preparing the final compositor frame, we submit those paint operations to Skia on the GPU process; that is GPU rasterization, and it produces textures. These textures basically represent tiles, if we divide the whole window into tiles. The compositor frames have references to the tiles, plus some frame data like masks, filters, clipping and other state. On the right side you can see the Viz service process, or simply the GPU process. It represents clients as surfaces, and each surface has its own compositor frame. So we need to aggregate all the surfaces into a single compositor frame and do the final compositing. So, this is a high-level overview of how it was working before delegated compositing: Lacros was aggregating the quads, which would end up creating a final surface, and that final surface was, of course, represented by a single buffer. It was sent over Wayland to Exo. Then on the Ash Chrome side (Ash Chrome, which you can call ChromeOS), it was maybe combined with frames from other windows, some system settings window if you had that open, and the compositing was done once again at that step. So that resulted in double compositing and a big resource overhead. How to fix that? The solution was delegated compositing. Basically, we kept the aggregation step and created our final compositor frame, but the quads we got, which are basically the textures, all had to be sent over the Wayland protocol to Ash for the final compositing.
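The difference between the two schemes can be mocked up in a few lines. This is purely illustrative (the function and its return values are invented, not Chromium code): without delegation the browser flattens its quads into one buffer and the system compositor composites again, giving two compositing passes; with delegation each quad is forwarded as its own Wayland subsurface plus metadata, and only the system compositor composites.

```python
# Illustrative mock of delegated vs. non-delegated compositing.
# Returns (number_of_composite_passes, surfaces_sent_over_wayland).

def composite_passes(quads, delegated):
    passes = 0
    if not delegated:
        passes += 1          # browser flattens all quads into one buffer
        surfaces = ["final_buffer"]
    else:
        # one wl_subsurface per quad, carrying clip/corner metadata
        surfaces = [f"subsurface:{q}" for q in quads]
    passes += 1              # system compositor (Ash/Exo) composites
    return passes, surfaces
```

The saving is the dropped browser-side pass; the cost is more Wayland surfaces and the extra metadata protocol described next.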
And of course I need to say, basically, this is about serializing the Chromium compositor frame and sending it over a couple of IPCs through Wayland to Ash. At that stage Ash was deserializing the data it received and basically recreating the same kind of compositor frame for the final compositing. To achieve that, a couple of things were needed; well, at first I thought there were more custom things we implemented, but in the end it wasn't that much. So: Wayland subsurfaces, which are standard. Each quad (let's say we were sending quads as overlays) was represented by its own surface. Of course Wayland buffers and the explicit synchronization protocol, because we want the job to be asynchronous. And the main thing is the surface augmenter, because we wanted this data to be sent from the Chromium browser, basically the compositor frame, with additional information like rounded corners, clipping, and also pixel precision, which is one of the important things. For that we needed to make our own protocol extending the Wayland surface. Also, in the beginning we used our own protocol for single-color buffers, but since there is now a single-pixel-buffer protocol upstream, we just employ that one, so that we don't need to create a real buffer. At first, when nothing was there, we were just clearing a buffer to a certain color, but that's not really efficient. And why did we also need to pass this rounded-corner and clipping information? The reason is basically that when Chromium rasterizes the quads into textures, those do not have any masks. So when we do the final compositing step, we apply those masks, filters and so on and send them to Skia, which produces the final picture for us.
And for the pixel precision, the problem is that Chromium basically works in pixels, and since Wayland uses dips, it resulted in some pixel losses, and when it was compositing the quads together we could see some glitches. To overcome that, we added some additional stuff to the surface augmenter and started to pass this information using wl_fixed, basically, which allows us to pass fractional values. It was also required to update the viewport destination and some other stuff, like setting the transform and the damage, because when we, for example, change the order of the Wayland subsurfaces, this z-order, at some point the question is whether we need to recalculate the damage or not. Basically, all of that is managed with this additional information. There can be some other stuff, but I would say those were the most important ones. And so this is the big picture of how everything is implemented. On the top we have the Lacros viz process and the Lacros browser. Lacros viz is basically preparing the frame with the quads and sends it over Wayland to Ash Chrome, which then creates the same compositor frame that Lacros would have had if it wasn't delegating but compositing itself. It prepares the final frame, prepares the overlays and sends them to DRM, and that's it: you have the final frame with the system UI and the browser content as well. That's it. Questions? No, go ahead. Yes? Well, I can just repeat the question. Okay, so the question was whether GTK and Qt can also benefit from that. Do you mean the Chromium browser, or the toolkits themselves? No, just regular apps using GTK or Qt. Yeah, I think so. Basically, wherever you have this double compositing, it is possible. We had to use some additional protocols because, Chromium being a really closed environment, we can do whatever is convenient for us.
But I think it is possible for GTK to get this improvement in performance as well, because if the Wayland compositor can do that, why not? Yeah, basically in a similar direction: Chromium on a regular Linux Wayland compositor would benefit from such features as well, I mean, there is double compositing there again. So, have you looked at getting upstream or generic protocols to manage that? Right now you have custom protocols, right? But for it to work on regular Linux you'd need, yeah, a generic protocol. Have you looked at doing that? So the question is basically whether Chromium on Linux can benefit from the same implementation, and whether we considered creating some generic protocol and upstreaming it. Well, if we get back to the pixel precision and the rounded corners: for the pixel precision, if the browser doesn't run at some custom scale, the scale is one, right? So it's fine, we don't need that kind of protocol. But for the rounded corners, well, we could probably do this processing on the Chromium side, but it's not very efficient. It should be possible, but creating a protocol and upstreaming it would take quite some time. I personally did not think about that, but it's an interesting concept for the future, of course. I mean, especially for embedded it can also help: if, I'm guessing, one of the subsurfaces offered is, for example, the video in the browser, and the compositor on the embedded device can then put that video on a plane while the rest is not re-rendered, then you can benefit from these kinds of things much more easily. Yes, of course. Do you delegate all the compositing, so the compositor can decide what to put on a plane? Well, at least we can submit the video frame as an overlay. If I'm not mistaken, there was somebody doing this forwarding in Chromium; I actually saw the patches. I think that landed later. Yeah, probably, yes.
I didn't pay attention to that, I was busy with the Chromium itself. Yeah. Yes? What's the granularity of these subsurfaces? Like, how many would you expect to have on a regular webpage? Are we talking almost every screen element, or is it more coarse-grained? Well, if you just take a normal page, right... So the question is how many subsurfaces we are going to have, I mean, how the page itself is divided, whether we are going to have a subsurface for each element or whether it's done in another way. Well, basically, if you imagine a simple page, with no additional textures and so on, we can split the page into tiles; there will be, I don't know how many, maybe six tiles, something like that. So basically that is how much you are going to send. But if you take, for example, MotionMark, there are some tests, like the images test, that can create hundreds of those textures. Then we start to send all of them over the pipe. But there is a limit for the IPC, so we have to limit the number of quads that we are able to send, and if I'm not mistaken it's limited right now to 50, because beyond this value it just doesn't make sense to do any delegation; it becomes too expensive, I mean, there would be too many subsurfaces. If we could squash these together, that would definitely help, because it seems like this wasn't a use case that was thought of when Wayland was designed. So, any other questions? Thank you.
Flutter in Embedded
Okay, we can start. Hi everyone. Today I'm here to present Flutter in embedded systems, precisely embedded Linux. A quick presentation of me: I'm André Rikki, I'm from Italy, and I work at Amarula Solutions as an embedded Linux consultant and developer. My background is mostly C++, on both console and UI applications using different frameworks, and recently also Flutter. In this talk I'm going to present the Flutter framework from a developer's point of view. So we will not go deep into how the framework works internally; we'll see how it integrates with the most common build systems such as Yocto and Buildroot, and if there is enough time I'll show a quick video of a commercial product that we developed with one of our customers. So, what is Flutter? Flutter is a UI framework developed by Google; it was first released in 2015 for Android and was later ported to iOS, Windows, Linux and web applications. The idea behind this framework is to have a single framework and codebase to create good-looking UI applications, natively compiled and multi-platform. It uses Dart as its programming language, and we will talk about that later. So, let's go through the advantages of Flutter. First of all, Flutter is fast, because it compiles natively for ARM and Intel machines; you can expect great performance both at startup and at runtime. Also, the idea was to help developers achieve 60 frames per second, so you can expect fluid and responsive UIs on any kind of device. Now, let's talk about Dart. Dart is the programming language used by Flutter. Being modern and designed precisely for UIs, it comes with multiple advantages and tools that are really helpful when dealing with UI applications. First of all, the language is completely asynchronous, so it's practically impossible to have the application freeze.
By contrast with C++, for example, I saw multiple times in my experience people opening files or doing blocking operations in the UI loop, and performance was really bad. In Dart that's impossible, and the architecture of Dart abstracts away the complexity of typical shared-memory multi-threading. So, this is really, really important in my opinion. Another important point is that the language is completely null safe, so it's practically impossible to have a segmentation fault in a Dart application. This is another really important point, because the UI application is what the final user sees, so it's important to have it always responsive and alive. Finally, all the error management is handled through exceptions. So even if there are errors during the execution of the application, exceptions are simply thrown, and even if they are not caught, the application keeps running; we will see an example in a later slide. And all of that in an easy-to-learn language with a familiar syntax. I came, as I said, from a C++ background, and working with Dart was really smooth and really easy to learn. So, Flutter is fast, Dart is great, but that's not all. Flutter and the entire framework are also productive. By that I mean they allow development and maintenance of applications with real ease. One of the most important keys is hot reload and hot restart. Hot reload is the ability to apply changes without recompiling and restarting the application. As you can see in the GIF, by simply changing the code from a dark theme to a light theme, without recompiling, the changes are directly applied to the running application. This is really important: it reduces a lot the development and maintenance time, and also the stress on the developer. Also, being modern, it comes with a set of useful tools such as static analysis, widget introspection, debugging and logging, and much, much more.
Finally, Flutter is flexible. As I said, it's meant to work multi-platform, and you can expect the same look and feel of the application on any screen. This means that, from a more commercial point of view, if you need to deploy an application on embedded and also on mobile and Windows, you can expect the same look and feel, and this is really important from the final user's point of view. Also, the ahead-of-time compilation and native compilation of the code allow for really fast and great performance on any platform. All of that in a single programming language, of course. This is my typical setup: I use Visual Studio Code with the official Flutter extension, and on the right is the running application. As you can see, on the top left there are the typical running and debugging tools such as start, stop and step into, but also hot restart. Hot reload is embedded in the Flutter extension, so if I apply any change in the code and save the file, the change is automatically reloaded in the page. At the bottom there is the debug console. As you can see, an exception is thrown because the application tries to save a file when starting, but even with some errors the application is still running and works without any problem. In my opinion this setup is really easy to use and really productive, and I think that from the developer's point of view it makes maintenance and development really easy and less stressful. I'm happy to see that Flutter is becoming more and more popular in embedded systems, precisely Linux. Yesterday I saw a different talk on this topic. And it's important, because the community is really active and huge, and it's becoming more popular on Linux, so if you have any difficulties or are facing any problems, most of the time you will find a solution online. Then there is a huge list of packages.
There is an online repository where free packages are hosted, developed by the community. They come in different types: I use packages to display, for example, Lottie animations or SVG files, but there are also packages that are more code-oriented, such as MQTT communication or file parsing. Flutter is actively developed and updated: in the last year I had to update the Flutter version both on my laptop and on the target multiple times, so Google and the community keep updating the framework, improving security and adding new features constantly. Finally, it is used by big tech companies, first of all Google, its creator, but also BMW and Toyota, and those companies keep the project alive by contributing, because Flutter is completely free and open source. It is under a BSD-3 license, so you can use it without any trouble. Now, let's do a quick comparison with the most famous UI frameworks for embedded Linux. First of all, LVGL. The first point, for me, is the most important one: C is not Dart. C is a really powerful language, but when it comes to UI applications it's not so easy to use, and it's really easy to mess things up. Dart, instead, is designed for UI applications, and we saw all the advantages the language comes with. Also, hot reload and hot restart: there is no way to achieve that in LVGL; of course, you have to rebuild, recompile and redeploy the application every time, while Flutter has this amazing feature. Flutter also has more platforms supported, because LVGL can only run on desktop or embedded, while Flutter can also run on mobile. There are more packages available (call them libs, as we saw in the previous slide), and of course Flutter has a bigger community behind it. Finally, with Flutter it is much, much easier to build and publish the application. We'll see in a later slide how to integrate a Flutter application inside Yocto; you don't have to mess with build arguments.
It's all handled by the framework and the Yocto project. With LVGL, instead, if you need to cross-compile it can be a bit tricky. Then, Flutter versus Qt. C++ and QML versus Dart: C++ is a step up from C, but it can still be quite difficult. I saw multiple times Qt applications having really bad performance issues because the C++ was not optimized. QML is designed for UI applications, like Dart, but first of all it's an interpreted language, so if you start doing any kind of logic inside QML, the application performance will be poor. And I think that Dart is still much better for UI development. Here again, hot reload and hot restart: it is possible to achieve them with QML, but in my experience I was never able to do that. Most of the time QML is strictly connected to C++ for the models and such, so you of course need to recompile everything. Third point: Flutter, as I said, is completely free and open source, while Qt has commercial licensing, so if you want to use Qt in a commercial product you will probably end up paying a lot in royalties. And finally, I think Flutter is rapidly improving. I mean, Qt is improving too, but its release cycle is much slower compared to Flutter, so I think this one is also really important. So, we saw a lot of advantages and good points, but not everything is perfect. In my experience, one of the, let's say, tricky parts of Dart is that when working in embedded Linux you expect, coming from C or C++, to be able to do anything you want, for example accessing the hardware directly from the UI application. In the product that we'll showcase later, I had to read a proximity sensor input directly from the UI application simply to turn on the display. This was not possible, because Dart doesn't allow you to access the hardware directly; the structure is a bit complex to use, so, long story short, I was not able to do that.
But there is a solution: the foreign function interface, also known as language bindings. What is possible is to create a C library with a public interface and then call those methods directly from the Dart application. This is really important, because we can solve a lot of the issues related to things that are too complex for the language by using a C library. At the startup of the application the library is loaded, and then I can call the public functions directly. By doing so, I was able to solve the issue and read the proximity sensor input from my Dart application. Now, how to integrate Flutter into your project? Well, for Buildroot there is the Flutter package developed and maintained by my co-worker Adam Duskett; he has done a great job on this package and is currently maintaining it. In my experience I used Yocto for my projects, so I'm a bit more into that. There is meta-flutter, hosted on GitHub, which is the, let's say, official Flutter layer, maintained by some guys from Toyota and the community. Integrating Flutter inside your operating system is really, really easy: just include the layer, add the dependency in your image, and you are pretty much done. The Flutter engine and the Flutter embedder are automatically compiled and added to your system. Obviously, you will also need to include your application; you can use the flutter-gallery recipes as a reference, pretty straightforward stuff. You just copy the recipes, adapt the repository, maybe adapt some build arguments if you want, then add it as a dependency, and it's done. You don't need to mess with any cross-compiling related stuff or any of that. On GitHub, on my page, I have a repo manifest that I made almost a year ago, hopefully still working, for a Tinker Board: it simply creates the Yocto project, downloads all the layers needed, and then adds the dependency inside the image.
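In a standard Yocto setup, the integration just described might look roughly like this; the layer path is illustrative and the recipe name follows the flutter-gallery recipes mentioned in the talk, so check the meta-flutter README for the authoritative instructions:

```conf
# conf/bblayers.conf: make the layer visible to BitBake
# (path is illustrative; clone meta-flutter next to your build dir)
BBLAYERS += "${TOPDIR}/../meta-flutter"

# In your image recipe: add the sample app. The Flutter engine and
# embedder are pulled in and cross-compiled automatically as
# dependencies, as described above.
IMAGE_INSTALL:append = " flutter-gallery"
```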
When you compile, you can simply put the result on an SD card, and you will have the Flutter Gallery on your hardware. Now I'll show a video of the product that we developed with one of our customers. The device is an intercom, running a Rockchip processor, so obviously a bit more powerful than the Tinker Board. This is the intercom device: it's able to connect to another device on the other side with video and audio streaming. Under the hood there is a lot running, but as you can see the application is still smooth and responsive, and the performance is really, really good. This is a custom keyboard that we developed to achieve the design the customer wanted, and these are some Lottie animations running with a package that I included from the package repository. The final result, in my opinion, is really good. We as developers are really happy with it, and the customer is happy with it from the commercial point of view; that's why I was presenting Flutter today. So, if you have any more questions... thank you everyone. Great. You mentioned that Dart is a custom language, but there's an MQTT library. How does that work? Do you have to implement each protocol in Dart again? So, the question is whether, for MQTT for example, there is custom stuff running under the hood. Not really, because in reality Dart is compiled natively, like C++. For MQTT, I didn't really look into it, because I downloaded the package, it was working flawlessly, so I included it, ran it, and I was really happy with the result. So, yeah. Thank you for the talk. I'd like to know what the memory footprint of the Flutter engine is, I mean, flash and RAM. Okay, the question is the memory footprint of the Flutter engine. Let's say that one of the disadvantages of Flutter is that it's a bit more resource-hungry compared to Qt or LVGL.
I think that the Flutter engine was like 14, 15 megabytes on storage, and for memory I didn't run any kind of profiling on the hardware, but I think it's comparable to a Qt application when running. Yeah? With Yocto, do you really need a big operating system underneath, or is it capable of running on, say, FreeRTOS or something like that, a leaner operating system? Okay, the question is whether Flutter is able to run on smaller operating systems such as FreeRTOS. Or perhaps bare metal. Oh, bare metal, I don't think so. I think that at the moment it requires a full operating system. Okay, thanks. I have one more question, about integration into an existing project based on Yocto. You said we just integrate meta-flutter and we do not have to take care of any cross-compilation. How does that work? My existing project is compiled with GCC, and I read that Flutter has a dependency on Clang. So the question is how it is handled in the layer and how the cross-compilation is managed. I think that everything is handled by Yocto and the meta-layer; meta-flutter is really well done. You can simply download it and add the layer dependency in your Yocto project, and if you include the Flutter engine dependency in your image it will automatically be compiled, because Yocto manages everything about that, so you don't really need to take care of it. Hi. Hi. What about Flutter, Yocto and ARM32? Sorry? ARM32, AArch32. Have you had any experience? So the question is whether Flutter is capable of running on a 32-bit platform, ARM32. No. Simply no. Do you know of any port, any ongoing project? No, at the moment I don't think so. Yeah, the question is whether I know any company that is moving from Qt to Flutter. As the video before explained, one of our customers was mainly using Qt for their applications and is now moving to Flutter.
So I think that because of the open-source licensing this is really tempting from a commercial point of view. Yeah? It's a bit of a side question: for the current project, did you do the whole project in Flutter and Dart? Yeah. Okay. In your company, do you also have projects where part of the product is made in C++ or something else, and how would you integrate that with Flutter? Okay, the question is whether the whole project was made in Flutter or whether there are also other applications written in C++ running. Well, in this case the UI runs with Flutter, and there is a set of microservices running under the hood written in C++. For example, they take care of the video and audio streams and all of that, and the application communicates with those microservices via MQTT, for example. Okay, I think there are no more questions, so thank you very much. Thank you.
Building Cross-platform GUI apps with ease (and Go) - desktop, mobile and beyond!
Thanks very much everybody for coming here to the graphics room to listen to another talk about building natively compiled applications that are going to work everywhere. I had the title slide up and realised it didn't actually say Go on it, so I just wanted to get that out there right away. It's very exciting to be here in the graphics dev room, presenting in the same place where fantastic people over the last decades have shown great new features in KDE and GNOME and had fantastic discussions around all of that, and hopefully I can bring something new and interesting to the room as well. Just out of interest, to get this right (I recognise some faces from the Go dev room yesterday), maybe a show of hands: who has programmed Go at all? Wow, okay, cool, probably unusual for this room. And anybody who is a C developer, just in case I need to go back to some common ground? Right, okay, cool, well, thanks very much. So, just a little bit about myself. Hi, my name is Andrew, it's really nice to meet you all. I am a software engineer, and have been for 20 years I think now; I stopped counting. I work a lot in startups, either my own or other people's companies, solving interesting technical and personal challenges, building teams, all that kind of stuff, and I've written some books and gone on a couple of podcasts on the topic of building applications like the ones I'm going to show you today. I have a background in open source: if you've seen me here before, it might have been talking about the Enlightenment project, where I spent a lot of my time, and before then the Maven project as well. I started the Fyne project, which I'm going to present today, to build graphical applications on top of Go, in 2018, and I have been a Go developer since two weeks after the project was founded. I'll tell you a little bit more about Fyne as we get into it, but I didn't pick up a language I wanted to learn and then decide it needed a graphical toolkit.
I had an ambition to make getting into graphical application development so much easier. I knew what I wanted to achieve, and then I hunted for a programming language, and I don't know if it's a good or a bad thing to say, but I wanted it to be Rust, I so wanted it to be, but I couldn't figure it out, and so I picked up Go and I haven't looked back; I've never felt more productive. My day job is at Fyne Labs, which is a company set up to help businesses get more out of the types of platform-agnostic technology that I'm going to show, and so we have products and services that can help companies working in this space. So, I don't know whether people would think that Go is a strange choice of programming language for building graphical applications. It's certainly what the Go development team have said over the past few years, although I think they're coming around now that they've seen how easy it is. But just to summarize the benefits for anybody that doesn't know: much like Dart in the previous presentation, it's going to allow you to write applications that compile natively for absolutely any device, so they can pretty much run anywhere, from desktops through mobiles and WASM in the web browser to embedded devices as well. It's important to me also that there are no runtime dependencies. These pieces of software should drag and drop, or install through a store in the usual manner, without any need for additional steps: no runtime setup, no hidden preconditions required to get the applications running. We may have to do some work as developers, but we take the pain so that our users get the big benefit. We're going to deliver native performance: these applications are compiled down into the same machine code as any piece of software built with C or other platform-specific technology. But fundamentally, I thought it was important to lower the barrier to writing graphical applications, to help people realize it's not so difficult.
It's something that you can see and do and have installed on your device very quickly indeed, and Go provides the ability to do that whatever platform you're on; but also the standards, the tools and the techniques in the language help to make everything easier to understand. There's good documentation, standard ways of writing things, unit testing built right in, all of those good things helping to promote good engineering principles. And so for me this is why it made such a good fit, and it's why the Fyne toolkit picked Go, because we want to be the simplest way possible to get people building beautiful, usable, native graphical applications, without having to think about any changes that might be necessary to get them running on any particular device. So the Fyne project, like I said, started in 2018, so it is now six years old, possibly as of this weekend actually; complete coincidence, I was not sitting in a FOSDEM room when I thought of the project, which is a shame, it would have been a good story. It has become the most popular graphical toolkit for Go, which is pretty exciting. Over the years quite a few have started, and it's nice to have choice. They have started perhaps with different technologies under the hood: some are using embedded web browsers, for example, and others are interested in enabling more control, more power, where we're focused on the simplicity and the ease of use, I suppose.
OSS Insight, if you track them on Twitter, X, wherever they are, have ranked us sixth out of all cross-platform graphical toolkits, which is very exciting, although for some reason Qt and GTK don't seem to be listed in their top 10, so how they came up with the numbers I don't know. But it puts us up there with others like Flutter, React Native and other names that you will have heard of. And just last week I realized that we had got into GitHub's 1000 most popular repositories across the entirety of their code base, which I think is something like 350 million repositories. As part of the Go ecosystem we make use of the really excellent and welcoming community that they have established over there, and across Slack, Discord, Matrix and in-person meetings we've got about 2,000 people that like to get together and talk about building applications, offering help for people who want to get started. So let's get a couple of pictures on these slides as well. This is the Fyne demo application, so if you're interested in checking it out you can load this right now; it's in the standard repository, as we ship a few demo applications, and if you're on the Google Play Store you can download it right now onto your phone and see how it renders on a mobile device. Hint: it looks exactly the same, except it's adjusted for the different screen sizes. And of course, as a developer at heart, light mode is no good to me; we ship dark mode as well, sorry, they're both in there by default, and it will pick the right variant of the theme depending on your user preferences.
Go is known for being easy to compile across all different platforms from whatever developer device you have, which is fantastic; it's a good place to start. But because we're going to be doing some graphics programming, and we want optimized binaries that are going to use your hardware acceleration, we've got to get a little bit of C in there. Under the hood you're never going to see it, but we do need a compiler installed as well, so you'll need to install Go and GCC, or Clang, or a compatible compiler. If you're unsure whether you've succeeded in setting up a development environment, we have a Fyne Setup tool which will verify the runtime; it's linked from the Getting Started pages, which I'll reference later, and it's just going to check that the Go compiler and the C compiler are found, and catch typical challenges around having your path set up properly so that tools are discoverable. And we have a tool called fyne that's going to be useful for our packaging later. So there are a couple of steps that we need to do to get started with a project, nothing like if we were starting with a C code base, but nonetheless it's something to be aware of. We need to make a directory for our code, and we need to initialize the Go module. This is a step that was introduced relatively recently; it adds much more powerful dependency management to a Go project. It used to be that you could just open a file, save it, run it, and you would have an application displayed, and I'm trying to coax the Go team into allowing that as a default for the really early stage, because the mantra is: start with the smallest thing possible and then add to it over time. So apologies, we've got a couple of steps there that you need to know. We call go get, which is going to grab the library; Go looks all of the stuff up on the web through a pretty efficient caching mechanism, and as you can see that's a URL: it's finding the source code, and that's going to download it into the module that you've just
created actually it's referencing in the module and putting in a common space so you don't need to download it again for another project and then we're going to edit our Go file I'm calling it UI.go because I'm really good at naming this is the code that we're going to put in it not adventurous to live code I'm afraid I'll step you through it we have package main because every application enters through the main package we're importing two packages they're in the same namespace of find.io slash find slash v2 because this is our second major API and it's the app and the widget sub packages that we're going to be using the app package sets up the runtime pools in the appropriate drivers for the device that you're running on and then is going to bootstrap the application and the widget package we're using to add something into our window our main function again probably no prizes here forgetting that's the entry point for a Go app is creating a new find application that's invoking the driver is creating a new window from the application with hello as its title so if your device has title bars hello gets popped in there and then the one line which is basically our entire user interface says set the content of the window to a new widget a new label widget that says hello fine and then we then call show and run on the window which is a little shortcut for show my window run my application if you're not familiar part of the challenge with graphical apps is they do have to run in an event thread the operating system has specific requirements for things we just bundle it up as simply as possible so there we have I think four lines of code and a couple of import statements we can type go run full stop the period there is just say the current directory you could equally have said main.go because we only have one file and it's going to load this picture this window here says hello fine I was running this on a desktop in dark mode at the time which is why it looks that way wow 
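The program walked through above is only a few lines; this is a reconstruction from the description (the exact code is in Fyne's getting-started material, so treat the details here as approximate):

```go
package main

import (
	"fyne.io/fyne/v2/app"
	"fyne.io/fyne/v2/widget"
)

func main() {
	a := app.New()                               // bootstraps the runtime and picks the right driver
	w := a.NewWindow("Hello")                    // "Hello" appears in the title bar, if the device has one
	w.SetContent(widget.NewLabel("Hello Fyne!")) // the entire user interface: one label widget
	w.ShowAndRun()                               // shortcut for "show my window, run my application"
}
```

ShowAndRun blocks until the window is closed, which is why it comes last; everything before it is just building the widget tree.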
Yeah, I can see you're really, really excited about a hello-world application. I mean, I was, the first time it appeared on the screen, but that was a few years ago. So let's do something a little bit more interesting and show that it's still going to be easy to do something useful: we're going to make a markdown editor. Built into the standard widget package we have an Entry widget; that's going to take the user input. We're going to use a RichText in our application to render the output; part of the reason this is going to be really straightforward is that a RichText understands markdown as a source for the information to mark up a text document. And we'll use a horizontal split container for laying out our user interface. I showed you widgets before, but containers are sort of a type of widget that holds multiple things, with a layout that describes how those things should be arranged on screen. You don't position widgets manually: you have an area, and a container fills it, which means we can adapt to screen sizes and orientations very easily. Widgets don't have to think too much about how they're placed or what type of device they're running on, which is actually very powerful when you don't want to think too much about what system your application will run on. And I'm going to hook the two together with an OnChanged callback, so that when the user edits their text, the runtime will update the preview. Okay, so I said four lines; it is a little bit more than four lines of code, but we have the same imports, with the addition of the container package, and we're starting the application and window in the same way, although, as you can see, a very exciting new title is going to appear on our window. The editor field (sorry, I can't seem to point; anyway, you can read it, it's not a lot of text) is a new multi-line entry, which is a standard Entry widget with more than one line in it. We don't need to specify how many lines, because it will fill the space available. We have a new RichText; we're saying load markdown, but we're loading nothing. As you can imagine, if you passed a markdown string in there, it would actually render it as it loaded for the first time. Then the hook that I mentioned is again one line of code: OnChanged on the entry passes a string to whoever is interested in what changed, and the ParseMarkdown function of our preview accepts a string, because you would parse your markdown from a string. So we can assign one function to the other: when OnChanged fires, it calls ParseMarkdown. We avoid signals and slots, string-based IDs, and comparisons to connect multiple widgets together, and just use a single line of code instead. And then the most complicated piece of code in this entire snippet is the container. We're using an adaptive grid, which is like a grid, except that what adapts is the number of columns or rows it should have. A standard grid has its columns or rows specified, and as content reaches the end, it flows onto another; an adaptive grid decides whether to use columns or rows based on the space available. So if we were loading this on a phone in portrait mode, one widget would be above the other, and in landscape, one would be to the left and one to the right. So, sorry about the sneak preview before, but this is a markdown editor. There we go, that's better. Thank you, you're too kind. As you can see, this is not difficult: we have the entry widget on the left, we've typed some markdown into it, and it has rendered on the right. There is a link which you could tap, and it has referenced a local image as well. That's quite cool, but this is cooler: it's exactly the same software, packaged as an IPA and dropped into my iPhone simulator (actually it's a .app, because it's a simulator, not a real device), but exactly the same code could be dropped onto a device as well. As you can see, it's also running in landscape, and the arrangement is the same. So there, that is the application running across multiple different platforms. How did we get it there? Compiling for targets that aren't your current machine is a little bit more complicated, but let's start with compiling locally. The fyne tool, the helper that I mentioned before, is pretty important and very helpful, as many helpers are. You can get it from the URL there; that go get command will download it and put it into your path, and then you can use it to do helpful things like package the application or install it locally. As I'm sure you're aware, the binary you get out of a compiler is fantastic and efficient, and you can move it around because it is portable, but it doesn't look good and you can't put it in your start menu. So fyne package gives you a binary with whatever metadata is necessary around it: it will inject an icon into an .exe for Windows, or it will put the icon and desktop file into the appropriate places on your Linux system. And fyne install, on the second line there, does all of that for the current system and installs it into the right place for you: /usr/local probably for most people here, or the start menu, or your Applications folder on a Mac. Line three there is how we do it for a different platform, because we can't just invoke the compiler; we also want to package it differently, so an .exe will appear instead of whatever our native format is. That uses local tools, so if you're familiar with cross-compiling in C, having a toolchain and specifying the CC variable is likely to be needed for some of these cross ports; we'll come back to that in a second. The fourth one there is how to build an Android application on our platform. To do that you just need the Android SDK installed; essentially quite straightforward and relatively portable. The only reason it's a bit more complex is
that we need to say what the application ID is, because the sandboxing and the operating system's rules say you can't just be an anonymous piece of software, so we pass that in. There is also a metadata file, FyneApp.toml, that you can use if you prefer to avoid command-line arguments all the time; I'm not going to cover it, but it's there and can save you a little bit of pain. But what if you don't want to manage multiple developer toolchains, or install packages even for your local environment? You might not want to, or you might not be able to. So, contributed to the project, there is fyne-cross: Luca Corbo and Cédric Bail, who many of you will know, and another contributor, Jacob, have pulled together a Docker-based build system with a very standard command-line front end. So, much like you would say fyne package -os windows, you can say fyne-cross windows, and it will take your application, bundle it up inside the Docker container, put the binary back into your current directory, and exit the container. It helps you avoid all of the setup; if you don't mind running Docker or Podman on your local machine, that's going to be super, super helpful. I very briefly want to touch on some more interesting parts of the toolkit, because it's not all about just showing graphical elements on screen. One of the hard things about making applications portable is the file system: we take it for granted, but we shouldn't, because it's not always there. So we've provided dialogs to open and save files, and a package that helps you manage storage in an abstract way, even more abstract, actually, than the recently added Go file system package. It doesn't assume file paths; it uses URIs to uniquely identify any data source, so your data could be remote on a network. Somebody made an application to browse their Steam library: they connected it through the storage API and used the file-open dialog to browse their Steam library. Cool. But why it's really cool is the picture on the right here. I've asked my application to open a file, I've put that onto my iPhone simulator, and it has shown me this file-picking dialog. I don't know if people are familiar or not, but this is what comes up if you have an iPhone set up with an iCloud account: I can pick data off the cloud, or I can back out and go to the Dropbox picker, where I might have something stored. So third-party applications can provide data as though they were files, because we're not making the assumption that they're files. And if you get further into this and you want to separate your UI from the data that you're managing internally, to separate state from rendering, then we have a binding package. You can pass around a string binding and not have to remember that it's going to a label, or you could multiplex: I have some data, and it's going to go to two or three widgets. Most of the standard widgets provide a WithData constructor, so I can pass the data binding in, and that's two-way data binding; everything is always kept up to date. So, two pretty helpful things, but I wish I had more time to tell you more. Obviously there's a full widget library, or I wouldn't be shouting "hey everybody, you should try this". We have a dialogs library, and full-featured forms as well, which, surprisingly, is one of the things that can be a little tricky to get working in a mobile app; menus; some more complex containers than I have shown you. We have notification integration, and a system tray for desktop, popping in and out wherever it happens to be appropriate for the device you're on. And we've provided native access to APIs that you might not have in Go: if you need to use a library that's not available in Go, you can call out to it natively through a C API. The Go team have done a fantastic job of making that integration really easy; you essentially just import "C" and call things in the C namespace, and it works pretty much transparently. Again, there are some complications if this is Android and you want to access the NDK: you need the JVM instance, so we've provided some native integration that gives you the context necessary. I'm not going to step through that today. However, there is a little bit more than that. It wouldn't be a presentation in the graphics dev room if I wasn't able to say "but hold on a second": we built an entire desktop system using this API stack. The presentation that I've just run through is in an app called Slides; it is a markdown file (we support markdown rendering) pulled together in a Fyne app. The terminal here is another Fyne app. In fact, the desktop system, everything in front of us, is all rendered in Fyne: it is Go APIs, and very, very easy to understand. So there you go; I feel like I've fulfilled what's necessary to consider ourselves a serious graphical contender. If you would like to learn more, well, I'm here, and I'll hang around outside if anybody wants to chat. There's a lot of documentation online; like I said, the project's been going for a while, so you can find a lot of what we have at docs.fyne.io. There's also a pretty good video channel on YouTube where you can find tutorials and examples, and do search for Fyne tutorials outside of what we have, because there are translations in, okay, I'm not going to list them in case any of them are politically sensitive, but in plenty of different languages, for folks who want to try a platform that's this different to the standard ones available. There's a book available about Fyne that I wrote; you don't have to buy it, but it's out there. And if you would like to contribute, we would really love it if some people came along and helped us improve this project. Everything is on GitHub, including the documentation, the websites, and the examples; you can find it all in the organization, and the main repository is simply called fyne, where you can find the source code to everything I've shown you today. We're of course looking for sponsors, but who isn't? If you love it, help out in whatever way you can. I appreciate your time, thank you so much, and I'll take any questions that you have.

Excuse me, please, yes. Oh, I'm sorry, of course: the question was, what's the support like for iOS, is it more complicated than Android? It is very complicated for us; it is trivial for you, the developer using the toolkit. You don't need to think about it, you don't need to do anything at all; the tools that I've shown you will create any type of application from your code. The one proviso is that if you want to put it onto an iOS device, Apple is going to insist that you own some hardware that they have produced. It may or may not be possible to do it in other ways, but that's the license. But fundamentally, this is platform agnostic; the APIs are all guaranteed to work absolutely everywhere.

Sorry, the question was a little hard to hear: could I give a sense of whether there are certain kinds of application that are a particularly good fit, or maybe a less good fit, for this framework? So, are there applications that are a good fit or not a good fit for using Fyne? I think the easiest answer is that if you have a document-rich piece of content, if you're helping people browse archives of documents and things like that, you'd probably be better off with a web framework, honestly, because that's what it's built for. If it's more interactive, if it's graphically driven, then that's something we're going to be much better placed to do. Fundamentally, if you want to get this out to many people, we're going to alleviate a lot of the pain of getting it out there quickly; some of the things that other toolkits might offer as built-ins or community add-ons might take a little bit of time to implement, but you've saved a lot of time up front. I wouldn't go and implement games, because we don't offer the 3D acceleration as part of our API; we just use it internally for the speed improvements. But it has been used for such a wide variety of things; we have a remote desktop application streaming 60 frames a second, full screen, so that kind of thing is pretty cool.

Can we squeeze one more in? Yeah, just there, please. Yes: you mentioned using OpenGL as the back end; under iOS, for instance, are you going straight to Metal, or are you using, say, ANGLE or some other solution? Okay, so, what are the graphical back ends that we're utilizing? It is OpenGL on the desktop, you're quite right, and we are using GLES on the mobile platforms, iOS and Android. I'm aware that some of these have been deprecated and they may change over time: on the desktop, Apple is trying to kill off OpenGL on the Mac; they've not really said that they're going to kill GLES on mobile, but it's inevitable that they will want to. We're looking in the future to build more platform-specific engines, for performance, and also because, ironically, it offers slightly better portability if you build for everything separately internally. But we've designed the API so we don't need to make those decisions: it's really easy to use, it's going to work, and over time we'll adapt the back ends to be more efficient, or whatever is needed by the platforms. We do updates every four to six months so that we can keep up with the specifics of each of those platforms, so if we have to look at a different back end, it'll be there before you have to worry about it. Thank you so much, everybody; enjoy the rest of your day.
HPC Container Conformance
Okay, next lightning talk. We don't have a lot of time to switch between speakers, so please take a seat. The next lightning talk is Christian, who's notorious for being very good at staying on time. I did a very good job once, and I still benefit from that. So, if you've seen me before, I talk about containers a lot. This time I would like to give an update on the HPC container conformance project, which we started, or I started, last year, and which got a little addition when, together with some other folks, we created an OCI working group. So, what is the problem, or maybe let's just call it a challenge? Everyone knows modules, right? If you're new to containers and you use native code, you most likely use modules to figure out what the best binary is for your program on the current system: you do module load gromacs, and the module system picks the best binary for the system you are on. So it's a runtime decision: you have a bunch of software in a software share, and it just picks the best one. Problem, or not a problem; I think it's a good thing. With containers, you don't want to have a lot of binaries, or different variations of binaries, within the container. You want one: a single set of libraries and a single binary for a given problem. So what we end up doing is creating multiple containers for different systems, say for a CPU like Graviton, Skylake, or Zen 3, or maybe we even use a name that identifies the cluster we are running on. That's fun, but the problem is: how do you pick the correct image? Within the container space, you have something called an image index, which is just a matching artifact that says: okay, you are on an Arm system, you get this image; you're on an AMD or Intel x86 system, you get this image; and if you are a wasm person, you get yet another image. But the thing is, that's not fine-grained enough, right?
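For context, a standard OCI image index only distinguishes images by platform. A trimmed-down sketch of one (digests and sizes elided) looks roughly like this:

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:...",
      "platform": { "architecture": "amd64", "os": "linux" }
    },
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:...",
      "platform": { "architecture": "arm64", "os": "linux" }
    }
  ]
}
```

The platform object has no field to say "this amd64 image was built for Skylake" or "this one assumes a particular MPI", which is exactly the gap being described.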
It's very coarse-grained: you cannot just put all your different x86 builds in it. What we actually want is an image index that is more specific, so that it can say: this CPU and this accelerator gives me this image; if I have that CPU, I get another one. Maybe it's even built with MPI in mind, so that if I have MPICH in this version, I get this image, and if I have Open MPI, I get a different image. You get the idea: have a, maybe long, image index with different variations, and then pick the best image. Another thing I didn't mention on the first slide: runtimes go through the normal image index and just pick the first, or best, match that they get. So even if you have an image index with five different x86 images in it, the runtime will just pick the first one that matches, and off you go. With that behavior, of course, we cannot do this; we need to go through all the different specific images we have, and then the runtime ideally picks the best image for you. Okay. So I did some hacking back in the day: I used an unused feature in the image index to add some identification, so that it said, okay, this is a Broadwell with this NVIDIA driver, and I hacked the Docker runtime to recognize which image is the best match for the specific platform you're on. With this ugly hack, you were able to create an image index with a lot of different images for different systems, and then you configured your runtime to search for a specific tag list, if you will. That was hacky, and it wasn't intended to stay that way. I created a pull request for Docker, which of course was turned down, because it's ugly. And what's ugly about it is that you would need to implement it in every runtime, and in every scheduler, to make sure that it works. And that was of course bogus.
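To make the desired state concrete, a finer-grained index might look something like the sketch below. This is purely illustrative: the working group has not finalized a schema, and the compatibility block and its keys are invented for this example:

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {
      "digest": "sha256:...",
      "platform": { "architecture": "amd64", "os": "linux" },
      "compatibility": {
        "cpu.microarchitecture": "skylake",
        "mpi": "mpich>=4"
      }
    },
    {
      "digest": "sha256:...",
      "platform": { "architecture": "amd64", "os": "linux" },
      "compatibility": {
        "cpu.microarchitecture": "zen3",
        "mpi": "openmpi>=4"
      }
    }
  ]
}
```

With something like this, a runtime or scheduler could rank all matching entries against the host's actual CPU and MPI stack instead of stopping at the first platform match.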
So what we did, as I said, last year or the year before, was create the HPC container conformance project, to establish best practices and provide guidance on how to build images for HPC and how to implement the use of those images. The first part, which is very brief, is how we expect an image to behave. There are two types of containers: application containers, as I call them, and login containers. An application container is when you have, for instance, a binary and you set the entry point to be that application; you can create an alias that just runs some program within a container without you knowing about it. So, taking GROMACS as an example: instead of running a binary, you point to an alias and run that. The problem here is debugging: the entry point is always tricky to get rid of, or at least I need to look up the Docker command every time. The other thing is that a lot of HPC applications have multiple binaries you want to run; maybe you have a pre-processor, the application, and a post-processor. In that case you would need three different images, because the entry point is different. That's kind of ugly; for HPC it's not really usable. What we actually want is a login container: you start the container and it drops you into a bash. That way you can just augment your script and say docker run, or singularity run, or whatever, to execute the gmx command, for instance; you can just run it within the container and it just works. Another aspect that's very important, but hopefully everyone does it anyway, is that if you use a shared file system, the user within the container needs to be agnostic: you cannot rely on a certain user within the container.
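Both behaviors, dropping into a shell rather than a fixed entry point, and not baking a particular user into the image, can be sketched in a Dockerfile. The base image name here is just a placeholder; a real image would add the application stack on top:

```dockerfile
# Placeholder base; a real HPC image would install the application stack here.
FROM docker.io/library/ubuntu:22.04

# Login-style container: no application ENTRYPOINT, just a shell, so any
# of the bundled binaries (pre-processor, solver, post-processor) can be
# invoked from the caller's job script, e.g.
#   docker run --rm -it my-image gmx mdrun ...
CMD ["/bin/bash"]

# Deliberately no USER directive: the runtime injects the caller's
# UID/GID (e.g. docker run --user "$(id -u):$(id -g)"), so files on a
# shared file system stay owned by the user outside the container.
```

An application container, by contrast, would end with something like `ENTRYPOINT ["gmx"]`, which is convenient as an alias but awkward to debug and needs one image per binary.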
So you might, or rather you should, make sure that the container is able to run as nobody, because the user name will be dropped in from outside; the user ID and group ID give access to the shared file system, so the process is owned by the user outside of the container, and the container itself has no knowledge of the actual user. Okay, so that's how we expect a container to behave, and I think that's common and already understood. Last time, I think, I talked about annotations. That was an idea of us HPC guys and girls, simmering in our own soup, trying to come up with something to put forward. That was a nice exercise, but in the end we jumped on the train of the image compatibility working group at the OCI. You might ask, and hopefully a lot of you know it already: what is the OCI? It's the Open Container Initiative. It maintains the relevant specifications about containers: what an image looks like, how runtimes interact with images, distributions, and registries, the distribution specification, that is, how registries work, and security matters. It's the body that maintains those specifications, and together with others we formed a working group called image compatibility. As I discussed at the beginning, we want to extend the manifest list, the image index, so that you can pick not only by platform and architecture, but by what I described as the desired state for the image index, so that we can pick the right, optimized image for a certain application. And of course we want to express what the image was built for, what we expect from the host, what runtime we might want to use, and so on. All this cool stuff we want to incorporate. And why is this the better way? I mean, we HPC folks like to do our own thing, right?
We are kind of special and snowflakey. But this is of course the better way, because we interact with the OCI community and put our needs in front of them, so that we can take other things into account. For instance, wasm is a thing; I haven't used it, but it seems to be a thing, and it's a runtime within the container ecosystem. And of course we also have different runtimes: Singularity, Apptainer, Sarus, what have you. Picking one runtime over another is something that we are interested in, and the wasm folks are interested in it too. Say you have a Kubernetes cluster, and you have an x86 image for an application and a wasm image for the same application; maybe you want to pick one or the other depending on the conditions. They want this; we want this as well. Scheduling, registries: HPC is great, but container tech is much wider than HPC, to say the least, and we want to make sure that we align with Kubernetes and that the registries are aligned with us. The OCI working groups have a well-oiled standardization machine, so that's also a very good reason to do it there. Okay, where are we now? We are discussing use cases, and while discussing them we have already brainstormed some implementation ideas. We came up with a couple of use cases, or stand-in stakeholders, let's say. The first one, of course, since we are all building images, is the image author: if you build a container image, you want to define this compatibility specification that we want to propose. Ideally it's implemented within EasyBuild or Spack or Guix, so that you don't need to do it yourself; all this stuff can go in there, and Vanessa already wrote a little tool for that.
The other is of course the system admin, who wants to make sure that the system he's maybe procuring is able to run the container: you go through all the compatibilities and figure out what works and what doesn't. You also want to make sure that the configuration of your system is actually able to run the image. The end user just wants it to work, so we need to make sure that the system admin, the image author, and the other stakeholders come together and conclude on a certain configuration; that's what this wants to achieve. There are other use cases; I don't have time to go through all of them, but we have a list of use cases we are working through currently. Our meeting is every Monday, and if you want to join, please do. I have some links and resources; the slides are available online. If you want to get in touch, we have an HPC container Slack, an OCI Slack channel, and there's an HPC social Slack channel as well, if you want a more general overview. And if you're at ISC, make sure to join our high-performance container workshop; it's the tenth edition, so we've been doing it for 10 years now, which is pretty cool. And we have a "friends of containers" boat trip, so if you'd like to meet container guys and girls, mark your calendar for the 13th of May. Yeah, that's it, thanks. And I think I'm good on time. Awesome. Do I get a sticker if I do it three times in a row on time? You get a beer. Oh, even better. Right, we have time for one question.
Automating Spark (and Pipeline) Upgrades While "Testing" in Production
Okay, that's it. Please take a seat and we'll get started. So Holden is going to talk about automating Spark upgrades, and also lots of testing in production. That's going to be interesting. Testing in production is the best place to test when the alternative is no tests, which it often is. Okay, cool. Let me know if you can't hear me, because I'm very easily distracted, I get excited, and I might not notice that I'm not talking directly into the microphone, so please grab my attention if I screw up. So yeah, I'm Holden. My pronouns are she or her; it's tattooed on my wrist, super convenient when I wake up in the morning. I'm on the Spark PMC. You can think of this as like having tenure, except it doesn't guarantee I get paid; it just guarantees that I have work to do, so it's like the shady version of tenure. I've worked at a whole bunch of different companies, which is not super relevant, but I've seen a lot of mistakes made in production, and I have made a lot of mistakes in production, so you can learn from some of my mistakes. My employer who sent me here, Netflix, is hiring, and I would be remiss if I did not mention that; they're actually finally hiring remote people after who knows how many years. I'm a co-author of a bunch of books; some of them are related to HPC-ish stuff. I get the highest royalties on Scaling Python with Ray, so I think it's a fantastic book and everyone should buy several copies with your corporate credit card. If you don't have a corporate credit card, the internet will provide. You can follow me on social media; there are lots of pictures of my dog, if you're into that stuff, and a lot of complaining about American healthcare; if you enjoy Schadenfreude, I highly recommend it, it's great. I also do a lot of open-source live streams. If you like seeing people struggle with computers, once again, it's great: you can watch me fail. The code for today's talk, and a lot of my other code, is on my GitHub; you can check it out.
And there will be more pictures of my dog. In addition to who I am professionally: I'm trans, queer, Canadian, and in America on a green card; I make great life choices, it was a great time to move to America. And I'm part of the broader leather community. I can make that joke now because I have a green card; it's slightly more difficult for them to kick me out. This is not directly related: there are no secret Canadian code modification tools. Everything we use is open source; there's no secret Canadian GitHub alternative. If you go to github.ca, you don't find... actually, I don't know what you find. Maybe you do find something cool; I'm imagining you don't. But this is something that I like mentioning, because I think for those of us who are building big data products or machine learning things, it's super important that we look around and see: hey, who is on my team? And if you realize you're hanging out with only Canadians, that's fantastic, enjoy the poutine, but maybe it's time to get some other perspectives. And if you don't know what poutine is, you're missing out; you should try it someday. Cheese curds and gravy and French fries. Best thing ever. Okay. So what is our problem, and why do we care about automating upgrades? Fundamentally, our problem is that we have unsupported versions of our big data tools, and other data tools, running in production. And this is a problem because when things go wrong, I get woken up. I don't like getting woken up to figure out what I did five years ago; that's just not fun. The other option is that sometimes I get, sorry, not woken up, interrupted when I'm trying to focus. And this is important because we are also getting Spark 4 soon. That's super exciting, super lovely; there are going to be all kinds of new breaking API changes, and that's just going to be so much fun, right? Anyways.
And so I don't know about you, but I'm not looking forward to going back and trying to figure out all of the different things that I've built over the years and upgrading them, right? Like, I know I'm going to have to do it, but that is not the thing that excites me in my life, which leads into, like, why do we have these problems? Why do we have old things running in production? We have it because APIs change and code breaks. And then people are just like, you know what? I don't want to upgrade. Just keep running on the old version. It totally worked. It's fine. What could go wrong? The other one is like, this isn't fun, right? I don't know. Does anyone here wake up in the morning excited to upgrade their API usage? Yeah. Okay. So this is zero people, right? And the other possibility is, right, like, we could try and keep this old software alive, but we don't want to. So, how are we going to work around our problem? So we're going to use software, and then we're also going to have to deal a little bit with humans, right? We're going to do automated code updating. It's super fun. So much fun. If you took a compilers class, this is going to look very familiar. If you didn't take a compilers class, this is so cool. Abstract syntax trees are really cool. And we're also going to do automated testing and validation in prod. So the social problem is much harder. I am completely unqualified to solve it. I work with other people who are much better at talking to humans. They did a fantastic job. They made newsletters. They tried to make the project exciting. That failed. And then they tried to make the project required. That failed. And then we set deadlines. They slipped. But for sure, totally, we're definitely going to hit our new deadline for real. Okay. And now, let's go and see how else we addressed it. So the other thing that we did is, like, hey, we have this problem that humans don't want to do a thing. What about if we made it so they didn't have to do as much work?
And so that's sort of the approach that we took. We can automate a bunch of this. And the other part is, like, so we've got API changes, which we mentioned. And then the other thing that we have is testing code is a nightmare, especially code that you inherited and is called untitled_7.ipynb. I don't know what it does, let alone how to make tests for it. It's terrible. So yeah, we have a problem. We're going to fix it with computers. Google has a lot of really lovely codemod tools that I saw while I was there. Super fantastic. This encouraged some counterproductive behavior. I don't know if any of you have used Google APIs and watched them change underneath you. So this is a double-edged sword, and we should heed the warnings before we go, like, super, super all in on this. So what are we going to do? So how are we going to move on? Basically speaking, we're not going to use regular expressions. For the most part, there's going to be a few times when regular expressions are like the simple hacky way, and we're just going to do it. For Scala, we use ScalaFix. For Python, we use something called PySparkler. For SQL, we use SQL Fluff. And for Java, we looked at it, and we were like, we don't have that many Java pipelines. Get them to update their code by hand. It's fine. We know where they work. Okay. So how do we figure out what rules to make? So we could read the release notes, but they're not very complete. We could look at the MiMa changes, and so Spark has a binary compatibility checker that it uses, but, oh, dear God, there are just so, so many things in there. Or we could do my favorite approach, which is run it in production, see what breaks, and then fix it afterwards. So we went with the YOLO approach, which is just like we're going to try migrating some things, and as it fails, we'll add the rules that it turns out we needed to add. So what do these rules look like? Today, we're just going to look at Scala and SQL.
If you love Python, you can check out the GitHub repo. It's got some samples there. So in ScalaFix, we override this function called fix. We take an implicit semantic document that's really just the syntax tree, so that's the parsed version of the source code. And we specify the things that we're interested in looking in, and then we can write a recursive function which will match on this tree and generate a patch. And so here, we can see like, hey, do we see something that's calling the JSON reader? Because the JSON reader, certainly no one would use that ever, so they decided it was a great idea to change that API because who has JSON data? That was a joke, by the way. Everyone has JSON data. And so it turns out like, yeah, this actually happens a whole bunch. So we should write a rule for this. Do we see someone trying to read JSON data from an RDD? And if so, this is the patch we're going to add. Now the really cool thing here is that we're matching on a syntax tree to produce a new syntax tree. I can just say, like, swap this part of the syntax tree for this string, and then underneath the hood, ScalaFix is very smart, turns it into a syntax tree. Everything's happy. I'm quite happy. I've got a bunch of sketchy hacks, and they're all inside of a function, sorry, a library called utils. So it's great. We hide all of our mistakes inside of utils because only nerds look inside of utils.scala. Huzzah. And here you see we're recursing on the tree, and we just return nothing if we don't find any matches. SQL very similar, but the AST is a little bit fuzzier because we're using SQL Fluff, and it has to support a whole bunch of different versions of SQL, not just Spark SQL. Things are a little fuzzy. So we go ahead and we look and say, like, hey, do we see someone calling this function that we know has changed? If so, go ahead and extract out the part that we care about. And so we go ahead and we grab the third element because, God, whatever, don't worry about it.
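The match-on-a-tree, emit-a-new-tree shape that the talk describes for ScalaFix translates directly to Python, which is roughly the territory PySparkler works in. The sketch below is an illustrative toy using only the standard library's `ast` module, not PySparkler's actual rules: the matched call (`sqlContext.jsonRDD`) and the replacement (`spark.read.json`) are a plausible Spark 2-to-3 example, not taken from the repo.

```python
import ast

class JsonRddRewriter(ast.NodeTransformer):
    """Toy rule: rewrite the old sqlContext.jsonRDD(x) pattern to
    spark.read.json(x). The names are illustrative; the point is the
    match-on-tree, emit-new-subtree shape of a codemod rule."""

    def visit_Call(self, node):
        self.generic_visit(node)  # recurse into children first, like the Scala rule
        func = node.func
        if (isinstance(func, ast.Attribute) and func.attr == "jsonRDD"
                and isinstance(func.value, ast.Name)
                and func.value.id == "sqlContext"):
            # Build spark.read.json(...) as a new subtree; unparse turns it
            # back into source text, much as ScalaFix does under the hood.
            node.func = ast.Attribute(
                value=ast.Attribute(
                    value=ast.Name(id="spark", ctx=ast.Load()),
                    attr="read", ctx=ast.Load()),
                attr="json", ctx=ast.Load())
        return node

def migrate(source: str) -> str:
    tree = JsonRddRewriter().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # Python 3.9+

print(migrate("df = sqlContext.jsonRDD(lines)"))
# df = spark.read.json(lines)
```

Non-matching code falls through the `if` untouched, which is the "return nothing if we don't find any matches" behavior described above.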
Magic number, totally fine, no mistakes. And then we go ahead and we say, like, hey, what is the type of this element? If it's a keyword and it's cast, we know we're good. The types are matching. Everything's fine. Otherwise, if it's not a keyword and the type is cast, we probably need to go ahead and change this. Because the types changed. We actually need to add explicit casts into this function. And so we go ahead and we check it, and then we say, like, okay, function name, no, if it's cast, we're fine. If not, we go ahead and we produce these edits. Now unfortunately, SQL Fluff isn't quite as amazing. We can't just give it a string and have everything work. We have to produce, like, the chunk of the syntax tree. But this is still better than writing regular expressions, right? So much better. So this is totally fine. Everything's great. How do we know if it works? So there's a bunch of different things that we could do. We could try and make tests, but realistically, that's not going to happen. What we do is we do side-by-side writes and we use Iceberg's ability to stage commits. You can do the same thing with Delta Lake or LakeFS. They're all open source. I don't know how to do it with Delta Lake because I haven't used it, but I'm sure that you can do it. You might be saying, like, Holden, this sounds like you're running all of your data pipelines twice. Isn't that expensive? The answer is yes. Does it catch everything? The answer is no. But it's a hell of a lot better than just hope, right? We've got hope and a little bit of data, and together they're better than hope alone. So now we're going to come out and it crashed last night, but it's totally probably going to work today. Yeah, thank you. Thank you. We see I made a backup copy just in case it fails. What our demo does is it builds a regular Spark project, and it also makes a copy of it first. This is a Spark 2.4 project. Did I break it? Hello? Oh. Okay. We're back. Yay. Okay, cool.
So you see here we've got everyone's favorite big data example, word count. And so, okay, this is going to go ahead and it's going to add the ScalaFix plugin to our example. So we're just going to go ahead and say, like, yes, add ScalaFix. And now it's going to run ScalaFix, and it's going to run ScalaFix with our additional rules that we created. So much fun. It's probably going to work. This is where it crashed yesterday. Everyone send good vibes to my computer. Come on. Come on. How's that? Okay. You can see I subscribe to println debugging. Oh, well. And now, so it's run the first set of rules which do automated migrations, and now it's doing a second set of rules, and the second set of rules warns about things that we didn't think were important enough to create rules to automatically migrate, but we wanted developers to be aware of. And one of them is the groupByKey function changed behavior between Spark 2 and Spark 3, because who uses groupByKey? Turns out everyone; very few people depended on the specific weird behavior, though. And so it's just warning, like, hey, I see you're doing this, and I applied a regular expression and I see some, like, bad words, not bad words like the ones that I use, but bad words in the sense that, like, they're bad. Okay. And we say, like, everything's fine. It says we should review our changes, but we're not going to, just, like, real developers. We're just going to hit enter and see if it works. And now it's going to go ahead and replace Spark 2.4.8 with Spark 3.3.1, and it's going to run these two pipelines side by side and compare their output. And so we will see if the demo finishes, ooh, five minutes left. Okay. We'll probably finish inside of five minutes. If it doesn't, we'll give up on the demo. That's okay. That's okay. So here we see it's running these two pipelines side by side. You can tell because Spark loves logging. And it passed. Yay. Okay. And then this, this, okay. Hmm. Okay.
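The warn-only second pass described above, the one that "applied a regular expression" and flags changed-behavior patterns like groupByKey without rewriting them, can be sketched very crudely like this. The pattern list and message are illustrative, not the project's actual rule set.

```python
import re

# Patterns whose behavior changed between Spark 2 and 3: worth a human
# look, but not safe to rewrite automatically. Illustrative list only.
WARN_PATTERNS = {
    r"\.groupByKey\(": "groupByKey changed behavior between Spark 2 and 3",
}

def lint_warnings(source: str):
    """Return (line_number, message) pairs; never modifies the code."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, message in WARN_PATTERNS.items():
            if re.search(pattern, line):
                hits.append((lineno, message))
    return hits

code = "val counts = pairs.groupByKey()\nval ok = pairs.reduceByKey(_ + _)"
print(lint_warnings(code))
# [(1, 'groupByKey changed behavior between Spark 2 and 3')]
```

Keeping warnings separate from rewrites is the useful design choice here: the migration tool doesn't have to get everything right, it just has to tell the developer where to look.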
Well, this part didn't, and that's how you know it's a real demo, is that it failed at the final end part where it's copying the jar to a new special location, but that's, that's okay. The important part of the demo worked. So we'll call that mostly a win. And if we want, actually, yeah. Okay. I'm going to go. Oh, thank you. My lovely assistant. And so I wanted you to see that like, yes, this actually did update some code. So we go here, src/main/scala, spark demo project, WordCount.scala. And then we're going to go ahead and we're going to look at the regular version of this. Oh, God. Emacs, come on. Now is not the time. Eight megs and constantly swapping. I can make that joke as an Emacs user. Okay. So here we actually do see like it has made some small changes between the two of them. And, oh, sorry. Yeah. So here we see, for example, we have this old pattern of creating the SparkContext and it's been swapped for the new pattern of creating the SparkContext. And it's done other similar updates to the code. And the important thing is it now works. And this is fantastic. I think it's really cool. Thank you. Thank you. Hand for my assistant, please. Thank you. So I'm super stoked that the demo did not crash. Unlike last night; I was running a nightly build of the JVM and not surprisingly that didn't go well. Okay. So this is all cool, but like where does this fail? So this kind of fails when it comes to dependencies, right? Like we can only update the code that you've got. We don't rewrite byte code. We just rewrite source code. So if you're depending on something that doesn't support the new version of Spark, it's not going to work out. The good news is for us, we got to this so late that all of our dependencies were upgraded. So there's something to be said for waiting right until the software goes end of life. Don't tell the security people I said that. The other thing this doesn't work super well with is programming language changes.
In theory, that was actually the original purpose of ScalaFix. In practice, this didn't work so well for Scala 2.11 specifically because it's just so old. We had a bunch of Scala 2.11 code. So in conclusion, you should definitely check out the repo. It's here. It's spark-upgrade. It is in my personal GitHub, but a whole bunch of other people have contributed to it. They're awesome. I'm lazy. I wouldn't do all of this work myself. Thanks to my employer again for sending me here. I'm super excited that I get to hang out with a bunch of other nerds. The good news from this talk is that we haven't made a system so powerful that the Spark people don't care about making breaking API changes. The bad news is we haven't made a system that's so powerful that we can just not care about breaking API changes. The excellent news is that my dog is cute as fuck. He's here. I said that at the end of my talk just in case I'm not allowed to swear. He's really cute. His name is Professor Timbit. I miss him so, so much. Y'all are lovely, but I miss my dog. Hopefully there's time for a question, maybe. Yes? We can also do... Thank you. Thank you all. Have a couple of minutes for questions. Thank you very much for the talk. Very interesting. One general question out of curiosity. How long did it take to convert everything? Because you just showed like, I don't know how big the script was, but I can imagine just how big the repositories that you guys have. Totally. So that's a great question. It takes a really, really long time to convert everything. And we actually, internally, we have a whole bunch of different projects. One of them is a project that goes through all of the repositories because we have a whole bunch of different repositories, and it generates PRs to these projects. And that code runs daily. And it doesn't actually catch everything. So what we do is we generate the changes, and then, as I mentioned, we sort of did the YOLO run in production approach to life.
So we'll look at these changes, and especially for SQL, it'll be like, hey, we do this shadow run. Does it look like it works? And if not, we actually flag it for review rather than raising the PR so that we can go back and say, hey, do I need to add a new rule, or is this a one-off special case where we'll just have a developer deal with it? So I know that's not exactly an answer, but several hours. Okay. Thanks. Any other questions? Yeah. There's one right there. No. How many rules did you end up coming up with for this migration from two to three? And do you anticipate going from three to four? What? Do you anticipate going from three to four? Oh, yeah. Okay. So two questions. I love them. I don't remember how many rules we came up with. For Scala, it wasn't a huge number, and that's because while there are a lot of breaking API changes in Scala, our usage of the APIs in Scala is more narrow, and so I'm very thankful for that. For SQL, I think we ended up with around 20, maybe between 10 and 20. And for Python, I haven't kept track, mostly because that code has been working really well, and so some of my other teammates have been working more on the Python side, so I don't remember how many rules we made there. But they're all in the GitHub. As for do we anticipate going from Spark three to four? Yes. Probably not like the same month Spark four is released. I love Spark, and we'll make Spark four available internally, but we're not going to go ahead and start pushing users to migrate to it right away. We normally wait a little bit for things to stabilize before we start doing managed migrations just because it's better for our sanity, and there's more fixes to the code base in general. Cool. We got another question. Any more questions? Okay. Cool. Huzzah. Actually, hold on. You can keep talking because the next speaker is on the bus. Oh, okay.
So, with the next speaker on the bus, I'm super excited, and we can go ahead and we can actually look at more of the changes that it made to the code, which I sort of skimmed over because I didn't want to eat into the next person's time. So it's kind of basic, right? But we can see here, this is the side-by-side for the Scala one, and we can actually go ahead and what we're going to do is we're going to go outside of our end-to-end, and we're going to go ahead and we're going to look at some of the other SQL rules. Oh, fancy. I don't... Okay. Oh, this is so that it's better to read. Okay. Okay. Okay. Cool. Fantastic. And we're going to go ahead. I need my lovely assistant again. Thank you. Thank you so much. Hand for my new lovely assistant. So here we see one of the things that changed between Spark 2 and Spark 3 is that previously you would be able to do just an arbitrary cast to things as integers, and even if they weren't integers, it would do kind of a fuzzy conversion. But in practice, if you wanted to parse a string as an integer rather than casting string to an integer, you should use int at. And so here we see we've got something similar. We use a lot of print debugging. It's not great. But what we do here is we return this lint result, and what it's just doing is it's taking this expression and swapping it to an int when we see a cast with a data type of int. So much fun. There's a lot more rules, but I didn't do a git pull on this because the demo barely worked, and I was just like, let's not tempt fate and do a git pull because I hadn't tested the end-to-end demo. But this is kind of cool. We've got similar updates to our format string. Super fun. Oh, right. And then char versus string types also got updated. Super fun there as well. And where was another one? I want to find it. Sorry. Then we've got, there's a rule down at the bottom. Oh, no. Okay. I guess the rule that I was looking for isn't in this version of the code. Let's go back to ScalaFix.
So the other cool thing about this, sorry, doot, doot, doot. So one of the really cool things about ScalaFix, just while we're waiting, is that you can test your rules. And so, for example, like, I wrote these accumulators, and this is the old bad style of writing accumulators, and I was like, okay, let's make sure that it updates to the new good style of accumulators. And this is super convenient because I don't have to manually construct syntax trees. ScalaFix just has built-in functionality for this. And we see here what this rule does is it actually throws out a bunch of situations. And it's actually going to generate a bunch of warning messages. But there's situations where, like, this doesn't directly translate to the new API easily. So we just told users, like, hey, you need to make a change here. But we'll get it to compile, and then it'll pass the test, and it'll yell at you because you're trying to access a null. It's not perfect. Like, this is very much like a, how would I say this? This is a very mediocre rule. But in practice, we didn't find all that many people were creating accumulators with fixed values to start at. But the one that we did see was people creating accumulators that explicitly started at zero long, and so that we just converted to a long accumulator. And then the other one that I saw here was I also added some tests to make sure that, like, I had a rule which was applying itself too eagerly. So I also created a test which was just, like, make sure that this rule doesn't do anything if it's not, like, encountering the thing that I wanted it to do. So we can also make essentially negative tests for AST transformations. That's super convenient. How much time do I need to kill? How much time do I need to kill? Do we know how long the bus is going to be? Okay, cool. Okay. So we see another one, the group by key thing that I told you about. We actually had two different situations. 
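The before/after rule tests, including the "negative" no-op tests, translate to any source-rewrite tooling. Here is a minimal stdlib sketch of such a harness, with a deliberately toy string-based rule standing in for a real AST rule; the `sc.accumulator(0L)` to `sc.longAccumulator()` rewrite mirrors the accumulator example from the talk, but the helper names are mine.

```python
# A tiny before/after test harness for source-rewrite rules, in the
# spirit of ScalaFix's built-in rule tests. `rewrite` is any str -> str
# rule function.
def check_rewrite(rewrite, before: str, after: str) -> None:
    got = rewrite(before)
    assert got == after, f"expected {after!r}, got {got!r}"

def check_noop(rewrite, source: str) -> None:
    # The "negative test": a rule must leave non-matching code alone,
    # i.e. it must not apply itself too eagerly.
    assert rewrite(source) == source, f"rule unexpectedly changed {source!r}"

# Toy rule: old zero-initialised long accumulators -> longAccumulator.
# (A real rule would match a syntax tree, not do a string replace.)
def accumulator_rule(source: str) -> str:
    return source.replace("sc.accumulator(0L)", "sc.longAccumulator()")

check_rewrite(accumulator_rule,
              "val acc = sc.accumulator(0L)",
              "val acc = sc.longAccumulator()")
check_noop(accumulator_rule, "val acc = sc.longAccumulator()")
print("rule tests passed")
```

The no-op check is cheap to write and catches the most embarrassing class of codemod bug: a rule that rewrites code it shouldn't touch.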
These are ones that we could automatically rewrite, and so that's what we do here. And so here we see, like, the situation where someone was explicitly using the column name in a way which we could detect. But then we also have the situation where, like, we weren't super sure, and so these ones we did with a warning. And so we said, like, hey, this should generate a warning because we don't know for sure what's going on here. So we want to generate the warning, but in the other situations where we could do the full rewrite, we made sure that the full rewrite was able to be applied, which I think is kind of cool from sort of, like, a point of view of you don't have to get everything right, and you can, like, add these warnings in places where, like, it's worth it to let people know their code might not work, but, you know, it's not 100% required. Um... Choo-choo-choo. Cool. Let's see here. Ah... Just a quick interruption. The next speaker is going to be late. He texted us that he's still on the bus, so we're letting Holden entertain you. Oh, I got an idea. I got an idea. Hi. I'm just a speaker. What does that mean? Where am I? Oh. Yeah, I got a... I think I got another minute of something fun that I want to talk about if it's okay. So the other thing that we sort of, like, glossed over was the, like, side-by-side comparison in pipeline runs, right? And so that's totally really... I think it's really neat, right? Like, because it's super important because people don't write tests at the end of the day, and that makes me sad. But we've got this pipeline comparison project, and... Oh, God. I'm just remembering how ugly this code is. Please don't judge me. This code was originally written at a conference and then made it into production, as you can tell by the fact that it's called domagic.py. Very sorry. Very sorry. So yeah, so this domagic.py does a bunch of really interesting and terrible things.
And I was mentioning how we mostly don't do regular expressions, but we do a little bit. And one of the things is when you've got Spark 2 versus Spark 3 and you've got Scala or Java code, you're going to need different jars. Whereas in Python and SQL, like, we could maybe just be using the same files, or we can use the same files with a little bit of a transformation. But so for the jars, we use a really nasty, really terrible regular expression to just kind of extract what we think the version 3 version of our jar is going to be. And then this is convenient because we can run it side by side. And then so we've got sort of different options. Here we've got it so that you can specify the input table. But I actually did a hack that I'm super proud of because I'm a bad person. Where we made this plug-in, Iceberg Spark WAP plug-in, where what we do is, oh god, we use the Iceberg listener and we output this string any time something happens to the logs. And so if anyone's touching a table while their job is running, we know what tables it wrote, so we can go back and run our comparison on these two tables. We actually have some special code that goes ahead and looks at these tables before doing the comparison and says, if the user updated more than 1,000 partitions worth of data, just don't bother and tell the user they're responsible for validating their data. And if they're touching more than 1,000 tables, sorry, 1,000 partitions in a table, they should really have some reliable tests. For the people who are touching five or 100, like I get it, untitled_7, it's great in production. When you're updating that much data, maybe it's not time to depend on Holden's sketchy domagic.py. So I think this is really cool. And we're going to go back to our friend Pipeline Compare and down to our friend Table Compare. And so Table Compare is really basic. And there's actually an updated version internally that I need to bring out that does better tolerances.
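A crude version of that "really nasty regular expression" jar guess might look like the sketch below. The version strings and naming convention are examples, not the project's actual regex, and a real migration tool would check that the guessed artifact actually exists before using it.

```python
import re

def guess_spark3_jar(jar_name: str) -> str:
    """Guess the Spark-3 build of a jar from its Spark-2 name.

    Deliberately crude, in the spirit of the talk: rewrite the Scala
    binary version and the Spark version suffix. The target versions
    here (2.12, 3.3.1) are illustrative assumptions.
    """
    out = re.sub(r"_2\.11", "_2.12", jar_name)               # Scala binary version
    out = re.sub(r"spark2\.\d+(\.\d+)?", "spark3.3.1", out)  # Spark version
    return out

print(guess_spark3_jar("myjob_2.11-1.0-spark2.4.8.jar"))
# myjob_2.12-1.0-spark3.3.1.jar
```

This is exactly the "simple hacky way" the talk admits to: fine for a filename, not something you would want pointed at source code.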
But we just go ahead and we compare these two tables with a sort of traditional join, which is part of why we had this limit on the number of partitions. Because when we didn't have this limit on the number of partitions and we tried to do these comparisons with some of the pipelines that ran on all of the user data, everyone was very sad. And we took down production. I hated that part. Yeah, anyways, there was an incident and I got woken up when we did not have that. And so, yeah, all kinds of fun. But you see here the thing, the magic here is the snapshot ID, because the other thing that we output in our listener is what snapshot IDs we're writing to. Super convenient. And Iceberg allows us to read from snapshots even if they never got committed. There's a new thing in the new version of Iceberg that allows for branching that would be even better because then we would have named things rather than random git hashes. But we're not running that and it's also not supported in the really old versions of Spark. And because we want to do the migrations from the really old to the really new, I went with sort of the lowest common denominator. And that's kind of how we ended up there. Okay, that's all that I had that I thought was interesting. And I think there was someone else who had something that was interesting. Do you want to come and do your interesting bit? Thanks to Holden for filling in. Does anyone have any questions? Does anyone have any questions? That's that? Yeah, all right. First of all, thank you for the talk. I have a quick question in the summary of your talk. You also mentioned that if time permits, you might have an overview of the changes coming in Spark 4. Do you have this overview? Yeah, so if you're interested in the changes coming in Spark 4, the place to look is the Spark Jira. And there's actually like this meta tracking Jira that's in there. And you can see sort of like the things that we're planning on coming.
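The table-compare-with-tolerance idea, plus the partition-count guard from earlier, can be sketched with the standard library like this. In the real pipeline the two sides are Spark reads of two Iceberg snapshot IDs; here rows are plain dicts and the "join" is a dict lookup, so all names and the guard threshold are illustrative.

```python
import math

MAX_PARTITIONS = 1000  # beyond this, tell the owner to write real tests

def compare_tables(old_rows, new_rows, key, tol=1e-6, partitions_touched=0):
    """Join two table snapshots on `key` and check values match within `tol`.

    A stdlib sketch of the table-compare idea: numeric columns are
    compared with a tolerance, everything else must match exactly.
    """
    if partitions_touched > MAX_PARTITIONS:
        return "skipped: too much data, owner must validate"
    old_by_key = {r[key]: r for r in old_rows}
    for row in new_rows:
        base = old_by_key.get(row[key])
        if base is None:
            return f"row {row[key]!r} only in new output"
        for col, val in row.items():
            ref = base[col]
            if isinstance(val, float):
                if not math.isclose(val, ref, rel_tol=tol, abs_tol=tol):
                    return f"mismatch at {row[key]!r}.{col}"
            elif val != ref:
                return f"mismatch at {row[key]!r}.{col}"
    return "ok"

old = [{"word": "spark", "count": 3, "score": 0.5}]
new = [{"word": "spark", "count": 3, "score": 0.5000001}]
print(compare_tables(old, new, key="word", tol=1e-3))
# ok
```

The tolerance matters because a Spark version bump can legitimately change floating-point results in the last few digits; an exact comparison would flag every pipeline.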
Historically, I would say without naming names, there's a particular vendor that loves to show up at the last minute with the giant piles of code and just kind of yolo it as a nice surprise for everyone. So this Jira will give you a good idea of what's coming. But my guess is there will be a surprise that we find out about in June, just based on history. I could be wrong. Maybe everything is actually planned this time. That would be a pleasant surprise. But there's a non-zero chance that there will be something new in June too. Cool. Okay. Take it away, my friend. Or no, you don't. Oh, okay. You've got a USB key. I think my employer would be mad if I let you plug the USB key into my work laptop. I enjoy being employed. No, no. I just had more time to kill.
Semantically-driven data management solution for I/O intensive HPC workflows
So people can hurry and sit down for the next speaker please. Okay, thanks. For our next talk, we have Metin talking about a semantically driven data management solution for I/O intensive HPC workflows. Thank you. My name is Metin Chakrachalov. I work in the Forecasts and Services Department at the European Centre for Medium-Range Weather Forecasts, ECMWF. I will talk about the semantically driven data management solution for I/O intensive HPC workflows; the work was funded by the EuroHPC project called IOC. It is work done by many people. So, a little bit of background on ECMWF, the European Centre for Medium-Range Weather Forecasts. It was established in 1975 by 23 member states and 12 cooperating states as an intergovernmental organization. There are three duty stations with more than 450 people: Reading in Great Britain, Bonn in Germany, and Bologna in Italy. So ECMWF is both a research institution and a 24/7 operational service, producing numerical weather predictions and other data for member states. There are two big projects in which ECMWF is a key player. One is Copernicus. It is the Earth observation component of the EU's space programme. We provide climate change information, atmospheric composition information and also flooding and fire danger information. The other big EU initiative is the Destination Earth project, which is prototyping digital twins of the Earth. So ECMWF's production workflow looks like this. Per day, 200 million observations are collected, acquired and fed into the Earth system model. Those observations and the output from the weather predictions are archived. These data are also used to generate products, 300 terabytes of data per day, of which 65 terabytes of data per day are disseminated as products to around 350 destinations, to member states and other customers. So in the information system, the data is central.
It provides access to data, models and workflows, and the data management is very critical for the operations. We need transformation of data into information, insights and decisions. So, semantically driven data management, we have been doing this for a long time. It means managing data based on its meaningful logical description, rather than just storing data. We also abstract the backend technologies, and we abstract where and how the data is stored from the users. So we try to avoid nested folder structures or UIDs, such as this /home/user/projects/ecmwf and blah, blah, blah, or some cryptic UIDs that don't make much sense to the user. Instead, we want to use meaningful, scientifically meaningful metadata to describe the data. For example, in this case, the project is ECMWF, experiment number 42, the date is 2024, parameter, pressure and level. So for that, as part of the IOC project, we developed DAISY, the data access and storage interface. We index and identify data using its meaningful description, and that allows us to implement optimized algorithms to archive and retrieve data. And this is based on the ECMWF object store called FDB, which is also free and open source on GitHub. We also abstract the storage technologies behind POSIX. We support POSIX, DAOS, Motr, and Ceph. And we provide interfaces and tools as well as C/C++ and Python APIs. So the schema, the main complexity is the schema which describes the database. And it is a collection of rules and each rule is a tree of attributes. In this example, I have a schema file and inside that I have two rules and each rule consists of multiple parameters. For example, here project, experiment, date, parameter, level would translate to a key: project is ECMWF, experiment is 42, and so on. The other rule is event, city, year, and this could translate into event is FOSDEM, city is Brussels and year is 2024.
So the rules are blueprints of the database, how to construct the database. And they have three levels and each level can have multiple attributes. To make a rule, it has to be unique and complete to describe the data so that we can identify data from other data. And we also need to think about locality: where different data is related, we would like to store it together. So we can think of the first level as the directory, the second level as the file level, and the third level as the indexes in the file. So the locality increases as we go deeper in the levels. We can set up Daisy with a YAML configuration file. We can point to the schema file, and we can set the backend storage technology; 'file' in this case is the reference backend. We can also have different paths to the databases. We can have multiple databases, called roots, and we can set different behaviors on individual roots. So aside from data, we also have keys and queries. A key refers to a single object, while a query can refer to any number of objects. In this case, the key identifies a single object; on the right, I have level as a list of values, 0, 1, and 3. So it means I make a query for three different data objects where the difference is the levels: 0, 1, and 3. So we provide multiple interfaces, command line tools, C, C++, and Python APIs. But here I present an example for the Python API because it's simplest. So for storing data by key, I need a key and data. The data can be anything; in this case, I just put a string here, but it can be a PNG file or PDF file or any other type of data. Then I make a key. User is metin, project is IOC, date is 2023 and city is Bonn. And I pass this key and data to Daisy and Daisy would archive it. Then the other main feature is list, searching for data in the database.
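To make the three-level rule idea concrete, here is a toy sketch. This is not the actual DAISY/FDB schema language: a rule is modelled as three lists of attribute names, and a complete key (a dict of attribute values) is laid out as directory, file, and index entry, mirroring the locality description above. All names are illustrative.

```python
# Toy three-level rule: each level lists attribute names, and a key is
# laid out as directory / file / index-within-file. Illustrative only,
# not the actual DAISY schema language.
RULE = [
    ["project", "experiment"],   # level 1 -> directory
    ["date", "parameter"],       # level 2 -> file
    ["level"],                   # level 3 -> index entry in the file
]

def locate(key: dict):
    """Map a complete key onto (directory, file, index) per the rule."""
    def part(attrs):
        # Every attribute of the level must be present: rules have to be
        # unique and complete so one key identifies exactly one object.
        return ":".join(f"{a}={key[a]}" for a in attrs)
    return tuple(part(level) for level in RULE)

print(locate({"project": "ECMWF", "experiment": 42,
              "date": "2024-02-03", "parameter": "t", "level": 0}))
# ('project=ECMWF:experiment=42', 'date=2024-02-03:parameter=t', 'level=0')
```

Keys that share the outer levels land in the same directory and file, which is exactly the locality property the schema design is after: related fields sit together on disk.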
I need to make a query; in this case, user is metin, project is IOC. And in this case, I just want two data objects for two different dates. I pass this query to DAISY and it returns me the keys that I need for retrieving. In the next example, I have the retrieve: getting data by a key. I make a key, user is metin, project is IOC and so on, and I pass this key to DAISY and retrieve the data. So it's very simple. To sum it up: we describe data semantically instead of with UIDs and nested directories, and we index and identify data by its meaningful semantic information. This also allows us fast and efficient retrieve, search and archive algorithms. Also, we abstract where and how data is stored from the user. We make blueprints called rules, we make keys to attach to the data and pass them to DAISY, and DAISY will store and manage the data using multiple different storage technologies. More about DAISY: it is free and open source, published on GitHub. We have example C API and Python API usage, and we also provide binary packages on GitHub for C and C++ as well as Python. Python packages for Linux, and RPM and deb packages, are available. We also have documentation on Read the Docs. And that's all, thank you for your attention. APPLAUSE Thank you. Do you have any questions for Metin? No? Oh, there is one. Thank you. Hey, nice presentation. I was wondering if you can specify the type of the values in your schema. You mean integer? Yeah, like that, to facilitate the queries. Yes, attributes can have types; you can set integer, date, string, they can have multiple types. Okay, thanks. Hi, thank you for your talk. I was interested in indexes, because you mentioned that you index and identify data. Is it some standard type of index, or do you have a format of your own, optimized for this tree type of data? Yes, the indexing is based on the rules. So the rules here have tree structures with three levels.
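The archive / list / retrieve flow just described can be sketched with an in-memory stand-in; the real DAISY Python bindings will differ in names and detail, so treat every identifier here as an assumption:

```python
# In-memory stand-in for the archive / list / retrieve flow: keys are
# attribute dicts, queries may hold lists of values per attribute.

class MiniDaisy:
    def __init__(self):
        self._store = {}

    def archive(self, key, data):
        # frozenset of items gives us a hashable, order-independent key.
        self._store[frozenset(key.items())] = data

    def list(self, query):
        # A key matches if each queried attribute is among the queried values.
        def matches(stored):
            k = dict(stored)
            for attr, vals in query.items():
                vals = vals if isinstance(vals, list) else [vals]
                if k.get(attr) not in vals:
                    return False
            return True
        return [dict(k) for k in self._store if matches(k)]

    def retrieve(self, key):
        return self._store[frozenset(key.items())]

db = MiniDaisy()
db.archive({"user": "metin", "project": "IOC", "date": "2023"}, "hello FOSDEM")
keys = db.list({"user": "metin", "date": ["2023", "2024"]})
assert db.retrieve(keys[0]) == "hello FOSDEM"
```

The real system, of course, backs this with the FDB object store and the rule-based indexing rather than a Python dict.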
So it has to have three levels, which translate to a directory, a file, and data. And we have an in-house indexing mechanism that translates this tree into identifying data. So is it something like GIN indexes? I couldn't hear. Is it something like GIN indexes? I'm not sure if I understand. GIN index? Yeah, I'm not the right person, I think, because we use FDB, which has been developed for a long time, and it's a big library. I cannot answer the question because I haven't worked at that level. No problem, thank you. Thank you for the talk. I would like to know where these keys are stored, because you need to query this kind of index; where are they stored for the user? Yeah, so the indexes are stored separately but together with the data. They would go, for example, in this case, inside the roots. So each root would be a different database, and if you looked inside root one, output one, for example, here, you would have the index keys inside as well as the data, so together. Okay, so is it a file system or a database stored inside these two directories? Yeah, for POSIX it would be a directory, but for object storage it would be an object container or something like that. Okay, so the way the index is stored depends on the type of the storage you describe here. Yes, but we also have two different abstractions. One is the indexing, and we call it the catalog, and we have the bulk data. So we can have the indexing catalog inside a POSIX directory and the bulk data on an object store. Okay, thank you. Any more questions? Is the next speaker in the room? Okay, thank you again, Metin. Thank you.
How the Kubernetes Community is Improving Kubernetes for HPC/AI/ML Workloads
about Kubernetes and HPC and AI. Hello everyone. Yeah, so today I'm going to be talking to you about what the Kubernetes community is doing to improve batch workloads in general. So just a brief background about who I am. I work as a senior software engineer at Red Hat. I'm a big upstream developer in Kubernetes and OpenShift. At Red Hat I focus mostly on CRI-O and the kubelet now, but I also dabble elsewhere: I'm a reviewer in the Job area in Kubernetes and in a project I'll talk about called JobSet. I was a maintainer of a batch project called Armada, which was for running batch jobs across multiple Kubernetes clusters. And I actually started my Kubernetes experience by trying to build a platform that could run jobs on Slurm and Kubernetes. I kind of liked the Kubernetes aspect a little bit better in some ways, but the Slurm scheduler was a lot easier to use than the Kubernetes one. I saw a gap in Kubernetes and I've been trying to help contribute since. So just to give a little outline: I'm going to give a historical perspective on Kubernetes, how it developed, and why we're in the area that we are now. I will not really be talking about how best to get the most performance out of your cloud vendor, or what else you need to do to tune Kubernetes; I'm going to focus on the APIs that users can use in Kubernetes. So this is my couple of slides on what Kubernetes is. It's pretty complicated. But generally I've noticed that people start using Kubernetes as a library; I like to think of it as sort of a React, but for distributed systems. You're using all the Kubernetes client libraries, you're using the APIs, you're composing custom resources on top of objects and exposing them to your customers. That's where I've seen a lot of companies start using Kubernetes, especially when you're trying to build a quote-unquote Kubernetes-native platform.
So what does that mean really for most people? Well, generally I think the benefit for this community is that you have a declarative API for workloads. If you're running on the cloud, failures happen; it sucks, but they do. And a lot of times your users don't want to be told, oh yeah, you had a network failure so your job failed, sorry, restart it. And a lot of users are pesky and they ask more and more of you as time goes on. We all know this. Also, for better or for worse, everything starts with YAML; take that as you will. What that really means is that we have a big focus in Kubernetes on what your API is, on backwards compatibility (most of the time), and on how to make it useful for people. So a Kubernetes cluster has not too many components, but I want to focus on a couple of them for this talk. You have the API server, which everyone talks to: the CLI, whatever. etcd is essentially your database for storing all your objects in Kubernetes. The scheduler is an interesting component, because it is, I think, the hardest thing for the HPC community to grasp: the Kubernetes scheduler, unlike Slurm, schedules to the node. You get a lot more fine-grained control in the Slurm scheduler than you would in Kubernetes, because Slurm can actually target, I don't know, sockets and everything on a node. It's much more fine-grained than Kubernetes. So I like to think of the Kubernetes scheduler as kind of a heat-seeking missile for a node: you give it hints and it targets one, and then your pod is on a node. So what is actually on a node? Well, there's this thing called the kubelet, which talks to the container runtime, and I will talk about that on the next slide. The point of the kubelet is to actually start a pod, but I want to walk through what actually happens with a pod.
Step one, a user creates a pod, that's a workload, and it goes to the API server; the API server stores it in etcd, and then the scheduler says, oh, you don't have a node specified on your pod, okay, let me do a little scheduling loop, finding a node. And once your pod is located on a node, the kubelet will pick it up and actually start running it, and if you're running a batch job, it will run to completion. If you're running a microservice, it's just there and it keeps running. The kubelet talks to a container runtime and the host; it also handles a lot of stuff with volumes. It does a lot. So now you've seen the pod lifecycle, and I'll be honest, my first time using Kubernetes I was like: deployments, stateful sets, this is so complicated, I'm just going to use a pod. Unfortunately, I learned pretty early on that you lose a lot of the benefits of Kubernetes if you're using pods directly. Pods are stateless, so if your node goes down, you essentially lose your pod. And a lot of times if your cluster is overworked, well, not overworked, but your pods will get deleted after a while. You also don't get self-healing. That is an important part of Kubernetes, even, I think, in the batch community. It just means that when you define an API, things are going to keep running; if you have, like, a job, it is going to keep retrying, as one example. The more pragmatic thing is that the pod API has to fit the needs of both microservices and the batch area, and you cannot really change it for one area and not the other. So generally, I don't recommend people use pods directly. Next, some of the existing projects people like. YuniKorn is more popular in the Spark community; it's trying to bring the YARN scheduler to Kubernetes by adding a separate scheduler.
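The lifecycle above starts from a manifest with no node assignment; a minimal Pod manifest, written here as a Python dict for concreteness (the image and command are the standard pi-computation example, but treat the specifics as illustrative):

```python
# A minimal Pod manifest. Note there is no spec.nodeName: filling it in
# is the scheduler's job, after which the kubelet on that node pulls the
# image and starts the container.

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "pi"},
    "spec": {
        "restartPolicy": "Never",  # batch-style: run once to completion
        "containers": [{
            "name": "pi",
            "image": "perl:5.34.0",
            "command": ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"],
        }],
    },
}

assert "nodeName" not in pod["spec"]  # binding happens at scheduling time
```

This is also why a bare pod has none of the self-healing discussed above: nothing recreates it if the node goes away.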
And then MCAD is a project from IBM for deploying arbitrary objects to multiple Kubernetes clusters and adding its own queuing. So now, what does this mean when you have all these projects? Well, you have chaos. Take Kubeflow; I'll pick on Kubeflow a little bit. I only list two machine learning frameworks here, but last I checked, there are like six different APIs for representing a machine learning job in Kubeflow. That means there are a lot of APIs for running a batch job from Kubeflow. They are trying to consolidate most of them into a single one called the Training Operator. Still, you have a new API. You have two versions of running MPI jobs on Kubeflow. Now, I actually don't know if that MPI Operator fits all the use cases people have with MPI, but it is, as far as I know, the only public open-source way of running MPI on Kubernetes. And you also have Armada and Volcano, which have their own representations of jobs. Well, this is honestly pretty chaotic. It's not really fun as a developer to be asked, if people want to bring a new API, can you support it? And you say no, because we don't really want to install all of Kubeflow just so you could run a PyTorch job or whatever, or install the controller. It gets complicated. So this group was founded; it's a working group in the Kubernetes community. Batch workloads run the full gamut on Kubernetes, from the scheduling all the way down to the node to some representation of the batch APIs, so a working group was formed to coordinate; not that it had to be, but it's a way to focus multiple people on a single area and try to improve it. And some of the goals of this group are: let's make the batch API useful again; let's allow people to actually use these APIs without having to install something like Kubeflow or Volcano to run a batch job.
And the other one I'll talk about is queuing. Carlos over there could probably tell you all about DRA, which is another exciting area that's happening; that's about getting more use out of the GPUs, and it is in scope for this group, but it is actually mostly led by NVIDIA and Intel right now. I'll be focusing on the two bullet points for the rest of this talk. So what is the Job API? Well, this is generally a pretty simple way of representing a batch job, and I think that's one of its downsides: it was originally really focused on simple use cases. I have an example here of computing pi, and I'll just walk through the API so you'll see it repeated again and again. Generally, Kubernetes has this concept where you define a template and you define replicas. In the Job API that's called parallelism, and that just means: how many pods do you want running in parallel? Completions is: how many of these do you want to complete before you consider my job successful? Active deadline is just how long the job is allowed to run, and then backoff limit is the retry count. It's how the job gets some self-healing, if you will, because it says: if the job fails for any reason, I want to retry, in this case up to the backoff limit, and the default is six. And one of the first features this group added is the pod failure policy. It's essentially a way to short-circuit the retry limit, because let's say your user has a segmentation fault and they're using a GPU. You probably don't want them to be holding that resource when other people could be using it, and you probably don't want to keep retrying. And there's no cap on what people set these retries to, so someone could say 10,000 retries and be on that node forever. So the pod failure policy was a way to short-circuit that.
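The fields just described are real batch/v1 Job fields; here is the pi example sketched as a Python dict, with a pod failure policy that fails the job immediately on a segfault (exit code 139 is 128+SIGSEGV, chosen here as an illustrative non-retriable failure):

```python
# batch/v1 Job: parallelism, completions, activeDeadlineSeconds,
# backoffLimit, and a podFailurePolicy that short-circuits retries.

job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "pi"},
    "spec": {
        "parallelism": 4,             # pods running at once
        "completions": 4,             # successful pods needed overall
        "activeDeadlineSeconds": 600, # how long the job may run
        "backoffLimit": 6,            # retry budget (6 is the default)
        "podFailurePolicy": {
            "rules": [{
                "action": "FailJob",  # don't retry on a segfault
                "onExitCodes": {"containerName": "pi",
                                "operator": "In",
                                "values": [139]},
            }],
        },
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "pi",
                    "image": "perl:5.34.0",
                    "command": ["perl", "-Mbignum=bpi", "-wle",
                                "print bpi(2000)"],
                }],
            },
        },
    },
}
```

Without the podFailurePolicy, the same segfault would simply burn through all six retries.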
Now, how do we actually make the Job API useful for workloads that need to talk to one another, which covers most of the interesting use cases in the HPC world? Well, this is the idea of an indexed job: can we provide a static name and an environment variable so that applications can refer to a particular replica of a pod? So you can say: my replica zero, my index zero, is always going to be this pod, and then you can talk to it. You could think of this as a common pattern in, say, MPI, where you have a rank-zero pod and a series of workers, and you want to make sure you always have a rank zero. That's the idea of an indexed job. Now, I wish I had a slide here, but when you couple an indexed job and a headless service, in Kubernetes speak, you're actually able to get all these pods to talk to one another. The last area is that if you're trying to build queuing in Kubernetes, you run into this problem with the pod lifecycle; I like to joke that the way I envision this lifecycle is kind of like a racehorse. Once you create the pod, it's running and it's never going to stop. And the reason this can take down a cluster is that if you have a million of these things running, it's just an infinite loop and it's going to drain all the resources of your cluster. But you still need to know how many objects are being created; you just do not want creating the object to start this loop. So this was the idea of suspend in the Kubernetes community: adding suspend to the Job API is what makes queuing possible, and Kueue supports a wide range of jobs via this use of suspend.
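The indexed-job-plus-headless-service pairing can be sketched like this (names such as "workers" are illustrative): with completionMode Indexed, each pod gets its index in the JOB_COMPLETION_INDEX environment variable and a stable hostname of the form jobname-index, and a headless Service matching spec.subdomain gives each pod a DNS record, so rank zero is addressable as workers-0.workers. The suspend field is also shown, since that is the hook a queue like Kueue flips.

```python
# Indexed Job + headless Service, as Python dicts for concreteness.

indexed_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "workers"},
    "spec": {
        "suspend": True,              # created suspended; a queue unsuspends it
        "completionMode": "Indexed",  # pods get stable indices 0..3
        "completions": 4,
        "parallelism": 4,
        "template": {
            "spec": {
                "subdomain": "workers",  # must match the headless Service name
                "restartPolicy": "Never",
                "containers": [{
                    "name": "w",
                    "image": "busybox",
                    "command": ["sh", "-c", "echo rank $JOB_COMPLETION_INDEX"],
                }],
            },
        },
    },
}

headless_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "workers"},
    "spec": {
        "clusterIP": "None",  # headless: per-pod DNS records, no load balancing
        "selector": {"job-name": "workers"},
    },
}
```

With both applied, workers-0 through workers-3 can resolve and talk to each other directly, which is exactly the MPI-style pattern described above.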
So Kueue supports all the Kubeflow operators, a project I'll talk about next called JobSet, the Job API itself, and another project called Flux. And so this is a nice thing that Kueue provides. So what do you do about representing a more complicated job? Well, with the Job API you pretty much have to have the same pod definition for all of your workload, and that may not fit a lot of use cases. So JobSet was created as a way to say: can we create a representation of a single job that could have different pod templates, and then also have its own failure and success policies? When you run these jobs at large scale, you're going to see failures, and you may want to restart some jobs, or maybe you don't want to restart; I'll talk about one interesting use case of success policies. And one of our goals is that Kubernetes is kind of an implementation detail. Most people don't want to know about it; if you're a researcher, you just want to know: I'm running this. So we want to streamline the creation of things like indexed jobs and headless services, because we know people want to communicate with their pods. At a high level, the API for a JobSet looks very close to a Job. Instead of replicating pods, we are replicating jobs. I didn't have it specified here, but there's a replicas field under the spec, which says how many replicas of my replicated job I want to create. And inside of a replicated job is a job template. This job is a PyTorch job: it creates an indexed job with a headless service, and then it creates a single job that has four pods. I'll show in a little demo why this is useful.
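The shape described, replicated jobs each wrapping an ordinary Job template, can be sketched as follows. The API group and version here reflect my understanding of the JobSet project at the time of writing (jobset.x-k8s.io, an alpha API), so treat the exact strings as assumptions that may change:

```python
# A JobSet replicates Jobs the way a Job replicates pods: one replicated
# job named "workers", one replica, whose template is an Indexed Job
# with four pods (the PyTorch image name is illustrative).

jobset = {
    "apiVersion": "jobset.x-k8s.io/v1alpha2",  # alpha API, may change
    "kind": "JobSet",
    "metadata": {"name": "pytorch"},
    "spec": {
        "replicatedJobs": [{
            "name": "workers",
            "replicas": 1,       # how many copies of this Job to create
            "template": {        # an ordinary batch/v1 Job template
                "spec": {
                    "completionMode": "Indexed",
                    "completions": 4,
                    "parallelism": 4,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [{"name": "trainer",
                                            "image": "pytorch-example"}],
                        },
                    },
                },
            },
        }],
    },
}
```

The controller then creates the headless service and indexed job for you, which is the streamlining the talk mentions.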
And the other area, one that both Volcano and Kubeflow have implemented in their projects, and one of the main reasons they created those projects, is: what do you do if you have this leader/worker paradigm, where your leader, let's say, is a Redis database, or whatever, a message queue, and your workers are talking to it? Well, I want my workers just to finish. I want to say: hey, once my workers are done, my job is successful, and I don't really care about the progress of the leader. And so this is one of the use cases we had in mind with this project; there are a lot of them, but this was one. Can we use something called a success policy to say: I only really care about one set of jobs completing; the rest are fodder, essentially, or not fodder, but they play an important role until the workers are done, and then they're also taken down. So, how am I doing on time? Okay, so I'll walk through the demo a little bit. Generally, with JobSet, you have a controller, the JobSet controller manager. You can check that it's running, great. In this demo, I tried to take the PyTorch job, show the template, and then try to run it as just a normal job to show you what happens. You can't communicate with the service, because if you create this job normally, there is no service to communicate with, and it just automatically fails. So then, what do you do? Well, you can use JobSet. Woo-hoo. I already created the JobSet, and you can see with kubectl logs that the JobSet is running; it's doing training, using PyTorch. And also, it created a headless service called pytorch. This allows you to hide all this stuff from the user. And then, I think in the next part of the demo, I'll show the success policy. Come on. Oh, well.
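The leader/worker success policy described above can be sketched like this (again against the JobSet alpha API as I understand it, with the job templates elided to empty dicts since their contents don't matter for the policy):

```python
# Leader/worker sketch: the JobSet is marked successful once the
# "workers" replicated job completes; the leader (e.g. a Redis server)
# is ignored by the policy and torn down when the set finishes.

leader_worker = {
    "apiVersion": "jobset.x-k8s.io/v1alpha2",  # alpha API, may change
    "kind": "JobSet",
    "metadata": {"name": "leader-worker"},
    "spec": {
        "successPolicy": {
            "operator": "All",                   # all target jobs must finish
            "targetReplicatedJobs": ["workers"], # leader's status is ignored
        },
        "replicatedJobs": [
            {"name": "leader",  "replicas": 1, "template": {}},  # template elided
            {"name": "workers", "replicas": 1, "template": {}},  # template elided
        ],
    },
}
```

So the leader plays its supporting role for as long as the workers need it, and success is judged on the workers alone.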
So, I guess it will go on for a little bit, but does anyone have any questions? Any questions? There's a couple up there. Wait, wait, wait. Who was first? Hi. Yeah, I'm very much from the Slurm bioinformatics Snakemake/Nextflow world. We have an IT department, and they have a Kubernetes cluster, so this is a very interesting talk for me. But are you thinking about the kind of workflow managers that typical researchers like that use? Because I was just in a high-energy physics session; they also use Snakemake, and they have schedulers, of course, but somehow that also has to interface. Do you have any comments on that? So, generally, we don't want to add another workflow engine; there are too many of them. But what I view the JobSet as is a single node of a DAG, and one of our goals could be that either this Job or a JobSet could be added to something like Airflow or Argo Workflows, to be a single element that you could run, rather than Argo having its own way of representing what it actually runs on Kubernetes, which is, you know, fine for pods; Airflow is also pods. There are a lot of other workflow engines out there. Two jobs ago, in applied bioinformatics, we took a lot of inspiration from some of their workflow languages, trying to standardize a workflow language so we could actually run across different environments. So I'm familiar with the area, but we're trying not to be a workflow engine for this project. Thank you for the talk. I noticed that a lot of the things you were talking about seem to play in the same field where Slurm plays.
So, I don't know, a few years down the road, do you see Slurm giving way to this Kubernetes-based infrastructure, or do you think they're targeting different tasks, and Slurm will always have its place? That's a really good question. I was not at KubeCon North America this year, but I heard of a company called CoreWeave that was actually collaborating with SchedMD to try to provide Slurm on Kubernetes: from what I understand, using the Slurm scheduler, but also allowing people to run some of the more popular Kubernetes stuff, like Kubernetes for services and Slurm for batch. Generally, everyone is converging in this area. Armada actually takes inspiration from HTCondor and tries to apply that to Kubernetes. And then, sorry, I'm drawing a blank, the University of Wisconsin, who created HTCondor, they're big on trying to use Kubernetes for a lot of their infrastructure too. Also, we talked pretty closely with the SchedMD folks, at least in my last role, and there is a lot of interest in trying to bring Kubernetes to Slurm. Part of it is that Slurm has been around a long time, so they had to do a lot of work even to get to the point of: I want to containerize Slurm in Kubernetes. Okay, great. Now, do I want to schedule a pod, or do I want to schedule a single container? That's where I can see it being challenging. The other thing is convincing more and more people to use containers, because containers are great, but it's also a pain to change everything you have so it can go into a container. Okay. Any more questions? So, if I understand correctly, you're primarily optimizing so that I do not schedule 10,000 pods, and instead have job sets, right?
Because when I think about batch processing, I think about, let's say, CI, where we are running like 5,000 jobs per day, and we do this with Jenkins, which actually works super great with the Kubernetes plugin, but I'm not seeing enough features in this proposal to get rid of Jenkins or any other components. I'm primarily seeing a way of not overloading the cluster with pending pods. Is that right? No, I would say the main thing is: if you want to, say, run a PyTorch job, one option is to use Kubeflow. Fine, that will work. But what if I don't really want to use Kubeflow? What if I have my own representation? What if I want to add my own...
Kubernetes and HPC: Bare-metal bros
Okay, this is going to be interesting. We are relying on the Wi-Fi a bit here as well, so it would actually help if you turn off your Wi-Fi. I know that's a big ask. Consider it for the next half an hour; that would be really helpful. So Vanessa is live here through a video call. Give us a wave, Vanessa. We can... Well, can you try speaking? What's up, folks? Sorry, it's not working. Is that working? Try again? Okay, that's really better. Nice. So we'll start your recording, Vanessa, and then we'll try to do live Q&A at the end. Sounds good. I have some answers for the previous Q&A too, so we can talk a little bit about that. We can try. We can try. By the way, Vanessa is also the one who designed the HPC social logo, so you should thank her for that and take some stickers when you leave. Thank you. Thank you. All right, here comes the talk. Hi, folks. I'm Vanessa Sochat, and today we're going to be talking about Kubernetes and HPC, the bare metal bros. So I thought I would open this talk by putting two words on the slide, and maybe you're all very anxious about the question already. Those words are cloud and HPC. So probably the question on everyone's mind is: what does the future look like? I'm going to answer this question by posing a question back to you: where is the money going? We can look at polls from Gartner and Hyperion Research that suggest that cloud is projected to reach $40 billion by 2026, with a CAGR of 6.4%. So, very superficially speaking, the money is going to cloud. Now, we can follow up on this question: okay, that's great, but who's going to get left behind? We can look at a paper from Reed, Gannon, and Dongarra from 2023 that identified some really interesting trends. For HPC, it suggested that the way we design our systems will not continue to work.
We cannot depend on Dennard scaling and Moore's law. There are rising costs for improved semiconductors. This is going to make it harder and increasingly more expensive and laborious to deploy new systems. And they define something called NREs, or non-recurring engineering costs, that we incur for every new system. Now, cloud, on the other hand, is leading the space in innovation. As we know, there's this massive expansion of large-scale commercial clouds. They are not depending on software vendors or hardware vendors; they're making their own stuff in-house. And guess what? They're hiring away and attracting the talent pool. And they made a really interesting analogy with temperature. They described HPC as endothermic, requiring the absorption of heat for survival, and cloud as exothermic, really giving off heat. And we know, folks, we're not talking about heat here; we are talking about money. But to continue the heat analogy, you'll know that if you've ever been out in the snow in a cold environment, you are much more likely to survive if you're giving off heat. So who gets left behind? Well, the one that needs to constantly absorb heat to survive is the one that's probably going to run out. And that's the reason that we're all here: we need to ensure that the needs of our science are represented in this new environment. And guess what? The success of our science, the reason that we're all here, really depends on our ability to be collaborative in this space. And so this is really kind of the manifesto of converged computing: if we bring them together, we get this new technology space where we have the best of both worlds. So where do we start? Well, here is how the talk is going to proceed today. We're going to start with models for convergence, talking about patterns for bringing together traditionally disparate environments. We're then going to move into strategies for convergence.
So, designs that I've noticed allow for easy movement between the spaces. Let's start with those models for convergence. Now, if you've looked in paper land, you've probably seen many different models; there are many different ways to take HPC and cloud and put them together. I'm going to talk about the high-level patterns, from the perspective of someone that's maybe deploying a system. So let's say that's me, and let's say I want my cloud and my HPC: I'm going to take my limited set of resources and try to split them in two. So I spend a ton of money and I do this, and then... I chose poorly. No one's using half my resources, and oh my god. So four years later I come back and I'm like, all right, I want cloud exclusive-or HPC. I understand I can't have my cake and eat it too; I have to choose, so I am just going to choose one. We've used HPC for all these years, bread and butter, this is how you've always done things. I choose HPC. Great. Six months later, someone comes into my office: are we dinosaurs? You know, everyone over there is using YAML and automation and we have this old setup, and ah. So you go back to your office, you contemplate your life choices, and you're like, all right, no, it's okay, I'm not going to wait another four years. I'm going to sneak it in. So this is where you see all of these ideas, like bursting and multi-cluster, and these are generally referring to this idea of having some home base of resources and reaching out to get more. And the problem with this approach, as I see it, is that the complexity of these approaches often reflects the complexity of the systems. They tend to be snowflakes, they tend to be complex, and this is why there hasn't been a single leader that has emerged in the space. So here is a different idea that's less common, because it doesn't superficially make sense.
I want cloud and HPC, meaning I want to be able to run HPC, or cloud, at the same time, or something together that's more converged. What the heck am I talking about? Don't worry, we'll talk about it. Let's first talk about strategies for convergence. These strategies, I need to point out, are not just about the technology; they are also about the people, which is often harder. The first is common goals. In order to get two different communities working together, they have to care about the same things; you can't get around that. The second is modularity: the degree to which your application or infrastructure is modular, so that you can use things interchangeably and swap them, and be very creative. The third is integration: the consumption of an entire thing in another thing, by way of different strategies. So let me give you some examples. For goals, the best overlap of goals I've seen is with respect to batch workloads. A few years ago, the Kubernetes community started the batch working group, and this was because of the new need to run AI/ML workloads in Kubernetes. Traditionally, Kubernetes is where you run services; you keep something running. There wasn't this concept of starting something and having it complete, but all of a sudden there was this new need, and guess what? We have been doing that in HPC land for a couple of decades now. For modularity, a really great example is actually Kubernetes and Flux Framework. You may think of Flux as just this workload manager, but actually it's called a framework because we assemble many different components together into the workload manager known as Flux. Kubernetes is the same, a different set of components, and there is going to be a creative way that we can use these interchangeably. The final example is integration, and the best technologies I can offer are containers and language bindings.
Container technologies are literally a vehicle to let you move between spaces, and language bindings let you take a traditionally C++ HPC project and extend it into a language that is native to cloud, for example, Go. Alrighty, let's get into some examples, just like eggs three ways. Here are some projects that we've actually been working on at the lab. The first is Fluence. As I alluded to, this is the Flux scheduler swapped in for the Kubernetes scheduler. The next is the Flux Operator, the entirety of Flux Framework implemented inside of Kubernetes. And then the namesake of this talk, the bare metal bros: Flux and Kubernetes working side by side. So let's start with the Flux scheduler within Kubernetes. You may be familiar with what happens in Kubernetes when you launch a job. You ask for a certain number of resources; that's given to the scheduler. The scheduler says, okay, here are four pods, have a nice day. So what we're going to do is bring in Fluence: our C++ package, Flux-Sched, wrapped with Go bindings into a custom scheduler plugin. We're going to swap it in. So you're basically going to be asking for the same amount of resources, but the scheduling is going to be done by Flux-Sched. How does this do? Well, we find that the workflows run three times faster. What you're seeing here is the kube-scheduler on the top, Fluence on the bottom. You see a lot of randomness with respect to how the kube-scheduler places jobs. What this leads to is a pathological scheduling pattern. Anywhere you see a red box there, that is a startup delay. And what that means in practice is that although the workloads themselves run in similar times, we have a lot of outliers; we have a lot of jobs that take a really long time to get started. And Fluence improves upon this. So Fluence is a really great example of modularity, because we're taking an HPC technology and we're literally swapping it in.
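Mechanically, this kind of swap is opt-in per pod: Kubernetes lets a pod name its scheduler via spec.schedulerName, and the default kube-scheduler ignores pods that name someone else. A small sketch, where the scheduler name "fluence" and the image are assumptions for illustration:

```python
# Opting a pod into a custom scheduler plugin. Pods without this field
# keep using the default kube-scheduler, so both can coexist.

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "lammps-0"},
    "spec": {
        "schedulerName": "fluence",  # handled by the custom plugin (name assumed)
        "containers": [{"name": "lammps",
                        "image": "lammps-example"}],  # illustrative image
    },
}
```

This is what makes the modularity argument work in practice: nothing else about the pod or the cluster has to change.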
And the modularity of the software allows for that. It's also a great example of integration, because we have those Go bindings, so we can speak the language of the cloud native communities. Alrighty, next project, the Flux Operator. Super cool. All the gophers in Flux land are pretty cool. All right, so the Flux Operator is implementing the entirety of Flux Framework inside of Kubernetes, your own HPC cluster. This happens by way of a custom resource definition, or CRD, where you basically give all the parameters that you want for your cluster, whether that's a single job or whether you want an interactive cluster. This creates what we call the MiniCluster, and, you know, Flux doesn't know the difference between running in Kubernetes versus on bare metal. There's a lead broker that's connected to several follower brokers. So here you have one pod per one physical node. There's a tree-based overlay network, and within each pod or node, you have Flux that's added on the fly to your application. And the Operator is just going to basically reconcile until the state that you asked for matches the actual state of the cluster. How well does it do? We compared it to the best in the space last year, the MPI Operator, and the Flux Operator consistently outperformed the MPI Operator, we believe because of the ZeroMQ bootstrap. So the Flux Operator is a beautiful example of integration, because we're taking the entirety of Flux Framework and implementing it inside of Kubernetes. Bro, bro, bro, is it time for the bare metal bros? Yeah! Okay, so, warning. I've been saying bare metal, but nobody's going to give me bare metal. Let's be frank about that. So we're using virtual machines as a proxy for bare metal. So just a warning. So what's different about this picture? The orange is on the outside.
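Before moving on to the bare metal picture, the CRD flow just described can be sketched as a payload you would hand to the Kubernetes API, for example via the kubernetes Python client. The field names below are illustrative guesses based on the talk's description (a size, plus a container to run), not an exact copy of the real MiniCluster schema.

```python
# Sketch of a MiniCluster custom resource as a Python dict. The API
# group/version and spec fields are assumptions for illustration.

def minicluster(name: str, size: int, image: str, command: str) -> dict:
    return {
        "apiVersion": "flux-framework.org/v1alpha1",  # assumed group/version
        "kind": "MiniCluster",
        "metadata": {"name": name},
        "spec": {
            # One lead broker plus (size - 1) follower brokers,
            # one pod per physical node.
            "size": size,
            "containers": [{"image": image, "command": command}],
        },
    }

mc = minicluster("lammps-mini", 4, "example/lammps:latest", "lmp -in in.lj")
print(mc["kind"], mc["spec"]["size"])
```

The operator's job is then the usual reconcile loop: keep creating or deleting pods until the actual cluster matches what this spec asked for.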
So we actually have Flux Framework on the outside spinning up a Kubernetes cluster, and notice that we actually still have compute running on bare metal alongside Kubernetes. How's that possible? Don't worry, I'll tell you. So why do we need this in the first place? As you know, there are increasingly more complex heterogeneous workloads that are coming to HPC. So this means not just, you know, embarrassingly parallel stuff, but also adding in services, databases, task queues. Ah! Okay, so, this slide is not wrong. I was going to give you an example of such a workload, and apparently this slide is giving you this warning that I'm a bad scientist, but I'm not wrong, and I will point out that my example is actually a very good example, a prototype for this kind of design. Let's talk about that. So let's say that we're running simulations. We're generating training examples one through N, whatever, doesn't matter, and we want to send them to a machine learning server, a specific endpoint, to do the training. We then want to wait till some metric of goodness, or perhaps a number of samples, and then we want to flip it around. We want to run simulations again, but we want to instead give these to our machine learning server without the actual values; then we're going to have a vector of the true values and the predictions, and we're going to see how well we did. Now, very superficially, if we match this to HPC versus Kubernetes, this is how we'd do it. We would expect that the simulations would run better on bare metal, and the service stuff would run better in Usernetes, or Kubernetes. But this is something we need to prove to ourselves first. So a lot of you are probably out there like, Usernetes? Like, Kubernetes in user space, are you nuts? I'm not nuts. There's actually something called Usernetes. It came out of a Kubernetes Enhancement Proposal, or KEP, in 2022, by a very talented developer named Akihiro Suda.
Akihiro, I must point out, won the Top Maintainer award at KubeCon last year. He's an incredibly talented developer. If you've used any of these technologies, he's the one behind them. Hats off to Akihiro. So last year, at the beginning of the year, Usernetes was really a hodgepodge of kind of bash scripts. It was really hard to use. So I engaged with Akihiro, and we released generation 2 of Usernetes in September. And guess what? It is using containerization, which is really great. It has these components that we'll go into in more detail. So what does it mean in practice? Well, it means when you're building a virtual machine, you need to have cgroups version 2 enabled. I recommend Lima, for Linux virtual machines, if you're prototyping this for the first time. It also means that you need to enable these kernel modules. So very generally speaking, br_netfilter is going to allow you to apply iptables rules to bridge traffic. VXLAN is going to allow you to connect VXLAN devices on different hosts to a standalone bridge. This is important because we actually have different physical nodes. Now, it's going to use rootless Docker. This isn't such a crazy idea anymore; many clusters have Podman these days. And so what does it mean? Actually, when you bring up these VMs, it means that you're going to run a make up command that has two contexts. Both of them are going to build and start a base image that is using kind, Kubernetes, and Docker with CNI plugins. And then the two contexts are the control plane and the worker. The control plane is going to install Flannel and run kubeadm init. This makes a join command, which is basically a token that you give to the workers, and then the workers can authenticate and join the cluster. And so that's what they do. They're just like, I'm ready to serve. All right, so we created this gorgeous cluster, small and mighty, using oVirt and Ansible. It is small and mighty because each node has eight cores and 30 gigs of RAM, and an NVMe drive.
And I want to point out that we have seven nodes here because, generally speaking, we're going to have six that we run things with compute on, and one is going to be an admin node or control plane. Again, warning: not bare metal, you get the deal. All right, so what's in these VMs when we bring them up? We have a complete system install of Flux, Singularity on bare metal for reasons I'll tell you in a little bit, LAMMPS installed on bare metal, and of course Usernetes ready to be brought up. So once I shell into these VMs, my Flux cluster is ready to go. I can do flux resource list and I can see all my nodes. And for Usernetes, again, that administrative node is also the control plane, so we technically have six nodes to work with. And we can still see them with kubectl get nodes. Here's what we're working with: Usernetes and Flux running side by side, the bare metal bros. All right, bro, bro, what experiments do we want to run? All of them, bro. All right. So we first need to sanity check that what I said earlier about bare metal and LAMMPS and the simulations is actually true. We need to look at application performance between Flux and Usernetes. So the way we're going to do that is by running a few things. We're first going to run LAMMPS on bare metal with Flux. We're then going to do the same thing but in a Singularity container, and I did this just to demonstrate that you don't lose anything by using containers, which is great. We're then going to run LAMMPS in Usernetes with the Flux Operator. And then finally we're going to repeat cases one and two, but with Usernetes running in the background, to see if there's any overhead from that. And I need to pause for a second because I know how incredibly cool this third case is. We have Flux on the outside. Flux is running Usernetes.
Within that, we are launching the Flux Operator, which is bringing up another instance of Flux, and inside there is where LAMMPS is running. So folks, like, I know Thanksgiving is over, but this is the ultimate turducken. And we expect LAMMPS to be slower in Usernetes because, as we know, it makes MPI collective calls, and Usernetes is using something called slirp4netns, which requires additional processing of packets with a tap device. I have a great paper I can share if you're interested in learning more about that. So, drumroll, the results. As we expected, well, actually maybe we didn't expect, but guess what? The Singularity container case is very comparable to actual bare metal. I was very surprised by this. So Singularity does not add a lot of overhead. And this is what we'd expected: that guy up there, running in Usernetes, is about twice as slow as running on bare metal. So what did we learn? Well, we learned that for a setup like this, the network-sensitive stuff probably should be run on the HPC side. But I'll point out there's opportunity for improving this in Usernetes. If you have experience with networking, I'd like you to go over to the GitHub right now, well, maybe wait for after the talk, and engage there to work on this problem. Now, the next thing we want to look at is distributed machine learning, specifically two cases: one distributed across six nodes, and the second on one node. So in the distributed case, the network is a variable, and for the one node, obviously the network is not a variable. Drumroll, results: same thing, it's about twice as fast on bare metal, or twice as slow, I guess, in Usernetes. And interestingly, when you look at just a single node, these are really comparable, so there's no issue with running something on a single node in Usernetes in and of itself. It's really when you bring in the networking that it becomes a variable.
So it's the network, right? Well, let's sanity check one more thing. Here's iperf. We did one bit of a transfer test, from each node as a client to each node as a server. We see the bit rate in gigabits per second is between 10 and 30 for bare metal; Usernetes, with barely detectable values close to zero here, is really, really terrible. And we can see the same patterns for transfer. So yes, it's the network; we're pretty confident that, for this setup, it's the network. All right, can we do the fun workflow now? We absolutely can. So guess what, I actually prototyped this kind of workflow because I was really excited about it. And so what we're going to do is we're going to be launching a batch job with flux batch. This means a Flux instance that's owned by the running user. It's going to scope resources using hwloc, and in this batch job we can basically bring up and tear down all of Usernetes. We're going to take that workflow that I mentioned before, and we're going to map it into our Star Trek cluster space. So we're going to run simulations with LAMMPS, randomly selecting the problem sizes, to predict wall time. We're then going to bring up a machine learning server, a special server I made using River a few years ago. And then we're going to basically do the test cases: we're going to run LAMMPS again, but we're going to leave out the actual wall time, and we're going to ask our models what it is. And we're going to do a thousand training samples and 250 testing samples. How do we do?
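As a rough, self-contained illustration of this simulate-then-predict loop: run "simulations" with random problem sizes, train a regression on (problem size, wall time) pairs, then predict wall times for held-out runs. The fake_simulation timing model below is invented for illustration; the real workflow ran LAMMPS and a River-based ML server.

```python
# Toy version of the workflow: 1000 training runs, a least-squares fit
# of wall_time ~ a * size + b, then 250 held-out "test" runs.
import random

random.seed(0)

def fake_simulation(problem_size: int) -> float:
    """Stand-in for a LAMMPS run: wall time grows with size, plus noise."""
    return 0.05 * problem_size + random.gauss(0, 0.5)

# "Training phase": 1000 samples, mirroring the talk's setup.
train = [(s, fake_simulation(s)) for s in (random.randint(10, 100) for _ in range(1000))]

# Ordinary least squares by hand (no external dependencies).
n = len(train)
mean_x = sum(s for s, _ in train) / n
mean_y = sum(t for _, t in train) / n
a = sum((s - mean_x) * (t - mean_y) for s, t in train) / sum((s - mean_x) ** 2 for s, _ in train)
b = mean_y - a * mean_x

def predict(problem_size: int) -> float:
    return a * problem_size + b

# "Testing phase": ask the model for wall times it has not seen.
test = [(s, fake_simulation(s)) for s in (random.randint(10, 100) for _ in range(250))]
mean_abs_err = sum(abs(predict(s) - t) for s, t in test) / len(test)
print(f"slope ~ {a:.3f}, mean abs error ~ {mean_abs_err:.2f}")
```

The qualitative takeaway matches the talk: even a throwaway model recovers a clear pattern between predicted and actual wall time when the simulation side feeds the service side.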
I put no thought into these particular models, but I did three kinds of regression. The Bayesian one, sampling from a probability distribution, didn't do super well, but for the first two, there's an actual kind of pattern between the predicted and the actual time. And so although I put no thought into this, I was really pleased with this result, to see that the general prototype, this idea of having bare metal simulations running alongside a service, there is something here. We can do science this way with actual real scientific questions. And I'll point out that there are real heterogeneous workloads out in the wild that need this capability. Here's MuMMI, the massively parallel multiscale machine-learned modeling infrastructure, and this is basically simulating biological systems, the interaction between proteins and the plasma membrane. I'll also point out that the Moomins are what the name is based on, the Finnish comic book series with really cute hippos, often with yellow spiky hair. Very awesome. So this is the perfect example, the bare metal bros, of coexistence: adapting technologies to make it possible for them to coexist, and continuing to improve upon them so that, for example with networking, this environment can get even better. So what should you remember from this talk? If you take nothing else away, the first is looking out for opportunities for collaboration. Look for that alignment of goals between spaces; that's an opportunity. The second is providing handles for your components. If you don't have the bandwidth to look for opportunities, add some Go bindings to your C++ project, because someone else could find you. The third is engagement. We need to show up at the table. We need to go to working groups, conferences, places that you haven't traditionally been, to engage and to find these opportunities for collaboration. And possibly the most important is this mindset. We've had this mindset of cloud versus HPC, that one has to win because they're different, for so long. We need to throw that away and get
rid of the adversarial thinking and have a more collaborative mindset. This is the vision that we have for the future, for converged computing, and we hope that you'd like to join us. So thank you. That's how to reach me, my email and social networks, and here are some interesting links for Flux and the various projects. I think I will take some questions virtually now. Okay, we can take a couple of questions. It seems like the wifi is stable enough to let Vanessa answer them. Do we have any questions? Okay, so Vanessa, we may have to repeat a question for you; we'll see how that works. Hi Vanessa, amazing talk, congrats. So I was wondering if your architecture can support sidecars, because one of the nightmares I had when I was trying to do something similar was that in order to get the sidecars running, I had to spin up a second network stack, and that created a lot of overhead. No, no, just one is on. Okay, did you get the question, Vanessa? No, I didn't hear the question at all. Neither did I. Yeah, maybe that's better. Okay, let's do it like this: you'll come up front and ask it here. Yeah, that's perfect, that'd be great. I can hear you great. Hi there. Hi, so I was wondering if your architecture can support sidecar containers, because, as I was saying, when I was trying to do something similar, when I tried to create the sidecars, I had to create a second network stack within Singularity, so the network overhead was amazingly high. So, absolutely, the Flux Operator actually uses a sidecar container, an init container, which is similar in concept, to add Flux on the fly as a view. What's going on in Kubernetes is sort of a different thing than the networking issue, so the short answer is yes. To kind of add to that, though, I'm not sure that Singularity and Kubernetes, Singularity as the container runtime for Kubernetes, would work. I have never tried that, but it doesn't sound like it would work. Yeah, it needs to be done. Yeah, exactly. Hi Vanessa, thank you. Hi. It was the most fun presentation at FOSDEM so far, thank
you. So, when you were saying that the main difference in performance between VM and bare metal workloads was related to the network, was that the case also for distributed training? And if that's the case, were you using InfiniBand or not? So, we did not have InfiniBand, and you make a really good point that this kind of setup would need to be tested with an actually great network, and that is still a very big challenge even for cloud. So for example, if you use AWS, you can bring the Elastic Fabric Adapter, which will give you great networking performance, but if you go to other clouds, and I won't name them specifically, you tend to only get really good networks when it comes to using like TPUs or GPUs. The exception, though, is Azure, which has a lot of really great HPC stuff kind of built in. So absolutely, you could get that setup with InfiniBand. Hi, thank you for your talk. I had a smile on my face the whole time; thank you for having such high energy at the end of the day. What was I going to say? Oh yeah, so probably in my workloads I can reduce the network traffic by a very large margin if I can constrain certain jobs to specific nodes, because then large files don't have to be moved across the network for certain jobs. Is that something that you could keep in mind? So, if you remember the very quick machine learning experiment that we showed, when we're running something on one node and you're not using the network, there's no issue. So if you're just running something on one node in Usernetes, you won't have an issue, and to the degree that you can reduce anything that uses the network, so moving data, MPI, etc., etc., you will get similar performance, at least from this small prototype experiment that we've seen, as you would on bare metal. I have to do this because it wasn't really bare metal. Thanks. One more question. Hey Vanessa, it's Danny. I'm gonna dye my hair soon, so you won't recognize me again. I really liked your framing, actually. I thought it was going to sort of be
adversarial, and then I actually realized what you were saying, and I really appreciated it. However, though, regarding the adversarial framing: I have some experience with, for example, cloud tools and cloud environments being used as platforms for vendor lock-in. I think that what you described, especially with your converged computing, is kind of the way that you can push back against that, so scientific labs aren't kind of indebted to corporations. I actually think that you kind of made a really useful example of one way to do that in your talk. So again, I actually was very, very impressed by the way you kind of explained that. I would like to know, in the more general sense, how can labs and potentially RSEs make use of cloud tools without getting locked in or becoming beholden again to a corporate environment? And again, by the way, I think that you effectively did that in this talk, so I'm more looking for a general kind of thought about that. You're totally correct that vendor lock-in is an issue, and you tend to see many sort of niche APIs in different clouds, and if you build your entire thing around them, you do face that as an issue. But the great thing about Kubernetes is that it is this open source project that is available across clouds. There are subtle differences, but if you make a workload that can run on Kubernetes, you're going to have an easier time moving it between clouds. And, you know, speaking from my lab, we work on Flux Framework, and one of our goals with Flux is to make things portable, not just between clouds, but between cloud and HPC. That's also why something like Usernetes, running actual Kubernetes on bare metal alongside HPC, is so important, because all of a sudden you have the same workload and it runs in all the places. That is sort of the vision: we want to make sure that the scientific workloads that we're running today can run in all places, not just one niche specific cloud, not just one niche specific center. Just convergence, TLDR. That is very
exciting, and I really appreciate that response. Thank you so much. Okay, that's all we have time for. This worked out great, Vanessa, I hope you agree. Yeah, it was really fun. If anyone has further questions and stuff, please reach out to me; I love chatting. It was a pleasure chatting with you, and I hope you have a great rest of your FOSDEM. Thank you. And the best way to reach out to Vanessa is via hpc.social, so don't forget to grab a sticker. And as you walk out, please consider doing a small donation in the box as well to help cover the costs. And if you're leaving, please check if you see any trash around; please take the trash with you, bottles, anything. Anything you clean up, we don't have to clean up. Thanks a lot, Vanessa, this was great. Bye.
Welcome to the Identity and Access Management devroom!
We are starting the second edition of the Identity and Access Management devroom. My name is Alexander Bokovoy, and this is Iker Pedrosa. Formally, we are the ones who organize this devroom. If you have any questions, anything, please talk to us. A guy in a blue t-shirt will be the guy moderating a specific session. I'm not talking about him or Trevino specifically, because this is a moving target; we have one t-shirt for the people who will be moderating. I wanted to do a bit of a history reminder. We had the first edition six years ago. It was, I think, a successful one. We got roughly the same amount of talks as we will hear today, and they were just as diverse and wide in topics. Also, we had quite a lot of people coming to listen, to the point that at some talks we actually caught the FOSDEM sickness: there were like 50 people in the room, it was a smaller room, and hundreds of people waiting to get into the room. So, truly the FOSDEM sickness that we enjoyed six years ago. I hope we will have enough space, because this room is twice the size of the first one. For this year, as you all know, the site probably has the whole schedule; you can get access to all the things there. I will just remind speakers here to please upload slides, using this new Pretalx interface, and please upload them roughly half an hour before your talk, so that people who will be watching the live stream have some reference point and can see them. You can omit things from the slides that you want to make a surprise during the live presentation, so that it's not spoiled, but it's typically good to have them uploaded. Since this is a smaller room and we don't have another mic, please, when you're talking and hearing some questions, please repeat the questions so that they are recorded in the mic. Finally, we have smaller slots, but please leave one or two minutes so that we can change to the next speaker and the time doesn't get taken from the next talk.
And since this is largely done by automation and volunteers, you will get an email from the video team that will give you a link with the details of your presentation recording. And you need to act on that, preferably if you have time today, or at most tomorrow, so that they can re-encode the video and publish it. There will be an interface where you get to set where the talk starts and ends, and, maybe not in our room, where we have one mic or maybe two, but in some other rooms they have more mics and you can choose which audio you're taking. And once you've set this up and signed off on that video, it gets re-encoded and published automatically on the schedule page. For all the people who missed your talk, they will be able to get the recording, and how fast they get it depends on you as a presenter. So, yeah, and now Iker does an overview. You forgot one thing regarding the video editing: that you need to make sure that you sound correct. I forgot to say that, yes, you need to check that you sound correct in the video. In some of last year's talks, the sound wasn't very good at the beginning. Yep. So, now it's over to you. Okay, so my name is Iker. Well, this is my first FOSDEM; I hope you are enjoying it as much as I'm enjoying it. And, well, this Identity and Access Management devroom is about identity and access management. We have several talks regarding passwordless. We also have multi-factor authentication, single sign-on, user federation. And, well, just in short, I hope you are enjoying it a lot. Leave space for the next speaker to prepare everything. And we are all volunteers, so if you find that something is not correct, either just fix it or tell us, so that we can, you know, try to fix it and have everything correct. I don't have much else to say, so thank you and have fun.
SpiceDB: mature, open source ReBAC
All right, so this is the talk on SpiceDB. Thanks, everyone, for showing up so early in the morning. I'm starting to lose my voice because there was a long day yesterday of talking and meeting awesome people. This is my first FOSDEM. So who am I? My name is Jimmy Zelinskie. I'm the co-founder of a company called AuthZed, and AuthZed built SpiceDB. Previously, I worked at Red Hat and CoreOS, so I've been around in the container and Kubernetes ecosystem for a pretty long time, basically since the beginning. There, I'm actually a maintainer of OCI, which is the standard specification for Linux containers. And I've also started a bunch of projects in that space, notably the Kubernetes Operator Framework and some others. This talk is entitled SpiceDB. But since FOSDEM is more of a developer community conference, I really wanted to focus less on this talk being a vendor pitch for SpiceDB, and actually more on a level set about the problems in the authorization space and the history and status quo of that, so that everyone understands what might be the best tool to solve their problems. I'm not going to try to sell you SpiceDB for all problems, because the more informed you are, the better you can pick the product that's actually going to complement your software stack and what you need. And that means there are going to be way more qualified people using SpiceDB, and way more qualified people using other authorization tooling. Obviously, I'm the most jazzed about SpiceDB because I created it. So why are we all here? We're all here because there is a not-for-profit organization called OWASP, which is the Open Worldwide Application Security Project. They kind of got started in the early 2000s, and they're famous for having this list called the Top 10. The Top 10 is basically an enumeration of the highest-risk threats for web security. As of 2017, broken access control was number five. As of 2021, broken access control is number one.
That means this is the biggest threat to the web and to all the applications running internet-facing on the web. But really, the question is, how did we actually get to this point? Like, how did this happen? And how did it happen so quickly? I'm not going to point any fingers, but what I'm actually going to do is dive into two different groups of stakeholders in the history of authorization. There's academia, the people publishing papers in this space and defining concepts, and then there are the industry practitioners that are actually building the software and realizing these systems as they're actually connected to the web. I'm going to start with academia first. So on the right-hand side, you're going to see a timeline, and then on the left-hand side, there are going to be some notes. And, not for this slide, but you'll see QR codes in this corner as well. Those QR codes are going to link to the specific original paper, so if you're interested in any of these particular concepts, you can feel free to scan them. But our history of authorization is actually going to start in the 80s. And it really gets kicked off with the publication of the Trusted Computer System Evaluation Criteria, which is a security practices book published by the US Department of Defense. In it, they outline a lot of different security practices that are effectively a part of the United States military. And in it, they describe these two different access control systems: discretionary and mandatory. Now, discretionary is effectively: if you created the idea or the information, you can share it, and if you're then given access to it, you can share it onward. It's at your discretion. I kind of use file systems and Google Docs as examples here. It's not a perfect one-to-one match, but if someone shares a file with you on a UNIX file system, you can copy that file if you have read access.
And then you can change whatever permissions on that and share it, similarly with Google Docs. So it's at your discretion how you're going to share that information once you're given read access. Then there's mandatory access control, which is effectively a long, exhaustive list of all the access for a particular thing. Most notably, people are most familiar with SELinux as the example of this. If you're unfamiliar with SELinux, it's a way of locking down the Linux kernel. Honestly, it kind of comes with a negative connotation, because mandatory access control is very verbose and very difficult to get right, because you have to enumerate absolutely everything. Some people say that the three-letter agency at the US government that created this are the only people who actually know how to configure it correctly. I don't know if that's actually true, or how many people use it. I know Red Hat is one of the folks that actually does promote SELinux. But the one thing about this slide I really wanted to drive home is that these ideas are as old as the military and war itself. There's nothing novel about the 80s where these ideas got invented; what actually happened was someone only ever thought to write them down in the 80s. So it took that long, after using these ideas for many, many, many years. So we jump roughly nine years to 1992, which happens to also be the year I was born. That makes me feel relatively old. But in 1992, we get this paper published on role-based access control. And role-based access control, often called RBAC, is kind of where most people believe the state of the art for authorization systems is. The core idea is basically that there is a group that is assigned access to a particular thing, and those groups are called roles. And then you map users into these roles, and by means of being in a role, you get access delegated to you.
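The role mapping just described is small enough to sketch directly: permissions hang off roles, users hang off roles, and a check walks both maps. All the role, user, and permission names below are made up for illustration.

```python
# Minimal RBAC sketch: permissions are granted to roles, never directly
# to users; users get access only by being mapped into a role.

ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete"},
    "editor": {"read", "write"},
    "viewer": {"read"},
}

USER_ROLES = {
    "alice": {"admin"},
    "bob": {"viewer"},
}

def allowed(user: str, permission: str) -> bool:
    """A user is allowed if any of their roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))

print(allowed("alice", "delete"), allowed("bob", "write"))  # True False
```

Note what the sketch leaves undefined, which is exactly the scope problem the talk raises next: nothing here says whether "admin" means admin of the whole app or of one resource.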
The kind of number one problem with RBAC is that everyone defines it differently. If you build any enterprise software, you're going to talk to clients, and they're going to ask you for RBAC. But the difference is, if I look at two different enterprise applications, they implement RBAC entirely differently. The only commonality is this mapping of users into groups that then have access. This is going to be a recurring theme across all of these papers published in academia, anything with star-BAC: they're documenting concepts, but not actually specifications that would give you an ultimately cohesive, designed, and secure system. So, most famously, the biggest issue with RBAC is that there really is no scope. If you say someone is an admin, does that mean they're an admin of the entire web app? Does that mean they're an admin of a particular resource in the app? You just don't know until you actually build it yourself. So there's not really an easy way to reason about these systems until you actually touch them. So we jump well into the future now, into 2015, and that is when the paper on ABAC, which is attribute-based access control, is written. Effectively, the idea behind ABAC is to generalize on RBAC and say: the role that you're assigned is just one attribute that your user can have. Other attributes might be that you logged in from this IP address; many other dynamic attributes can be assigned to you. The really important thing about ABAC is that it's providing this real-time context. So now you can write rules like: are they connecting from this country, this subnet, at this time? You can delegate access within particular windows of time and perform more logic on these attributes that folks have. And now we're going to take a huge digression back to 1965. So if you're unfamiliar, Multics is actually this operating system that was developed between MIT, GE, and Bell Labs.
You might not remember it, but it actually inspired an operating system you're probably familiar with: Unix. So Unix is actually an attempt at porting Multics concepts to less expensive hardware. Multics is often credited as the first operating system that has access control for the file system. I actually don't know if that's true, but it's often credited as that. So in Multics, you have a file system tree, so you get hierarchical structure. And then at every branch, which would be a file or a directory, you can have five different attributes assigned to that file. You get read, write, execute, append; these are all file operations that you'd be familiar with. But there's this fifth one that's super interesting, called trap, and that actually gives you the ability to do callbacks and call functions. It was initially designed so you could do file walking in user space. But the whole thing with Multics, and the reason why I bring it up, is because there was inheritance, there was ABAC, and there were user-defined functions in an authorization system in 1965, when in academia the ideas behind attributes were published in 2015. So there are systems using these concepts, but they maybe haven't been formalized and written down in a concrete form. And this is a huge issue with the whole space, because people are doing things, but they're not really studying how to make these systems robust with these ideas. They're kind of just documenting these ideas ad hoc. So getting back to the normal timeline, we head toward 2019, but it's actually in 2007 that the term relationship-based access control is coined. And the idea behind this is that by establishing a chain of relationships, like: Jimmy is a speaker at FOSDEM, and speakers at FOSDEM have access to the FOSDEM speaker Matrix chat, if you can follow these chains of relationships, you can actually derive that Jimmy has access to the FOSDEM speaker room.
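The chain just described can be made concrete with Zanzibar-style relation tuples of the form (object, relation, subject), where a subject can itself point at another relation. This is a simplified sketch of that idea, not the actual Zanzibar or SpiceDB tuple syntax; the object and user names are illustrative.

```python
# ReBAC sketch: access is derived by following relationship tuples.
# A subject written "object#relation" is a userset: everyone who holds
# that relation on that object.

TUPLES = {
    ("fosdem", "speaker", "user:jimmy"),
    # Anyone who is a speaker at fosdem is a member of the speaker chat.
    ("speaker_chat", "member", "fosdem#speaker"),
}

def check(obj: str, relation: str, user: str) -> bool:
    for o, r, subject in TUPLES:
        if (o, r) != (obj, relation):
            continue
        if subject == user:
            return True  # direct relationship
        if "#" in subject:  # indirection through another relation
            ref_obj, ref_rel = subject.split("#")
            if check(ref_obj, ref_rel, user):
                return True
    return False

print(check("speaker_chat", "member", "user:jimmy"))  # True
```

Notice that jimmy was never granted chat membership directly; the answer falls out of following the chain, which is the whole point of the model.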
So this term is coined around then, and it's looking forward at what tech in the Web 2.0 era will look like. It's published initially while considering how Facebook's social graph works internally. So when you share photos on Facebook, you say friends of friends can view this. You're literally defining access in terms of relationships to yourself. So we hit 2019, and that's when Google publishes a paper called Zanzibar, which documents an internal system at Google powered by these concepts. And the difference, and the reason why I have 2019 for ReBAC, is because Google is documenting a concrete implementation of this. Unlike a lot of these other papers talking about concepts, it's talking about an application of these concepts and really giving you a framework for how to use this effectively and in a correct way across multiple products at Google. So then in 2021, SpiceDB is open sourced, which also implements similar concepts to Zanzibar. And obviously, I'm going to get into that later. There are other models beyond these, but those are the primary ones that I see mostly in industry. You can dive into Wikipedia if you're interested in the other ones. But now you've got the industry side of things; we're leaving academia. And industry has this problem when they go to build a web application. Your first job is just to build the MVP, the minimum viable product, of your web application. So what you're going to do is what you do with everything in a web application, which is store data in a database, probably the relational database you're using for everything else. And you're going to try to check if a user has particular access based on some data you store in the database. It might be a role, if you're inspired by RBAC. But maybe it's just an enumeration of the list of users that can do a particular thing. So you may have written code that looks like this.
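The slide code isn't in the transcript, but the typical MVP check amounts to a hand-written query against the app's own database. Table and column names here are hypothetical:

```python
import sqlite3

# The typical MVP: permissions live as rows in the same relational
# database as everything else (schema invented for illustration).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE document_acl (user_id TEXT, doc_id TEXT, role TEXT)")
db.execute("INSERT INTO document_acl VALUES ('jimmy', 'doc1', 'editor')")

def can_edit(user_id, doc_id):
    # One bespoke query per question the app needs to ask; every new
    # requirement (recursive teams, new roles) means touching this code.
    row = db.execute(
        "SELECT 1 FROM document_acl WHERE user_id=? AND doc_id=? AND role='editor'",
        (user_id, doc_id)).fetchone()
    return row is not None
```

This works fine for the MVP; the talk's point is what happens when the requirements outgrow it.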
But the problem is this falls over at some point in time — whether because fundamentally you built a system that is just really slow, or because you have to make it scale way beyond what you ever intended it to do. Or you basically get users of your software that demand new functionality that is not actually possible for you to implement until you refactor your authorization code. A great example of that is if they want recursive teams. So if you have groups of users, what if you have groups of groups? Or groups of groups of groups, right? That is something that most people cannot build, or don't build, in their initial MVP. And when you get functionality requests like that, you're forced to completely rewrite your authorization system. The other thing that could happen to you is your company buys another company, and they're based on a different continent. And that means all the requests for checking permissions now have to travel across an ocean if they want to be correct. That's a huge problem. And making sure that the performance is actually going to be viable, and the answers you're going to get for authorization questions are correct, is a difficult problem. So you hit one of these big issues, and then you are forced to enter the cycle that I'm going to get into. These numbers are kind of fudged, but the whole point is that if you take an engineer, probably with expertise in that web app, who has worked on this authorization system, it's going to take them a while to implement this. It's going to be super sensitive, because someone else is going to have to review it. That person is also going to have to be deeply embedded in that code base. They're going to be extraordinarily careful, because any mistake that happens in this code base is going to be a CVE — it's giving access to people that shouldn't otherwise have access. So that's going to take a long time. Then you're going to do QA.
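The recursive-teams requirement mentioned above is exactly what a flat ACL table can't answer with a single simple query; resolving nested groups needs a recursive walk. Data and names here are illustrative:

```python
# Groups of groups: membership must be resolved recursively,
# which most MVP authorization tables were never designed for.
MEMBERS = {
    "grp:platform": {"grp:backend", "user:ana"},
    "grp:backend":  {"grp:db-team"},
    "grp:db-team":  {"user:raj"},
}

def expand(group, seen=None):
    """Flatten nested group membership down to users (cycle-safe)."""
    seen = seen if seen is not None else set()
    users = set()
    for member in MEMBERS.get(group, ()):
        if member.startswith("user:"):
            users.add(member)
        elif member not in seen:
            seen.add(member)
            users |= expand(member, seen)
    return users
```

In SQL this becomes recursive CTEs or repeated round-trips — the kind of change that, as the talk says, usually forces a rewrite of the authorization layer.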
You might actually have to perform a security audit before you can deploy this software, because you're deploying to enterprise environments. And then you're also probably going to want to take extra time rolling out these changes into production. You probably don't want to deploy to everyone all at once; you probably want to deploy to a minor subset just in case you find something wrong with the code. And all of this just takes time. And the problem is it's actually putting the security of your software at odds with development velocity. Fundamentally, it's going to take you too long to add this functionality, and you're going to want to take shortcuts. But shortcuts are security flaws in your software. So then it's rinse and repeat: you basically don't know how long until the pain builds up to where you're forced to rewrite these authorization systems. And that is the mystery box entirely. You could finish, or not even be finished, rewriting your authorization system, and then all of a sudden a new user sets some requirement for you. And you're doomed. You have to completely rewrite the thing you just thought you re-architected to be future-proof. So how do we fix this never-ending cycle? OWASP themselves actually have recommendations for this. They say you should no longer adopt RBAC, but take concepts from ABAC and ReBAC. Obviously, I'm biased towards ReBAC because I think it's a more modern approach to this. But the OWASP folks also give you some high-level reasons why you would adopt these newer models over RBAC. I'm going to take this from the ReBAC perspective. When you're doing a graph-like thing, a relationship-based system, you're forced to talk about individual entities. So this user, Jimmy, has access to this particular document. Because you're doing that, it has this buzzword: fine-grained. You're not resolving Jimmy to a role or a group.
You're actually following Jimmy directly to the document. So you're talking about individual entities in the system, and as a result you get more fine-grained access. You're not generalizing about any users or painting over anything; you're talking about the exact objects you care about. And that means you can actually build systems where you delegate access to a particular row in a database or a cell in a spreadsheet. And all of these systems are designed for speed, because they understand they're going to have to store a lot of data to be this fine-grained. And then, because your applications are only talking about the direct objects that they care about, none of the relationships in between get written in your code. You just ask the question: can this user perform this action on this thing? How they got access to that — and if you ever refactor or change how they get access to that — does not live in your code base anymore. That means you can make changes to your permission system and not change a single line of code in any of your web applications. And believe me, when you do that for the first time, it is a magical feeling, because you don't have to touch any code. Then there's also multi-tenancy and ease of management, and just simplicity around modeling. And with ABAC and ReBAC systems, you're paying it forward. So RBAC might be really easy conceptually for you to implement at the beginning, but these systems, the ABAC and ReBAC ones, are more focused on forward thinking. If you need to make changes, like I just described, you can change ReBAC designs without changing code. It may be a little bit more effort for you to get started building and integrating with one of these systems, but by day two, if you ever need to make a change, it's going to pay dividends. So I wanted to get deeper into this Zanzibar paper I talked about earlier, which kicked off the interest in ReBAC that you see today.
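The "one question" pattern — the app asks check(subject, action, resource) and knows nothing about how access arises — can be sketched as follows. `AuthzClient` is a made-up stand-in, not a real SpiceDB client API:

```python
# The application only ever asks one question; how access was granted
# lives in the authorization service, not in the app.
class AuthzClient:
    """Stand-in for whatever client your authorization system provides."""
    def __init__(self):
        # Policy data lives here, outside the application codebase.
        self._grants = {("user:jimmy", "edit", "doc:plan")}

    def check(self, subject, action, resource):
        return (subject, action, resource) in self._grants

def save_document(authz, user, doc):
    # App code never spells out roles, groups, or relationship chains,
    # so refactoring the permission model requires no app changes.
    if not authz.check(user, "edit", doc):
        raise PermissionError(f"{user} may not edit {doc}")
    return "saved"
```

The payoff the speaker describes follows directly: swap how `_grants` is computed and `save_document` never changes.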
Basically, Zanzibar is a purpose-built graph database that is very specifically optimized for one thing, which is finding a path in a graph. And by virtue of finding that path, the user has access to that particular thing. It's actually one of the few good things that came out of Google+. There are only two things that came out of Google+: there is Zanzibar, internally at Google, and then Google Photos. The novelty of this paper is that it solves an authorization problem with a focus on distributed systems. If you'll notice, the title of the paper is "Zanzibar: Google's Consistent, Global Authorization System". So it is fundamentally trying to tackle authorization as a distributed systems problem, which is not really something anyone else had done in the past, because they acknowledge that if they're going to deploy one system at Google, it needs to work across all geos in the world. And it has to be extremely, extremely reliable, and it can never be wrong. These are really difficult requirements. But the anecdote I like to use is: when you're on a cloud provider like Amazon and you go to provision something like, say, an S3 bucket, you're always choosing a region. But if you go to set IAM rules in a cloud provider like Amazon, you don't pick a region. That is because these systems fundamentally have to be global. And when you're designing them yourself at a particular scale, you need to think about how you're going to make your system global. This paper actually inspired two companies, Carta and Airbnb, to go forward and implement their own internal systems based on the ideas in this paper. None of them are truly 100% faithful to the original paper, I would say, but rather the paper fused with the requirements of their business at the time.
So I think the real superpower of Zanzibar, though, is this: if you go to send someone a Google Doc in Gmail and they don't have access, Gmail will pop up a box and tell you, hey, you didn't give access to this person. That fundamentally means that Gmail actually has a way to ask questions and check permissions that are built into Google Drive. So that means you can have one central source of truth for authorization data that your whole application suite can share, that microservices can share. And this is incredibly powerful, because not only does it allow integrations like this, but it also lets you have that central source of truth where, if you need to audit something, you can just ask that one service. It's the only service you have to trust, and the only service you have to query if you're trying to really dig into any of this data — say you have a problem like an outage or an incident, and you need to understand what the access control looked like. So you might be wondering, how do I Zanzibar? This is exactly what we set out to do. Basically, the year after the paper was published, my co-founders and I left Red Hat to found a company and build SpiceDB in the open source. There were some folks experimenting with the ideas around ReBAC at the time, but no one was really moving the needle towards making this a production thing that you could use in a real enterprise environment or at a real tech company. We originally prototyped the thing in Python. It was type-annotated, lazily evaluated, functional Python. So it was way faster than you'd ever think Python should be, but it was not fast enough, so we ended up rewriting it in Go and open sourcing that. The name is actually inspired by Dune, because internally at Google the project was actually called Project Spice — because the ACLs must flow.
So the timing for that has actually been really good with all the Dune resurgence in the movies, but internally at AuthZed all of our software is named after Dune references as an homage. If we fast forward to today, the SpiceDB community has actually gotten contributions from a lot of companies — big names like Netflix, GitHub, Google, Red Hat, and Plaid. And there are production users from small startups, where it's just the co-founders, all the way up to Fortune 50 companies. But I still haven't actually told you what SpiceDB is. SpiceDB is, as I described with Zanzibar earlier, this extremely parallel graph database. Developers basically apply a schema, just like you would for a relational database — I've given an example schema here, modeling a Google Doc. And then they store data inside that database and query that data according to that schema. And it's really magic: you can actually make schema changes in a forward-compatible way that lets you modify your permission systems without changing any code. We don't actually have a SQL API, despite being a database. We give you gRPC and HTTP APIs, and effectively the primary interface we recommend is gRPC, for latency reasons. Because authorization is in the critical path of everything your web applications are going to do, and possibly everything at your business, you really have to make sure this stuff is fast. Thus, everything needs to be kept in memory, and everything needs to be returned in single-digit milliseconds. So gRPC is pretty critical for that. In addition to the actual main server, we also expose services that power dev tools, so you can get auto-complete and similar things in your editor, but then also integration-testing services. And it's Kubernetes-native, designed that way from the beginning — our background is all in Kubernetes. So SpiceDB is actually self-clustering.
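The example schema the speaker refers to isn't captured in the transcript. Modeled on SpiceDB's own getting-started documentation, a schema for a Google-Doc-style app looks roughly like this (relation and permission names are illustrative):

```
definition user {}

definition document {
    // who holds which relationship to a document
    relation writer: user
    relation reader: user

    // permissions are computed from relations; changing these
    // lines changes access without touching application code
    permission edit = writer
    permission view = reader + writer
}
```

The point the talk makes follows from the last two lines: redefine how `view` is computed and every application asking "can X view Y?" picks up the new behavior with no code change.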
So if you deploy SpiceDB directly onto Kubernetes, it will discover other nodes and start to divide and shard the in-memory graph that it uses to serve this data across them automatically. We also offer a SpiceDB operator in the open source, which will do automated updates for SpiceDB. Notoriously, having zero-downtime updates for a database is very tricky, so we took that problem off the table for most people and implemented it automatically for anyone using Kubernetes. And we remain true to Zanzibar's goals of consistency at scale. We have pluggable data storage systems, and depending on what your requirements are — say you need to deploy everywhere on the globe — you can store all of your raw relationship data in something like Spanner or CockroachDB, and then have regional deployments of SpiceDB that exist as independent caches for those geos. But fundamentally they're sharing the same core data, and they're consistent across those environments. If that sounds too complicated for you, or you don't really need that because you're a single-region shop, that's fine: we also have deep integrations with Postgres and MySQL, if you just want to use something like Aurora or Amazon RDS. And obviously there's also an in-memory datastore for testing. We also have a tool called Zed. Zed is the command-line tool: it manages cluster credentials and backups, and it gives you a command for every single SpiceDB API. I've given an example here of running a permission check with debug flags. You can see it gives you a whole graph traversal: it shows you a tree of how it computed whether or not someone has access, with timing data associated with all of it, so you can see where things slow down. We have a web IDE. The two things you just saw, SpiceDB and Zed, we compile to WebAssembly and then run in the browser.
And then we basically build that all on top of Monaco, the engine that powers VS Code, and give you a full IDE where you don't have to install any of the software I just showed you. You can just go to play.authzed.com and start playing with this stuff: run Zed against live data, load in test data. And what we actually do is generate exhaustively all of the paths available in the graph for you. So there's somewhat of a model checking happening here: you can actually prove exhaustively that all of the ways you can traverse the graph are the ways you think they are. And that lets you prove that a system is correct without deploying it into production or having someone do an extremely long security audit on your process. And then you can check this stuff into CI/CD, so if you make a change to the schema, you can guarantee that certain assertions always pass and that everything is exhaustively checked. Now, Zanzibar is not a silver bullet. We have actually had to extend Zanzibar in a bunch of different ways. SpiceDB remains true to all of the core concepts that you'll find in Zanzibar, but not everyone is Google. Effectively, not everyone represents users the same way, so we are more flexible with how people can model their own users. And then we add on developer experience, because at Google they can say: you're forced to use this software. When you're building open source software, you can't force people to use it; you have to compel them to use it by having a better experience than what they're currently doing. We've also added contextual relationships, bringing in ABAC. That means relationships can exist dynamically, based on context that you provide at runtime. That was a joint project with Netflix.
So if you're wondering how you SpiceDB: you can go to our Discord at discord.gg/spicedb or check out GitHub — basically anywhere on the internet where you'd expect to find open source projects, SpiceDB is there. So thanks, everyone. Thank you. Thank you.
Improving Infrastructure Security Through Access Auditing
Today is Scott Bryan. He's going to talk about improving infrastructure security through access auditing. So, you're up. Morning, everyone. So, I recently joined Red Hat, and I work full time on the Adoptium Temurin JDK project. We use a very traditional build model with a large suite of machines: we support between 12 and 15 different platform and architecture combinations. So it's very difficult to do just with Docker containers or single machines — we have a massive, massive suite of infrastructure. We're currently undertaking a massive piece of work to secure our supply chain. We are looking at SBOMs and reproducible builds, but underpinning it all is a good infrastructure security strategy. We've implemented centralized keys, rootless access, things of that nature. But how do you know all of that stuff is working? Unless you can visually see the results of all your security work, it's very difficult to prove whether it's working. When I came in, there was no strategy for verifying that any security fixes had worked. So this is a very cut-down presentation from the full-length one. First things first for us was identifying what we wanted to get out of an auditing system. We want to capture logins, any access attempts, anything at all where somebody was accessing a system, particularly in the build sphere. If you think about the SolarWinds attack, which was a compromised Jenkins server, I believe: if your build system infrastructure is compromised, your builds and source code are potentially compromised. You build something, it's got a vulnerability in it, but its checksums and everything else look valid, so any end user sees it as valid. The other thing we wanted was automated response and alerting. Should somebody try to log in as root on a build system, that needs to be stopped straight away, and we need to be alerted that that's happened.
I'll come to why in a little while — the scale of the problem when you don't know about it is very different to when you do know the numbers involved. And then we want some analytics and reporting so we can, again, gauge the programme and its success. Ultimately, our infrastructure is all provided by a dozen different cloud providers, and it's all publicly accessible. Even our build infrastructure is open to the web; you can request access to it when you join the projects. So the attack surface is significantly large. We don't have a single firewall that we can use to restrict the IP addresses — it's all publicly available. So for us: host-based intrusion detection using Wazuh. Not a tool we build, but it's open source, and it's a very good tool for this use case. I would recommend you do a very similar exercise: analyze your requirements and then have a look at the tools that are available — there are quite a few of them. Wazuh itself is a fork of OSSEC, which kind of stopped development when it became semi-paidware; Wazuh was an offshoot that is still open source, and they've continued to develop features for it. So, the scale of the problem. Some numbers: in 24 hours across our infrastructure suite, just slightly over 2 million attacks. It's a bit of an eye-opener. Of those, 12 are deemed — and the standard rule set from Wazuh is really excellent — serious enough to warrant concern. And you can see, in 24 hours, about half a million attempts of people just brute-forcing the build machines to try and compromise them. I think a demo is slightly impossible without my laptop, but you can drill down into all of these. You can see all the metrics that are available for the attack vectors and the CVEs, and you see there are also the 79,000 authentication successes, here on the right. What's the difference between SSH and brute-forcing?
Not all machines are accessed by SSH, so those will be things like Windows brute-force password attacks. But Wazuh detects, again, remote services, modify-registry attempts, all via RPCs and things like that. So the first thing it does is give you a nice visual view of how big the scope of the problem is — it's why I like this tool quite so much. Drilling down a little bit into the authentication failures, you'll notice that Windows, by far, is the key attack platform compared to the Linux servers; the numbers are hundreds of thousands higher. And you'll see the top three machines are all Azure Windows build machines. It's a very popular thing to attack, and again you get a much better breakdown of the attack vectors: people trying to access restricted accounts, people trying to guess valid accounts. Although they're disabled on ours — the standard Windows administrator and guest accounts — everybody can guess or find out one of the Windows standard accounts, and unless you've disabled it, that's a very easy attack vector. And then just brute-forcing things. And then, looking even deeper into a single host, you can see down here at the bottom of the screen you're getting the login failures: unknown user, bad password. In theory it could be somebody just typing an IP address in wrong. However, every single one of these attacks has been stopped with an automated response. You can go even further into blocking IP ranges, geographic ranges, so you don't even get the alerts — but I like the visibility, so I'd alert on only the really high-priority stuff. And you'll notice, once you drill down, there are actually no serious alerts. That proves it's working. So, again, you can take some comfort in that your infrastructure is fairly secure.
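The automated responses described above are configured declaratively in Wazuh. A fragment of this shape in `ossec.conf` drops the source IP for ten minutes when the stock SSH brute-force rule (5712 in the default ruleset) fires — treat the rule ID and command name as examples to verify against your own deployment:

```xml
<!-- Illustrative ossec.conf fragment: block the attacking IP via the
     built-in firewall-drop command when rule 5712 (sshd brute force)
     triggers, and lift the block after 600 seconds. -->
<active-response>
  <command>firewall-drop</command>
  <location>local</location>
  <rules_id>5712</rules_id>
  <timeout>600</timeout>
</active-response>
```

The same mechanism covers the geographic and IP-range blocking the speaker mentions, by pairing responses with different rule IDs or CDB lists.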
Another really useful feature is that you can go into the details of each individual attack. You get a geographic region name, IP address, things like the target users they've tried to brute force on our SSH-based hosts. There isn't a slide for this, but we've extended it, because Wazuh is eminently customizable: we also capture the SHA-256 checksum of the SSH key being used in the attack. We can then determine if it's one of our valid users, because we have all our keys stored centrally and distributed centrally via Bastillion. If it's not one of our keys, we can start blocking SSH keys at that level. So, again, we've extended it to capture that information. Wazuh is basically an ELK-stack-based system: it uses the logging part of it, Elasticsearch, and it just captures all the logs from all the systems. You can customize it to capture whatever you like — your Windows system registry, whatever the Mac equivalent is, audit log, syslog — and it harvests it all into one place. Really nice and easy to query and work with, and it's got the capability of doing dashboards and searches. We're still fairly new to rolling it out and leveraging it for really serious stuff, but I think it's worth sharing even at this stage. And again, more extended audit information: this is from one of our Docker hosts. Somebody there has logged in as root — it's probably me. But again, you can see the kind of information you capture even on successful logins, if you're trying to find out who's doing stuff they shouldn't. And Wazuh itself goes much further. It's got a file integrity monitoring tool, which again you can alert on, so you can track all the changes to key system files. It's got an SCA component, so it will check your system against the NIST databases, look for any vulnerabilities, give you the links to the CVEs, and then the potential fixes, if that information is in the NIST databases. All of that in one happy place.
Worth a look, and if you want some more information about how we use it, feel free to connect with me on the Adoptium Slack after this meeting. Whatever you need. I think we've got about a minute left, so time for one question, maybe. Say we're already using something like HashiCorp Vault, but it's lagging behind in audit capability, and audit capability is something we want to improve and get ahead on. Does Wazuh give us an advantage, or should we do everything in Vault? What is the wisdom there? Okay, so the question is: compared to HashiCorp Vault, what does Wazuh give you? I can't see any reason why you couldn't use both. You could still use Vault for everything you're using Vault for, but what this would give you is the reporting tool on top. Would that work? Yeah, yeah. How much effort would go into it? I've never used HashiCorp Vault, so I really couldn't say. But with Wazuh, say you could get it to monitor your Vault — as long as Vault's putting some logs out for you to monitor, you could customize Wazuh to look at those logs, as well as your system logs, and still use the same visibility features and log harvesting. I don't see why that wouldn't work. So it's string-matching based, right, as long as I have log output? At the base level, yes, it's string matching and regex from log files, but that's just what it ships with by default. You can extend it to do pretty much whatever you like, if you're willing to write it. OK. Right, I think that's it. Thank you very much. APPLAUSE Thank you. Thank you. APPLAUSE Adoptium is an Eclipse Foundation project for the Temurin JDK. Although Red Hat pays my wages, I work full-time on the Adoptium project. So... Wazuh is a third-party tool we looked into at the Eclipse Foundation. I just think it's... Yep, sorry. Sorry. Well, cheers, George. I'll catch up with you later, mate. We looked at Wazuh a little bit and saw it was best for our needs. OK.
And there are good things about being a little bit independent, working for the foundation.
Role of IGA in Access Management with Multilateral Identities
I have two affiliations: one is with Evolveum, which is the company behind the open source IGA system midPoint, and I'm also active in academia, helping scientists across the world get together and solve their identity problems. In this talk I will combine all of my experience on this rather complicated topic. So let's start with some introduction. If we're talking about multilateral identities, we mean basically the whole range of identities that are available to users, because users today have a lot of identities that they own and can use to access systems. It can be an identity from one's institution, but it can also be a social identity — an identity on, for example, GitHub — or even state identities: especially here in Europe, states are pushing European IDs and digital wallets. Then there are academic identities, banks, and so on and so on. There are a lot of them, and all of these identities can be used somehow. The next item in the name of the talk was access management. That's the component responsible for actually giving access to people and doing everything related to access. One thing you can do is, of course, just type your username and password in, but in principle you can use all these identities as well. And then we have IGA, which is identity governance and administration — for those who don't know the term, it's basically an extension of identity management, and its main purpose is to take identity management — rather technical stuff for administrators — to the people who are actually making decisions: managers, or even support staff. Get them in, let them manage what they are supposed to manage, rather than having everything done by technical people whenever the others call them. In this talk, the identity governance system will be represented by midPoint, and I will try to show you how all these pieces fit together and what you can do with the combinations. So let me introduce midPoint as well.
As Zawar said, it's an identity governance and identity management system, and because I'm here, of course, it's fully open source. Usually it shouldn't be important to say this, but when you are dealing with the identity management and access management areas, a lot of the products out there claim to be open source, but in reality they are just open core or something else entirely. With midPoint we are really doing our best to make it fully open source, including all the documentation and guidelines for developers — whatever is needed, everything is open and available to use. The product itself is maintained by Evolveum, and we have a few external contributors. We would be happy to have more of them, but it's kind of hard: an identity management and governance system is a very complex tool containing a lot of code, so it's very tough to get contributors. But luckily we have some contributions, at least to the integration parts — something that is easier to get into. MidPoint is very feature-rich, and I would say it's really comparable to any commercial alternative, so I consider it a big success; we are even recognized by some analyst companies, which is really nice. And, as you would expect from an open source system, it's really customizable and uses as many standards as possible. If you want more, there is a link where you can find all the information. So let's get to access management integration, because this is quite common, but I think there is a lot of potential if you integrate an IGA system with access management. From the IGA to access management, this is the more common path: because the identity management part holds most of the information about users and their accesses, the IGA can naturally provision all this profile information about users to the access management and also provide data for authorization. It might be attributes, it might be roles, or even some combination. So this is quite natural.
The other way around is not that heavily used, but I think there is a lot of potential in it, because access management — especially when using external identities — has a lot of information to pass back to the IGA. If we are talking about a single organization, and you're using a password, you have no new information; but if we are using these external identities, with the identity we usually get some attributes that can be used. If it's a state identity, we at least know this person was verified by the state, and we have some identifier from the state that can later be used — for example, if we are dealing with some big security incident. If we have academic identities, we can get information on whether this person is an academic employee or a student, and again use it for access control later. If we have social identities, we at least have some social identifiers for the person that we can use for some integrations, for example. Or we might have other attributes — names, emails, whatever — that can be used to make the person's life easier: instead of requesting the data again, just use the information we already have. The second thing we can get from access management is access timestamps, because access management of course knows when the user was accessing the system, so we can get these timestamps and work with them later — I will get to this. What are the typical interfaces for the integration? There is no standard, unfortunately, but there are some common options. From the identity management side, integrating with anything usually goes through some kind of connector — basically writing a custom connector to whatever API the access management has — or there can be some middle layer, like, let's say, LDAP or Active Directory, some standard database that the access management can use.
And to get some information back: if there is direct synchronization through a connector, identity management can read it back, or if you want some runtime integration, you can always call some API and do something like that. Let's move to the identity governance benefits. If you are familiar with identity governance, this probably won't be anything new, but I will just repeat it. The very important one is overall visibility. If identity governance is deployed, and you usually deploy it within a single organization, you want to be in control and have some visibility of what is happening in your organization, mostly to tighten your security and be able to go through audits and so on. So the main feature is some kind of reporting, web pages, dashboards: who has access to what and why. You can visualize, for example, if you are using role-based access control, who has which role, what the role is entitled to, in which applications, and why every person has this role. In midPoint we are using something that we call policy-driven RBAC, because RBAC is a very good tool, very easy to visualize and to explain to people, but you need something more in order to work with attributes and automated rules, so we have a kind of extension of RBAC. And if we are thinking, for this talk, about how to use these multilateral identities and the data we can get through access management in the IGA, here is how to use it. The first one is to use these attributes: for example, if I know the person was vetted by the state, coming with a state identity, I may note this as a level-of-assurance attribute, meaning I have big trust in this identity, and based on it I can give some access through RBAC, classically, and then I can visualize it in the standard way using dashboards and really know what the person has access to.
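An automated rule in the spirit of policy-driven RBAC, deriving roles from attributes such as a level of assurance, could look like the sketch below. The thresholds, attribute names, and role names are assumptions for illustration, not midPoint configuration syntax.

```python
# Sketch of an attribute-to-role rule in the style of policy-driven RBAC.
# "loa" stands for level of assurance; values and role names are invented.
def derive_roles(attrs: dict) -> set:
    roles = {"basic-user"}                      # everyone gets the baseline role
    if attrs.get("loa", 0) >= 3:                # e.g. vetted through a state identity
        roles.add("high-assurance")
    if attrs.get("affiliation") == "student":   # e.g. asserted by an academic IdP
        roles.add("student-services")
    return roles
```

Because access still flows through roles, the dashboards described above keep working: you can always answer who has which role and which attribute-driven rule granted it.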
Also, when I have a timestamp of when the user last accessed each system, again through the access management, I can use it to build some policies: either to remove unused accounts and tighten security, or to naturally work with some kind of expiration and renewal of accounts, whatever I need for my particular workflow. And of course IGA wants to automate all of this, so using RBAC, automated rules, provisioning through connectors, and integration with the systems, you make sure that everything I just said is completely automated and you don't have to worry about it. If full automation is not enough and you want some human element, some kind of interaction, you can have approval processes, expirations, renewals, and so on. So let's get to some interesting features about integrating all of this together, and I will start with integrating access management with a given service using just-in-time provisioning, for now without identity governance. What you can do, and this is a very nice trick, is basically create accounts on the fly, because when we are using these multilateral identities, which already come with attributes, we can just pass the identity to the target system, and by passing it we are basically authorizing the identity to access the system; and the system, if it supports it, can create the identity and accounts for it on the fly, use the attributes, and give proper permissions within the system.
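The just-in-time trick described above can be sketched on the service side: the first time an authenticated identity arrives with its attributes, the account is created on the fly. The `accounts` store, `jit_login`, and the assertion fields are hypothetical stand-ins, not any particular product's interface.

```python
# Sketch of just-in-time provisioning inside the target service.
accounts: dict = {}          # stands in for the service's local user store

def jit_login(assertion: dict) -> dict:
    """Called after the access manager has authenticated the user and
    forwarded the identity's attributes."""
    uid = assertion["sub"]
    if uid not in accounts:                      # first visit: provision on the fly
        accounts[uid] = {"name": assertion.get("name", uid),
                         "permissions": ["default"]}
    return accounts[uid]

first = jit_login({"sub": "alice", "name": "Alice"})
again = jit_login({"sub": "alice"})              # subsequent visits reuse the account
```

The simplicity is exactly why, as the talk notes next, deprovisioning becomes the hard part: nothing in this flow ever deletes an entry from `accounts`.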
The tough part here is how to deprovision such accounts, because this creation of accounts is ideal, very simple, you can really use it on the fly, but you have no way to disable these accounts. The only way to do it is, again, for the end system itself to have some kind of expiration, because what is important here is that when a person loses the access, they just stop going to the system; the system never gets that information, there is no way for it to get that information. And with this approach we also have no central visibility of who has an account where and why, which might be tough when doing audits or resolving security incidents: you have to manually go through all the systems. So with midPoint, with the identity governance component in place, we can basically extend this using some extra tricks. The basic premise is that in midPoint we are managing entitled users. I'm not saying the user should have an active account on the target service at this moment; we are just saying he or she is entitled to have it, and whenever the user decides to access the system, again using just-in-time provisioning, we can create the account on the fly using this entitlement that midPoint manages. Also, what is nice is that midPoint supports provisioning and it's really quick, it can be done in real time, so even if the target system doesn't support just-in-time provisioning, accounts can be created immediately: the access management system can basically ping midPoint and say, now it's time to provision this account; midPoint checks if the user is entitled, and if so, triggers the provisioning. So we can have just-in-time support even for systems that don't support it natively.
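The entitlement-gated variant, where the access manager pings the IGA and the IGA provisions only if the user is entitled, might look like this sketch. The data structures and function names are invented, not a midPoint interface.

```python
# Sketch: entitlement-gated just-in-time provisioning.
entitled = {"alice": {"wiki"}, "bob": set()}     # entitlements managed centrally
provisioned = set()                               # (user, system) pairs with live accounts

def on_access_request(user: str, system: str) -> bool:
    """Called by the access management system before letting the user in.
    Provisioning only happens for entitled users, so there is central
    visibility of who may have an account where, and why."""
    if system not in entitled.get(user, set()):
        return False                              # not entitled: no account, no access
    provisioned.add((user, system))               # idempotent real-time provisioning
    return True
```

Unlike the pure service-side flow, the central `entitled` map means the organization can answer audit questions without crawling every target system.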
Also, midPoint, through this provisioning connector, can read some data back, so regardless of whether the account was created by the service itself or through midPoint, midPoint will get the information that the account exists and is active, and we can also read any additional information. Basically, we then have full-scale information for the IGA: we know who has an active account and who is entitled but hasn't activated the account, and we can build all the policies on top of that, including expirations and renewals, work with the last access timestamp, and combine all of this together. For example, if the user doesn't use a system for a long time, which is a bit of a security risk, we can deprovision the account, but we still know the user is entitled, so for the next usage we can still work with that. And now we get to the part with multilateral identities, because that brings just another level of complexity: with multilateral identities we are expecting that a single user can have multiple identities, and we can even combine them. We can say, okay, one identity, for example the state identity, brings your account to a higher level of assurance; we know this account was vetted. Then you can have some social accounts, saying, okay, this is your social ID, and we can integrate with some social systems because we know this ID. If we have some academic scenario, we know the person is a student or employee of a given university, or even of several universities, and you can combine it all together. The tough part is how to correlate these identities, because there is no common identifier, nothing that we can automate on.
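A last-access policy of the kind just described, deprovisioning long-unused accounts while keeping the entitlement so just-in-time provisioning can recreate them, reduces to a simple filter. A sketch, with an assumed 180-day idle limit (the limit is an illustration, not a midPoint default):

```python
from datetime import datetime, timedelta

def accounts_to_deprovision(last_access: dict, now: datetime,
                            max_idle_days: int = 180) -> set:
    """Accounts unused for longer than the limit get deprovisioned; the
    user's *entitlement* stays in the IGA, so the account can come back
    through just-in-time provisioning on the next visit."""
    limit = now - timedelta(days=max_idle_days)
    return {uid for uid, ts in last_access.items() if ts < limit}

stale = accounts_to_deprovision(
    {"old-account": datetime(2023, 1, 1), "fresh-account": datetime(2024, 5, 1)},
    now=datetime(2024, 6, 1),
)
```

The access timestamps themselves come from the access management side, as described earlier in the talk.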
In midPoint we support this with something we call smart correlation, and it enables you to configure how the individual accounts should be correlated. You can base it on the source: if, in the source, you know this is an email which was verified and you are happy to correlate with existing accounts based on this email, you can set up this rule; or you can set up some fuzzy rule, like matching on name, or even fuzzy matching that accounts for typos and things like that. But this you probably don't want to fully automate, because there is some risk as to whether it's really the same person, so you can also define which matches should be processed in an automatic way, just connecting these identities together, and which rules need some human interaction. And there are two ways to do that. If you want strict control, it can be done by an administrator or some other delegated responsible person, basically manually, with whatever process you need: you will just see, okay, these are the attributes, this is the new identity, here are some potential matches, and you have to decide whether one of the matches is real or whether you want to create a new account for this user. The second option might be to again use the access management part, because the user in principle owns all these identities and can use any of them to sign in. So let the user sign in with the first one and then the second one in the same session, and then we know for sure that the user owns both identities and they can be connected together.
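A rough flavor of typo-tolerant correlation with automatic versus human-review thresholds can be shown with the standard library's `SequenceMatcher`. The thresholds 0.95 and 0.80 are arbitrary illustrations, not midPoint's smart-correlation defaults.

```python
from difflib import SequenceMatcher

def correlate(candidate: str, existing: list,
              auto: float = 0.95, review: float = 0.80):
    """Return ("match", name), ("review", name) or ("new", None),
    depending on how close the best name match is. SequenceMatcher
    tolerates typos, which is exactly why near-matches should go to a
    human instead of being linked automatically."""
    best, score = None, 0.0
    for name in existing:
        s = SequenceMatcher(None, candidate.lower(), name.lower()).ratio()
        if s > score:
            best, score = name, s
    if score >= auto:
        return "match", best
    if score >= review:
        return "review", best        # needs a human decision
    return "new", None

people = ["John Smith", "Jane Doe"]
```

For example, `correlate("Jane Doe", people)` links automatically, `correlate("Jon Smith", people)` lands in the review queue, and a completely unrelated name creates a new identity.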
Also, what is nice here is that you can combine all these external identities, state, social, academic, with local ones. If I have deployed IGA, usually within a single institution, I have some local accounts managed by the HR department, so even the combination of these local accounts and the remote identities is possible using the same principle; there is nothing really different there. And what I can do with that is build some kind of unified profile: take all these attributes that I'm getting from different sources and build a single user profile, because I don't want to work with a user who has six names from six sources, most of them exactly the same, the same value, although maybe sometimes, if your name comes from a social network, you have a different spelling because you like it, or something like that. So we can gather this data and then define a formula for how to build a single user profile: how to select which name is the one, how to select which email is the one that should be used. If we want, we can just build this one profile, which is always handy to have if you don't have any special requirements, but we can also have some extensions of this profile. For example, we can have an official email within the institution and then a personal, preferred email, and then we can decide, based on the target application, which one of the emails should be used and provisioned. All of this is possible with midPoint: you just put in the rules for how it should be processed. Of course, the most difficult part is deciding what you want, because we want to keep it simple so people can understand it, and also give users an option to select, for example, their preferred email or preferred address, at least for the systems where this is not that important. And what we can also do, thinking about these rules and how to combine all this data together, is put some organizational policies in, because it's really nice to have if you
let users use this freedom to select their preferred name or preferred email. But sometimes we have some systems where we really want to enforce strict rules, because this is something that is, I don't know, sent to authorities for some validation, or it might be tied to your payroll and you want to have real data there. But then you can have, say, a company social network and give users the freedom there. So when constructing these rules, we can combine an organizational policy with some user preferences, and even decide, based on the target system, which values should be used where. Sounds complicated? It is. But it's all about programming it and putting it together for your organization, again with the end goal of fully automated processing at the end, with some minimal user inputs, user preferences, and so on. It's not complete; there are still some missing pieces. We were experimenting with running some demos and improving midPoint as a product to support this better, but for sure it's not fully finished. The biggest issue is user experience, because a lot of these options, especially dealing with external identities where users need to sign in and actively work with them, are hard, and this will be hard for a while. But it's getting better as people get more and more used to working with their identities and using one identity to sign in to a completely different system. Now, with the push for European eIDs and so on, people will keep getting used to this principle and it will get easier. Also, the interface between access management and IGA is not well defined now; we are just writing custom integrations on both sides, depending on the needs. For sure it would be better to have some prepared interface that we could use to connect our product, midPoint, to existing access management systems; that would be really handy. Also, the life cycles of the individual identities, because we
are combining different identities into a single profile, and we should also think about the life cycle of each individual identity. Some of them are pretty persistent, like the state ones, but for others, if I know that someone is a student, I should probably verify this statement once in a while, and I can put some policies or conditions on it. It would be nice if the protocols we are using supported this, so we could for example query each day; but with protocols like SAML, basically until the user signs in again you don't know the current state of the information. So having some expirations, some renewals here as well would be really nice. Also, the whole assurance and trust model in this might be very complex: again, we are working with different sources of information, and which are trustworthy, which are not, how we can process them, how we can use them, what our assurance on this information is; it's difficult to even decide what we want to do, and once we have that decision, the essential thing is how to process it. We experimented with a small project which we called midPrivacy, and it was about attaching metadata to each value that we are storing: the source of the value, the assurance level, and also, for example, how it can be used within the GDPR framework. Having this all tied up and, again, automated, so we can use it in automated processing and provisioning rules, would be really nice. We started it as an experiment just to get some feeling for it; it's fully available to people, but as far as I know nobody has tried to put it in practice yet, which is a bit of a pity, and again, there is a link if you want to read more about it. So, just to conclude, and I hope to leave some time for questions: it is really possible to combine these worlds and tightly connect an identity governance system with access management, and it basically unlocks potential for new features. Nowadays there are a lot of identities that people
can use to sign in to our systems, state identities, banks, and I'm expecting there will be more and more of them, and people will get more and more accustomed to using them, especially with these eIDs on the European level. So this is something that we should be prepared for, and IGA, even though I think of IGA as mostly living within a single institution, to make sure everything is tied up, everything is well ordered, everything is automated and can be audited, can work very nicely in this world of multilateral identities and bring the same benefits from the IGA to this world as well. But having a full implementation covering all the angles is complex, and it will probably take some time until we all get there. midPoint is kind of halfway through, and I don't mean exactly halfway; we have something now that can be used and experimented with, but to reach the maximum potential, the product will need to improve as well. And because everything is open source and available, all contributions are always welcome. So thank you for your attention, and we have a few minutes for some questions. Yes: the question was whether we have some machine learning on our roadmap. We are already experimenting with that, though not for this particular problem. We decided to start with role mining: within the existing roles that you already have, how to mine some business roles out of them, because if you are migrating towards IGA, you have a lot of manually managed roles, and it's good to build some business roles that can be easily managed, and we are using machine learning principles for that. It could be good to use it, for example, for this identity matching, but so far the current approach has been enough for our customers. Yes: so this identity management is only one side of the picture, because if I have a user, he might be a sysadmin or whatever, he's leaving traces in the applications you grant access to, okay? So if this person is now
leaving the company, the institute, the university, how do you deal with the traces? Do you have a mechanism to, say, scramble the username and change the username in the application, so that if it's reused... yeah, reuse of usernames in the target application might be a big problem. Yes: so the question was, we have all this in place, and what can we do when a user is leaving the organization, with his or her data? Scramble it, remove it, something like that. And this is a tough question, because one part is the application itself, and when you have this automated identity management and identity governance system in place, you can usually deprovision the data completely out of the application. But then you have this central point, the identity governance system, and the question there is how long you want to keep the data, for security incidents for example, and that's valid; probably you want to have it unscrambled for a year or two, depending on your policy, and then you should again automate the process of either scrambling the data or completely getting rid of it. I would say except for identifiers, because especially if we are talking about usernames, you probably don't want to reuse them, or at least not within a certain period of time, so I would recommend keeping those. Yeah, but it's a bit more than that: like in the talk before, we have some web application and some person is creating a dashboard, so within the application it belongs to that person and everyone else is using it. So if I just delete it, but the creator is gone, it's more complex than this. Yes: so the comment was, if a user creates something like a dashboard in a web application that others are using, and the original owner leaves, can we delete it or not? If you are within an organization where you have complete control over your users, you need some process to pass this work to someone else, and I would say you have to have a process for leavers: in the same way you are returning
your keys to your office, you should also return all your digital assets or transfer them to someone else. But what you can at least automate in this case, if you have something like a dashboard, is a process that, before the deletion, sends a notification or lets someone approve it, and that could help automate it. Okay, time is up. Thank you, and we can continue this question later.
FusionIAM - a full Open Source Identity & Access Management solution
So, we're going to start our next talk: FusionIAM, a full open source identity and access management solution. Bonjour, je suis français, but I will speak in English, okay? No problem. Yes. So, some words about me. I'm Clément Oudot, I work in a French company called Worteks, and I'm doing a lot of stuff about identity management, of course, because I'm here to talk about it. I'm also doing other things, like music. If you want to listen to French open source music, under Creative Commons, you can go to my website; I also do theater and other things. Very quickly about Worteks: we are a service company and we provide many solutions, like collaborative tools, containers, and of course identity and access management, and that's what I will talk about. And if you want not to play music but to work on open source, you can apply on our website. So, for the topic today, I will talk about the FusionIAM project, explain why we created this project, and which open source components we use to try to build this big solution. We decided to create this with Benoît Mortier, who is the leader of FusionDirectory. I don't know if you know the FusionDirectory product; who knows it? Okay, not many people. So, it's cool that you came here, because you will learn about it today, together with Worteks. We are both people working on open source products around directories and identity management. The goal was to offer a complete identity and access management solution, because you know that with proprietary solutions, when you buy one, you get all the components of identity and access management. But if you are using open source tools, most of the time you only get one piece of the full picture, and you need to install the pieces and connect them. So, our opinion was to say: okay, we know that in open source, each product must do one thing and do it well.
But if we want to be able to go to companies and say we are doing identity and access management, we must provide a full, integrated solution. So, that's the reason for this reflection. And today we are working on this project at Worteks, David Coutadeur and myself. So, who knows OW2? Okay. It's normal, because it's a French consortium, like Eclipse; you know Eclipse, but you don't know OW2. So, today you know OW2. There are a lot of products inside OW2, BlueMind, GLPI, LemonLDAP::NG, et cetera, and we are an official project of OW2. So, one option when you want to make a new open source software project is to say: okay, everything that exists is a mess, so I will write everything myself. But of course, I have a family, so I don't have the time to write everything from scratch. So, we took all the open source projects that we know and we tried to combine them together. The one you may know is this one, OpenLDAP. Who knows OpenLDAP? Okay, yes, everyone. Of course, we are not the developers of the OpenLDAP software; it's managed by the Symas company and Howard Chu, who is the leader of OpenLDAP. But we are very involved in the community and we work a lot with OpenLDAP. So, our choice for the directory server, which is clearly the base of identity management, is OpenLDAP. And then, we add a lot of products. So, LemonLDAP::NG, who knows it? Ah, yes. And we have the founder of LemonLDAP::NG, Xavier Guimard, here; so we have some of the LemonLDAP::NG community here at FOSDEM. I will explain all of this: FusionDirectory, LDAP Tool Box, LSC. Okay, it's normal, because these are the products that I created. So, these are all community projects, open source projects. Of course, maybe you know only this one, but you will see how we try to combine them. Our approach was to say: okay, we can be like IBM, HP, et cetera, and we can go to your company and say, we have all the components.
So: access management, the access manager; the directory server; the directory manager; synchronization, the connectors; and two other components, White Pages and Service Desk, which I will present. So, that's a typical big proprietary IAM solution, okay. But we put all the open source software behind the same picture. So, of course, the directory server is OpenLDAP, but we added some tools in the LDAP Tool Box project to better manage OpenLDAP, to do backups, et cetera. The directory manager is FusionDirectory, the connectors are LSC, the access manager is LemonLDAP::NG, and the other tools are parts of the LDAP Tool Box project. Of course, I will present them, but I know that you know other software to do that. Typically, here, the best-known open source access manager is Keycloak; who knows Keycloak? Of course, everyone knows Keycloak, but I will explain why we chose this one. We made another choice for the single sign-on product, and this is LemonLDAP::NG. And for this one, of course, Evolveum midPoint, which we just saw before, is another possibility here for the directory manager, et cetera, et cetera. So, everyone can choose which technical components to bring into their identity and access management. We made this choice because we are clearly developers of a lot of these components, so we can act on the roadmaps of these components, and we know how to make them work together. So, if you choose FusionIAM, you take the choice we have made; if you do not agree, you can just fork it and replace the components you don't like. From a technical point of view, if you have already installed Keycloak and a directory server, you know that it's quite simple. All components are linked to the directory server, because that's where you have your users, passwords, groups, et cetera. And here you have the connectors, to be able to synchronize from a database or from an Active Directory, for example; it will go into the LDAP server. These are the tools to manage the data.
So, White Pages, just to display the photo, et cetera; here, to be able to reset passwords; here, to create accounts, et cetera. And the access manager is also connected to the directory server to do the authentication. All of these are LDAP or LDAPS flows. You have just one database, used by the access manager to store the configuration and all the sessions; the other tools do not need any database. All the tools are only using the directory server. And of course, you have the access manager here, so the end user will only see the access manager part, to be able to access all the data here and also to access all the components. So, some explanation of the software. The first one, everyone knows. As Tina Turner said, it's simply the best; I hope you all have the song in your head now. It's the best LDAP server in terms of performance and standards compliance, because the people coding OpenLDAP have also written part of the RFCs of the LDAP protocol, so we are sure that this component respects the LDAP standard. And if you manage your OpenLDAP yourself, you know that you can add a lot of features with overlays, like the password policy overlay, which is very important in identity and access management to be able to expire accounts, lock accounts, et cetera, et cetera. And we will see that we bring other tools to manage the OpenLDAP password policy. And in the LDAP Tool Box project, we provide packages to install OpenLDAP on different distributions. You may know, are there people from Red Hat here? Okay, it's not a problem. But you may know that Red Hat has chosen to push OpenLDAP out of the distribution, in order to use Red Hat Directory Server as the main directory server. So, if you want to install OpenLDAP with a package on CentOS, et cetera, you can use the Symas packages or the LDAP Tool Box packages. And of course, we also provide packages for Debian, Ubuntu, et cetera.
Okay, so the directory is done. For the directory manager, we chose FusionDirectory. It's a PHP application. It's not like phpLDAPadmin, which is a very technical tool in which you browse the tree, et cetera; here, you have a functional view of all the objects that are in your LDAP directory. So, of course, users and groups, but you can also model the services, accounts, applications, et cetera. So, it's a very functional view of this. And it includes administration delegation, so you can say: this person is connected to this interface, but they can only manage the people in their department, et cetera. So, it's like midPoint or other software like this: it offers a user interface for people to read, edit, and administer data, depending on their rights. The connector, LSC, has no UI; it's just a command line, but it's a very powerful tool written in Java, and it talks to REST APIs, it talks to databases, it talks to Active Directory. So, we are able to easily synchronize OpenLDAP and Active Directory with this tool; very efficient. And LemonLDAP::NG: the Keycloak killer! No, I know it's not, but okay. It's like Keycloak, but we provide an application menu and we manage all the access control. White Pages is an easy way to display the data of your directory for end users, to search for a phone number or email address. These are only LDAP data. So, I created an LDAP directory with Star Wars data, and you can display them, search for the Empire, the Jedi, et cetera; but there is no database, it's only an LDAP directory. And Service Desk is a little tool for the support team: you can see all the password policy data from OpenLDAP. If you work a little with the OpenLDAP password policy, you know that it's very technical to understand how the state of the password is managed. Here, you have all the dates, et cetera, and you can test the current password.
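The password-state summary that Service Desk displays can be derived from standard OpenLDAP ppolicy attributes such as `pwdChangedTime` and `pwdAccountLockedTime` (those attribute names are real ppolicy operational attributes; the helper below and the way it takes `pwdMaxAge` as a plain integer are just a sketch).

```python
from datetime import datetime, timedelta, timezone

def parse_gtime(value: str) -> datetime:
    """Parse an LDAP GeneralizedTime like 20240101120000Z."""
    return datetime.strptime(value, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)

def password_state(entry: dict, pwd_max_age: int, now: datetime) -> dict:
    """Summarize ppolicy attributes the way a support team needs them:
    is the password expired, is the account locked?"""
    changed = parse_gtime(entry["pwdChangedTime"])
    expires = changed + timedelta(seconds=pwd_max_age)
    return {
        "expires_at": expires,
        "expired": now >= expires,
        # presence of pwdAccountLockedTime means the account is locked
        "locked": "pwdAccountLockedTime" in entry,
    }

state = password_state({"pwdChangedTime": "20240101120000Z"},
                       pwd_max_age=90 * 24 * 3600,        # 90-day policy
                       now=datetime(2024, 6, 1, tzinfo=timezone.utc))
```

This is exactly the kind of date arithmetic that is painful to do by hand against raw LDAP entries, which is the point of the Service Desk tool.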
Of course, you can reset the password; if the account is locked, you can unlock it; you can see that the password is expired, et cetera. So, it's very easy for a support team to know whether an account is expired or locked, to unlock it, et cetera. So, moving to the cloud, because that's how we need to work now. Why? Because before, and we still do it for customers, we had virtual machines, and we deployed all the packages and configured all the packages, saying: okay, the LDAP directory is here, and you need to connect to this web server, et cetera, et cetera. And when you want to put in the logo of the customer, you need to put the logo in every product, so that the customer says, okay, it's integrated. This still works, but it's a lot of work, indeed, to reproduce this for every customer; you need to script all of that. And the cloud approach is to say: we will move from packages to containers, images, and we will try to configure all the images, all the containers, through variables. And indeed, we saw that the LDAP server is the same for each component that needs to connect to it, so I only need one parameter, the LDAP URL, for all components: I configure it once and then I have the full solution. Of course, when you do cloud, okay, it's a mess. We need to have pods, we need to have volumes, et cetera, et cetera. So, you see that what was quite easy with a few bricks and components is not so easy in the cloud, because you need to identify which volumes you need to run the containers, and when you split a web application, you usually split it between the front end and PHP-FPM, or the back-end LDAP server. But it's better, because we can run all these images, and so, of course, for LDAP we have a volume for the data, a volume for the configuration, and also one for the certificates, the CA certificates.
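The "configure once through variables" idea boils down to every container reading the same environment variables. A sketch with invented variable names (FusionIAM's actual images define their own parameters):

```python
import os

def component_config(env: dict = None) -> dict:
    """Every component of the stack derives its LDAP settings from the
    same shared environment, so the LDAP URL is set exactly once.
    Variable names here are illustrative, not FusionIAM's."""
    if env is None:
        env = dict(os.environ)
    return {
        "ldap_url": env.get("LDAP_URL", "ldap://directory:389"),
        "ldap_base": env.get("LDAP_BASE", "dc=example,dc=org"),
    }

cfg = component_config({"LDAP_URL": "ldaps://ldap.example.org:636"})
```

Each container image, whether it is the access manager, FusionDirectory, or White Pages, would run the same kind of lookup at startup, which is what makes a single shared parameter sufficient.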
And so, in the FusionIAM project, we identified all of this, and we created all of these images and volumes. So, you just need to do make run, and it's running. We have a container registry; it's open source, it's available. You can just pick the images and run them with Docker, Podman, or Docker Compose, so it's very easy to test. You can also clone the Git repository and run everything with the Makefile, and it works. The only thing you need to do is initialize the volumes and, of course, put in some configuration for your domain, et cetera, et cetera. But you just have to do this, and you will have the full identity and access management stack running and configured. So, it's very easy. At Worteks, we chose to create a new offer of identity as a service, and we put FusionIAM in our cloud for our customers. So, for each customer we run one FusionIAM instance, and a customer who doesn't have any directory, doesn't have anything, can connect all their applications through SAML or OpenID Connect, and has all the applications to manage the data inside the enterprise directory, which is hosted in the cloud. And we, of course, have a lot of REST APIs. We have REST APIs for provisioning, to create accounts, to create groups, et cetera, so you can do all this with the REST API. And we also have some REST APIs to be able to create a new OpenID Connect client or a new SAML client. So, you can provision the users and the groups, and you can also provision the applications through the REST API. And, okay, I know I have five minutes; yes, I can do a demonstration. Okay. Ta-da! So, it's not a screenshot, okay? It's a real interface. It's hosted by Worteks, running on OpenShift, which is Kubernetes from Red Hat. And this is the login form; you see, it's the access manager component, LemonLDAP::NG. And inside, we plugged in all the IAM components.
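A provisioning call against such a REST API reduces to building a URL and a JSON body and POSTing it. The endpoint path and field names below are hypothetical, not FusionIAM's documented API; only the request construction is shown, so the sketch stays self-contained.

```python
import json

def create_user_request(base_url: str, uid: str, cn: str, mail: str):
    """Build the (url, body) pair for a hypothetical user-provisioning
    endpoint. A real client would POST `body` to `url` with an
    Authorization header; the route and fields here are invented."""
    url = f"{base_url}/api/users"
    body = json.dumps({"uid": uid, "cn": cn, "mail": mail})
    return url, body

url, body = create_user_request("https://iam.example.org",
                                uid="kcobain",
                                cn="Kurt Cobain",
                                mail="kcobain@example.org")
```

The same pattern covers the other provisioning endpoints mentioned in the talk: groups, OpenID Connect clients, and SAML clients would just use different routes and payload fields.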
So, this one is for configuring LemonLDAP::NG, just the administration interface. And, okay, so you get all the parameters. And here you have the other components. So, this is how we manage the users. What you can see is that you can work with departments and branches in the LDAP directory. And I create, for example, a Kurt Cobain account, because, okay, it was 30 years ago that he died this year. So it's a simple account, okay? I'm an administrator, so I have this view. But if I'm an end user and I want to browse the directory and see the information of Kurt Cobain, I can browse it through web pages. But this is clearly the same data. You see that you can also browse the groups, for example with Britney. And this tool is wonderful because we can dynamically use the postal address inside the LDAP directory to display people on a map. It's a nice feature for an intranet, for example when you are all in remote locations: you just put in people's postal addresses and you can see them on the map. It's quite nice. And, of course, you can click and see, okay, this one. And if you are in the support team: okay, Britney Spears has lost her password. So, Britney... okay, I can reset the password. I'll say, okay: "Baby, one more time", okay? And the password was changed. We activate the flag, so she must change the password at the next connection. All this is managed by the OpenLDAP password policy. And when she connects through any of the components, the component will respect the password policy and she will be forced to change the password at the next login. Okay, that's all for the demonstration. Thank you. Some questions, maybe? Yes? Can you change Active Directory passwords from this? It's a feature that we did not implement. Can we use this component?
This component is the LDAP Tool Box Service Desk; changing passwords on Active Directory: not yet. A lot of people are saying, okay, this is wonderful, but I don't have OpenLDAP, I have Active Directory and I want to use it. So it's on the roadmap, but it's still not available. But, for information, this one has some hooks, so you can reset the password in OpenLDAP and also hook in a change at the same time in Active Directory. If you have both directories, OpenLDAP and Active Directory, you can use the hook to push the password to Active Directory. But if you only have Active Directory, you can, for the moment, not use it. It may be the case next year. Maybe next year. Yes? Do you support private ACME servers for the certificates for these web services? Sorry, private what? ACME. ACME, like Let's Encrypt. Do we support ACME or Let's Encrypt? Of course, yes, because you just have to run it in the container. How do you handle applications which cannot use OpenID Connect or SAML, and where you use host headers and authentication? Yeah, so how do we manage applications that are not modern applications using either SAML or OpenID Connect to do single sign-on? LemonLDAP::NG is also compatible with the CAS protocol, so we can also use CAS, but in the cloud we consider that CAS is not secure enough. And we have a component in LemonLDAP::NG called the Handler, which is an agent that you can install on your infrastructure and that can communicate through REST with the portal in the cloud. So you can secure some local applications with an agent on your side and let the agent deal with the session, et cetera, through the REST API of LemonLDAP::NG. So we can do a mixed mode between the cloud and your local applications. It's over? Last question. Last question, a very good question. Can we authenticate users using certificates, personal certificates? Yes, you can use.
The question is, can we authenticate users with certificates? Yes, LemonLDAP::NG can use certificates, Kerberos... We are compatible with second-factor authentication, WebAuthn, et cetera. So we have a lot of methods. It's like Keycloak, but it's French. Time's up. Okay, thank you. Thank you.
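The hook mechanism described in the Q&A above (a password reset in OpenLDAP triggering a matching change in Active Directory) can be sketched as a simple post-change callback. The store classes and wiring here are illustrative, not the actual implementation:

```python
# Toy in-memory directories standing in for OpenLDAP and Active Directory.
class DirectoryStore:
    def __init__(self, name):
        self.name = name
        self.passwords = {}

    def set_password(self, uid, password):
        self.passwords[uid] = password

def change_password(primary, uid, password, hooks=()):
    """Write the new password to the primary directory, then run hooks."""
    primary.set_password(uid, password)
    for hook in hooks:
        hook(uid, password)  # e.g. mirror the change into Active Directory

openldap = DirectoryStore("openldap")
active_directory = DirectoryStore("active-directory")

# Hook that pushes the same change to AD, as the speaker described.
change_password(openldap, "bspears", "s3cret!", hooks=[active_directory.set_password])
```

The key property is that the primary write and the hook share one code path, so both directories stay in sync on every reset.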
Add user self-management, brokerage and federation to your infrastructure with Keycloak
Adding user self-management, brokerage and federation to your infrastructure with Keycloak. Keycloak has been mentioned now and then in the previous talk; it was great to hear. I'm Alexander Schwartz, I'm just Alex, I'm working at Red Hat on the Keycloak project full-time and I'm also a maintainer since last year. I've been using Keycloak for several years. When I was back at an IT consultancy, we were building applications and using it as an identity and access management solution, and back in the time a lot of customers did not have Keycloak, so when we brought an application in there, a custom-built one, we put Keycloak next to it to do the IAM stuff. Over time, when we built applications for customers, they already had Keycloak, so it was great. Two years ago I joined Red Hat to work full-time on Keycloak. What do I do at Keycloak? A lot of performance testing, database stuff, also a bit of LDAP. Keycloak has so much to offer, and when I was reading the corporate presentations, they were stating things about federation and LDAP, and I thought, yeah, I could present you this today. And this is what I will do: presenting what already exists in Keycloak and also some of the things that will arrive in the next version. The current version is 23 and the next version is Keycloak 24, and you can already download the things that are shown today in the nightly build of Keycloak. Right. The agenda that I brought for today is more like a journey that I saw customers going through when they enter the identity and access management space. Day one is: single sign-on is cool, I need only one password to access all my services. That's where it all starts.
Day two is: I need to get a bit more flexible, because I have maybe one directory with users, maybe multiple directories of users that I want to integrate, lots of applications. And then day three: I want to eliminate the daily churn, like resets of passwords, through user self-management, and that's especially where the things come in that we have in Keycloak 24 around user self-registration and declarative user profiles. So why is single sign-on cool? As I said, users need to remember only one password, and they authenticate only once a day. In the morning, usually, when they get to work, and then, depending on how you configure it, maybe it's valid for 24 hours, for 10 hours, for eight hours; that's the policy of the company. Then they can access all these applications over the day with the credentials they entered. And usually a password might not be enough, so you have a second factor: one-time tokens, maybe a mobile app that generates these small codes, FIDO keys, WebAuthn and all that stuff. Maybe some applications need it and other applications don't need it when you access them; maybe you want to re-authenticate during the day when you access a special application. All those things come with Keycloak. And, not the least thing, but usually somewhere in the middle of deploying Keycloak to your organization, you want to theme the front end, right? It should have at least the colors, maybe the logo of your organization, to make your users feel at home. It might seem like a small thing, but it really helps the acceptance in an organization. So I say: even if you're deploying a single application and need identity and access management for it, it makes sense to deploy Keycloak, because then you don't need to reinvent it yourself. And doing user management right, with all the bells and whistles, is not an easy thing.
So how does Keycloak work in the end? You have a user with maybe a mobile device, maybe a regular device, and they log in with Keycloak. Keycloak presents a login screen, does the handling of all the second factors that you configured, and then the user sends from their browser a token to the services in the cloud, whatever they are. The application can then either check the token directly, by inspecting the token's cryptographic signature and the timestamp, or it will send this token, for example, back to Keycloak to figure out who that user is and retrieve some additional information. This is possible. You might also use that token when you're integrating other authorization services, like OPA or something like this, which then decide: is this user allowed to access this service or not. So that's the basic setup. Keycloak you can deploy as a single container connected to a database of your choice, be it Postgres, MySQL, MariaDB, Oracle, or MS SQL Server. Usually, as an admin or even as a developer, you don't have a choice: an organization has usually chosen a database they know well, how to do backups, how to restore, how to operate it, so we give you a choice which database to connect to. And then you have Keycloak either deployed as a single binary or container, or you deploy it using an operator with a high-availability setup on the Kubernetes of your choice, or the bare metal of your choice. And this is what users then usually see when you don't customize the login screen: it's a username and password, right? And once I log in, let's see if the demo goes with me... so I'm logging in here, maybe it's expired... oh, it hasn't expired yet, so I get an admin screen, where I can set up clients (clients are basically applications), and I have client scopes, users, groups, and roles somewhere as well, right?
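The direct token check mentioned above (inspecting the cryptographic signature and the timestamp) can be sketched with the standard library. Note this sketch uses HS256 with a shared secret purely for illustration; Keycloak normally signs tokens with RS256, and real services verify against the realm's public keys (JWKS) rather than a shared secret:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_token(claims, secret):
    """Build a compact JWT (header.payload.signature), HS256 for illustration."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url(sig)}"

def verify_token(token, secret):
    """Check the signature and the expiry timestamp, as a service would."""
    header, body, sig = token.split(".")
    expected = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise ValueError("bad signature")
    pad = "=" * (-len(body) % 4)
    claims = json.loads(base64.urlsafe_b64decode(body + pad))
    if claims["exp"] < time.time():  # the timestamp check from the talk
        raise ValueError("token expired")
    return claims
```

The alternative path the speaker mentions, sending the token back to Keycloak, corresponds to the OAuth 2.0 token introspection endpoint.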
I can configure all of this in a web UI, and in a very basic installation it will just be stored in Keycloak's database, and Keycloak will take care of all that. So, yeah, that's a simple start: you have your application, it's secured, it's all working well. But then, you usually don't start on a green field, that's very rare, so you need to become a bit more flexible in what you're doing and integrate with all the existing stuff that's already in your organization. For example, there might be one LDAP, there might be many LDAPs in your organization; whenever there's a merger, there might be other LDAPs joining, other user directories joining, that you want to integrate with. And there's Kerberos, so people might be already authenticated on their machines, especially in corporate environments. There might be some service around in your organization, or external to it, that only talks SAML, while your applications want to talk OpenID Connect, so it's great to put Keycloak in between. There might also be other OpenID Connect things; but then, why would you put Keycloak in between OpenID Connect and OpenID Connect? Well, Keycloak can translate it to SAML, or Keycloak can give the right tokens to the right application, because maybe this one application is on a special diet and requires this or that attribute in its tokens, and Keycloak can arrange that so this application finally works.
You can also create your own extensions to Keycloak; for that you need to familiarize yourself with a bit of Java, and then you can integrate custom stores. You might have systems that are called legacy, usually for a good reason: maybe they are old, but the customers are known to those systems, they make money, you can't shut them off, and you want to integrate Keycloak with those existing user stores. You can do that: you can connect to a database directly, call some REST services, wherever you get this information from, and make it work. And we might hear later today about SCIM integration; all that is possible by adding extensions to Keycloak in this area. So we use everything that is already there and integrate and connect with that. That's essential on your day two, when you say: yes, Keycloak is cool, single sign-on works, but now I need to integrate with a lot of stuff, and Keycloak hopefully makes that a lot simpler for you. All right, so here are some diagrams around that: identity brokering, Kerberos, SAML, OpenID Connect, you can connect to those, and we can show that in the demo shortly. The good thing about Kerberos is that your user might not see Keycloak at all. The user tries to access the application, the application wants to get an OpenID Connect token or some SAML token, it forwards the browser to Keycloak, Keycloak will negotiate with the browser that the user has already logged in using Kerberos, and then it will not even show the login screen but forward directly back to the application with the right token, so the user can continue. The user will never see the login screen; that's Kerberos. But on the other hand, if on the system the user is currently on, Kerberos is not configured correctly for whatever reason, it will fall back to a login screen and you can use the regular credentials, and then, as we'll see in a second, maybe use those credentials and verify
these credentials against an LDAP. So it's like Kerberos, but without the Kerberos: it works the same way with the same credentials in the end. We can get all these social logins integrated; with those, the user usually has a login screen where they pick the social login provider they want to use to authenticate. It might not be the right thing for corporate environments, but it might be the right thing when you are integrating your public-facing website with users coming around. And federation, as I said: OpenLDAP, Active Directory, custom user stores. You can have none of those, when you want to store things only in Keycloak's database; you can have one of those; but you can actually have multiple of those as well. I wish, or I hope, for you that you have a simple environment, but on the other side you can't really choose when, I don't know, another merger is coming around the corner, and then you might have another directory to integrate, or maybe a customer has some users they want to bring and you want to integrate those as well. So, looking at the demo: here you can add identity providers, that would be OpenID Connect and all the social logins that you want to integrate with, either custom or predefined with some sensible defaults. Under user federation I have already configured LDAP here; I'm running an Apache Directory Server locally on my machine, because it was simple to set up, the usual LDAP I'd say. I can choose if it's read-only, writable or synchronized; all these things are here. And then, not all LDAPs are the same, so they need some special configuration, seen here, and you can configure it so that it matches your organization. There are usually also some mappers, because there are lots of attributes in LDAP that you want to leverage, either to put them into
the tokens that you want to pass on to the applications, or to expose at the userinfo endpoint where the application can then query them if you don't want to put them in the token. All these things can be configured and mapped here, per realm and per LDAP connection, as needed. Eventually you can also configure which application should get what kind of attribute in what kind of token. But then it's the real world catching up on this: the simpler you can make your setup, the better off you'll be, but on the other hand you need to make it work with the things you have, and we hope that we built Keycloak in a way that it's not standing in your way. So let's go on to day three: eliminating churn. All these repetitive tasks that you have to do every day when it comes to users are annoying for admins and also annoying for users; ideally they want to do these things themselves, they don't want to be bound to the opening hours of IT or some such thing. I'll show in a minute users' required actions: as an admin you can choose... well, as an admin you might have sent out an email, "please enable a second factor", and you sent another email saying "please finally enable a second factor for login", and then you say, well, now's the time: I go through maybe all of my users, or some of my users, and on the next login they must enable the second factor, no matter what. You can do that as an admin and you're done, because no one will enter your system without a second factor enabled.
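The multi-directory federation described above behaves like an ordered lookup chain: the local database and each configured user store are consulted in turn until one knows the user. A minimal sketch, with illustrative store names and contents:

```python
# Toy user stores standing in for Keycloak's database, an LDAP, a legacy DB.
class UserStore:
    def __init__(self, name, users):
        self.name = name
        self.users = users  # uid -> attribute dict

    def lookup(self, uid):
        return self.users.get(uid)

def find_user(uid, stores):
    """Return (store_name, attributes) from the first store that knows uid."""
    for store in stores:
        user = store.lookup(uid)
        if user is not None:
            return store.name, user
    return None

stores = [
    UserStore("keycloak-db", {"admin": {"email": "admin@example.com"}}),
    UserStore("corp-ldap", {"alice": {"email": "alice@corp.example"}}),
    UserStore("legacy-db", {"bob": {"email": "bob@legacy.example"}}),
]
```

The per-store mappers mentioned in the talk would sit inside `lookup`, translating directory-specific attribute names into a common shape before the result is returned.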
Also password recovery: you can add a link to the login screen, we will do that in a second, so that for password recovery you send out an email and the user can click on a link. It works with the internal database of Keycloak, but it will also work when the user is in an LDAP or in an Active Directory; all these bits work when you're using the password recovery mechanisms of Keycloak. Also, in a corporate environment you might not want people to self-register, right? They probably need to sign a paper contract first. But on the internet, on the public-facing side, you want people to self-register; again, this is something that comes with Keycloak. And once you're registered, you want to maintain the data yourself as a user: maybe update your mailing address, your blog, your social handles, whatever. All these things should be managed by the users themselves, and Keycloak allows you to do that. This is something that greatly improved over the last releases: in Keycloak 23 you can enable it as a preview feature, and we are pretty sure that we will have it in the final release of Keycloak 24, enabled by default, so that you can really use it in a very good and configurable way. And it's great to remove the need for phone calls or tickets or chats nowadays, right? So let's go back to these required actions; there are lots of them, so let's have a look here. In Authentication, for each realm, I can decide what people are required to do, or what is checked, when they log in: for example one-time passwords, maybe you want to have them confirm the terms and conditions, updating the password, updating the profile, verifying the email address (we send out an email with a link people can click on). That's very useful for public-facing registration.
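The e-mailed recovery link described above typically carries a token that binds the username to an expiry time and is signed so it cannot be forged or altered. A minimal sketch with the standard library; real products use vetted implementations, and the secret here is a placeholder:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # placeholder; kept server-side only

def make_reset_token(username, ttl=3600, now=None):
    """Token format: username:expiry:hmac(username:expiry)."""
    expires = int((now or time.time()) + ttl)
    msg = f"{username}:{expires}"
    sig = hmac.new(SECRET, msg.encode(), hashlib.sha256).hexdigest()
    return f"{msg}:{sig}"

def check_reset_token(token, now=None):
    """Return the username if the token is genuine and unexpired, else None."""
    username, expires, sig = token.rsplit(":", 2)
    msg = f"{username}:{expires}"
    good = hmac.new(SECRET, msg.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(good, sig) or int(expires) < (now or time.time()):
        return None  # forged or expired
    return username
```

Because only the server knows the secret, the link in the e-mail proves control of the mailbox without storing any state beyond the secret itself.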
WebAuthn is in there, people should be able to choose their locale, we want them to verify the profile, and I can enable those and maybe also have some policies for when and why. Then, in the realm settings, there is a tab called Login which configures the login screen, and I say: okay, from now on user registration should be enabled. For the forgot-password flow, I want to have a link there, to allow people to reset their passwords. And once I do this, when I sign out here, these fields have appeared: the forgot-password link is here, I'm asked for my username and email address, and I have a register button where I can register, with some fields that are required here. And if I then log in again and we go to the user profile... yeah, there we are. So this is the configuration where I can say: these are the fields that should exist, both for the admin to edit in the admin UI, on the user self-registration form, and for user self-management.
So basically you can think of this as a form configurator. For each of these fields I can go in and say: okay, there's an attribute name, a technical name I can reference it by later; a display name (here an automatically localized name, but you can put a simple name in as well); I have attribute groups, so we can group things on the form. For each field I can decide who can edit it, the user or an admin, and who can view it, the user or the admin. And then I can put lots of validations on each field, about the length: for the username it's a minimum length of three and a maximum length of 255 characters. For the username there are also some prohibited characters, so you should use regular keyboard characters. We also don't want to have any homographs in here: letters that look like real letters from the Latin alphabet, for example, but are actually from a very different alphabet. Otherwise you could have a user registering with a username that looks like an already existing username, which might lead to confusion; so that's a really sensible, good security default. And I can add more things here: I can also add annotations saying how this element should be formatted, should it be an input type, should there be helper text below the input, the size, the columns. I can also reorder those. So it's basically a form builder, and the form builder will be consistent in all three places: user self-registration, the admin form for users, and user self-management. Right, so when I go here, for example the blog field, I can change it here with a different display name, and once I go, as an admin, to manage my own account, I see: okay, now it's renamed, and there's another field here. And I can then choose when it is shown: is it mandatory on first login, is it maybe mandatory once a month, like you can have maybe a scheduled process that inserts those actions
on each login or once a month. And then I can see here how all these things configure my login flow and have this information populated by my users. So, yeah, we saw the recovery, we have seen the configuration, how we can configure those with validations and attributes and all necessary information, and again the three areas: on the left the admin view, in the middle the registration screen, and on the right the personal information the users can self-manage. All this information is either stored in Keycloak's local database or, if you choose to store it in an external service like LDAP, it will be stored in that external service. Right, so that's basically almost the end. We saw day one: single sign-on is cool, and it makes a lot of sense to not reinvent identity and access management even for a single application. Day two: you want to get more flexible and integrate with a lot of existing security infrastructure in your organization once you are a happy user of Keycloak. And then day three: it allows you a lot of automation around users when you really want to scale, especially if you want to scale with lots of users signing up on the internet, when they want to manage their stuff on their own and you don't want to get calls or emails from them. I brought some links: this is the Keycloak homepage, please pay it a visit, we have some docs there on how to install it. The Keycloak nightly release, I linked it directly; if you go there you can download the zip file and extract it, but there's also a container registry on quay.io, so you can have a container ready built with the nightly release. If you're on GitHub, please give us a star. There's the Keycloak book, second edition, published last year; if you've been using Keycloak maybe two or three years ago, you might know that it was based on EAP and WildFly; it has now moved to Quarkus, so some of the
things changed, so it might be good to look at this second edition. Something that is one of my very personal goals: I want to start a Keycloak hour of code, to get more people into contributing. I'm planning, maybe once a month, maybe every two weeks, an online session to get people familiarized with coding: how do we code in Keycloak, how do you maybe contribute documentation, how do things work around Keycloak. And at some point we also want to bring in the community to review issues, helping with triaging them; if another community member creates a pull request, maybe the community joins in and helps to get it to a level where we can merge it. That would take some weight from the shoulders of the maintainers, which would be great. So that's my thing for this year that I want to try out. That's me; I'm around for the rest of the day, so meet me here, meet me in the hallway. I also have some stickers of Keycloak, some postcards; if you want to sell Keycloak to your managers or friends or colleagues, send them a FOSDEM postcard with Keycloak on it. Thank you very much. All right, we might have like two questions or something. Yeah: what is the best way to configure Keycloak declaratively? So, usually you want to use the UI to figure out what's there and how it works, and then one way is to export the full realm as JSON and re-import it, so that's the full export, full import. There's also a Terraform, hopefully OpenTofu-compatible, Keycloak provider as well. And there's a REST interface, so you might use that API to configure it. And there's a command-line interface as well, but the command-line interface is basically a wrapper around the REST interface, so you can configure different settings of a given client, or maybe override a client with a new config. But it then depends on how
you want to do things. If you have the chance to, I don't know, delete it and re-import it, that might be very helpful for test environments. Or you might be more bound to an incremental, database-schema-migration style of doing things, where you really want one step at a time and always in that order, and maybe OpenTofu would take some shortcuts that might not work, but you want to have some migrations. So it depends on what you want to do, but the good news is it's all automatable. Just one question: how can Keycloak be beneficial in a Linux ecosystem? So, how can Keycloak be beneficial in a Linux ecosystem, like if you're logging in, say, with SSH somewhere? I haven't seen it that way, but it connects very well if you have, for example, Kerberos around; if you have Kerberos, I have it on my machine as well when I'm in a corporate environment, then Keycloak can leverage that. Okay, to repeat it for the video: there was a talk in 2023 at FOSDEM on passwordless authentication on Linux, here at FOSDEM, right. Okay, one note: there's a Red Hat SSO Ansible collection. Yeah, okay, so there's also a Red Hat SSO Ansible collection that allows you to configure Keycloak. Right, the old name: well, Keycloak is the upstream project, it's a CNCF project; there's also Red Hat SSO, the thing that you get with a subscription from Red Hat, where you find tools that work with that as well. And from the end of last year it's no longer Red Hat SSO but Red Hat build of Keycloak, so it's going to be easier to find in the future: whenever you search for something for Keycloak, it will cover both the upstream project and what Red Hat offers with a subscription. Okay, I think the time is up. Thank you very much.
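The declarative-profile validations described in the talk (minimum length 3, maximum 255, prohibited characters, rejection of homographs from non-Latin alphabets) can be sketched as a simple validator. The specific rule set below is illustrative; Keycloak's built-in validators differ in detail:

```python
# Characters disallowed in usernames in this sketch (illustrative choice).
PROHIBITED = set('<>&"\'`')

def validate_username(username, min_len=3, max_len=255):
    """Return a list of validation errors; empty list means valid."""
    errors = []
    if not (min_len <= len(username) <= max_len):
        errors.append("length out of bounds")
    if any(ch in PROHIBITED for ch in username):
        errors.append("prohibited character")
    # Crude homograph defense: restrict to ASCII so a Cyrillic or Greek
    # look-alike cannot impersonate an existing Latin-alphabet username.
    if any(ord(ch) > 127 for ch in username):
        errors.append("non-ASCII character (possible homograph)")
    return errors
```

A production system would allow legitimate non-ASCII names while checking for confusable scripts (e.g. via Unicode confusables data); the ASCII-only rule here just illustrates why the check exists.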
Beyond passwords: secure authentication with passkeys
Alright, so I had a talk yesterday, you missed it. So today I'm going to talk not about Passbolt, the open source password manager which we are building with my friends here, Kevin Clayton and Shmouty. I'm going to talk about passkeys, because I have the chance to be a FIDO Alliance member and to sit in and participate in the plenary conference and the CPSIG. You will see, FIDO loves acronyms: it means Credential Provider Special Interest Group, because we are a credential provider. So what is authentication, just so that everybody knows a little bit what we are going to talk about? Authentication is something you know, or something you have, or something that you are, like biometrics, or something you do; you can even have behavior-based authentication. Authentication these days is generally a combination of one or two of these factors. You know that passwords and password-based authentication have a lot of issues: the user selecting a weak password, people being able to brute force, phishing is a big one, and all sorts of other issues. Generally you can implement countermeasures to make sure that your authentication is good enough, but phishing is the one that is really hard to solve, because it depends on the user. You can solve, for example, weak password selection by introducing a credential manager, and you can prevent a little bit of the phishing with a credential manager, but you still have some room there. So, who has set up passkeys as a user in the room? Well, quite a few people. Who has, as a developer, implemented an authenticator, or implemented passkey authentication on a website? Yeah, three people. So we can see that it's still a new topic. So what is a passkey?
You will see that passkeys mean different things to different people, so I'm going to try to give you a 10,000-foot view of passkeys, and not go too deep into the protocols and the options of the protocols or particular implementations, but just give you a high-level view of the landscape, something I would have liked to have when I started working on this, because it's really tentacular: there are a lot of options and a lot of different views. So, the official definition: a passkey is a password replacement. They are public key / private key pairs that are used for authentication using cryptographic signatures. So basically a site gives you something to sign, you sign it, and you prove that you're you, using an authenticator. Passkeys are user credentials that are discoverable, so it's possible for the browser to know if you have a passkey for a given website, for example. And because in the browser the JavaScript is served by a website, it means that the website can also discover whether you have credentials. These passkeys are stored within applications or security keys, and they may be synced across devices. This is the new stuff; in the previous talk we were talking about device-bound passkeys, the passkeys that sit on devices, but there is now a new class of passkeys that can be synced across devices. So this is the lay of the land: depending on who you ask about passkeys, they may be thinking about device-bound passkeys, passkeys that are on physical devices; or if you ask Google and Apple, they will talk about synced passkeys, which are basically keys that can be synced across multiple devices. You can, for example, have them on your laptop and on your phone, or you can transfer them, or do an attestation using your phone while you're trying to authenticate on your laptop.
And these passkeys are supposed to be exportable and transferable, but in practice they are only transferable within a given ecosystem. So, for example, Apple will not let you export passkeys to Windows. They are advertised as being interoperable, but because these vendors are not coming from the open source world like we do, interoperable means different things for them. There is also another class of passkeys, which is called up-level passkeys, which generally live alongside the device-bound passkeys, meaning that they can be used for other things, or have additional properties added to them. Typically you'll see them in banks: for example, a bank application will use passkeys to sign transactions, or they will use additional signals to unlock a passkey, something that is not there with a classic authenticator. For example, they will check your location or your working hours; you can use all sorts of different signals. And you can build a custom authenticator, so you can build the UI that you want; you don't necessarily have to follow the OS or the physical device design. So there are a lot of different requirements. We've seen that passkeys mean a lot of things. On one side you have people working at the enterprise level, people that want hardware keys that are very strong and that you cannot export. On the other side you have Google and Amazon, for example: they want the frictionless experience, so they are ready to trade a little bit of security for usability. For example, when you're doing a checkout on Amazon, they want you to go as fast as possible through that checkout and pay; even if there is a security issue, they will be okay with giving you back your money. But a bank does not have the same mindset. So on one hand you have passkeys that require certifications.
The banks will check: is this the authenticator that I gave you to authenticate? Is that really your personal device? And on the other side you have a website like Google that just wants to show, okay, you authenticated with a YubiKey, but doesn't really care which one it is; it's just to present you, okay, you have these passkeys and this is the kind of authenticator you are using. On the enterprise side, for a super-high-security setup, they will issue you a security token that is just for you. And you can see there are some privacy implications there. If everyone was using a device that is unique to each of us, and we logged in on every website with this device, you would be able to do cross-domain tracking: you would be able to see, okay, this guy logged in there and then he logged in there. That is obviously a privacy issue. So on one side you want privacy; on the other side you want basically no privacy, because you are in an enterprise setup. All of these very different requirements are fitting into the same standard, so it's a little bit complicated to know what's going on. But the common denominator is phishing resistance, the fact that a passkey is domain-bound. You have one passkey for Gmail, one passkey for AWS, and you never reuse the same passkey twice. And it always requires HTTPS; they made the choice, which is very wise, of no support for HTTP. The FIDO2 project is a project of the FIDO Alliance. The FIDO Alliance contains Google, Amazon, Visa, but also Thales, you know, people making security devices. And on the other hand you have the WebAuthn protocol, which is managed by the W3C. You'll see that the people working in the FIDO Alliance are also part of the W3C; it's the same people. For example, the person from Google is the same on both projects.
And together this is called the FIDO2 project. So on one side you have the W3C, which manages the WebAuthn protocol, and on the other side you have the FIDO Alliance, which manages CTAP, the Client to Authenticator Protocol. The relying party is the website you're trying to authenticate to. It uses WebAuthn over HTTPS. Then you have the client, which is basically your browser and the JavaScript application running in it. And you have the authenticator. The authenticator can be the OS platform, it can be a device like a YubiKey, it can be anything that is FIDO approved. And these days it can even be a credential manager: Bitwarden, Dashlane, or 1Password. The interface from client to authenticator is a bit messier than WebAuthn. It works over Bluetooth, and it works with what everybody calls monkey patching: if you want to integrate in the browser in JavaScript and become an authenticator, for example as a password manager, you just hijack the JavaScript APIs and replace them with what you want. That technique is called monkey patching, and it's the only way for a browser extension, for example, to act as an authenticator. But you also have proprietary protocols: for example, when the Google Chrome browser wants to use the Google authenticator, you don't know what's happening underneath; they are using their own stuff. So I hope that's clear and gives you a high-level view of what we're talking about. There are two ceremonies. There is the attestation ceremony, which is the registration, and there is the assertion ceremony, which is the login. There are no other operations. For example, you cannot list the passkeys available for a given relying party, and you cannot delete passkeys. These are not part of the protocol and need to be implemented separately; they are not normative.
We will see that this causes some issues. So, the attestation ceremony: you have the client, which basically posts a username. This part, posting the WebAuthn attestation options, is not normative; you can do whatever you like. As long as you send a username, the URL doesn't matter. It's for the relying party to decide what format it wants to use and what the URL is. Recently they introduced a .well-known/webauthn file that you can place on your web server to say, okay, this is my attestation URL. Then the relying party replies with the PublicKeyCredentialCreationOptions, which include the RP (basically the ID of the relying party), the challenge the user needs to sign, and some other options. For example, in the previous talk people were asking: do you check for user presence, do people have to enter a PIN? This is the moment where the relying party can say, okay, I want to use this algorithm and I want to check the user that way; I will require user presence, I will require you to do user verification. And then the client does basically what it wants. From there, the client calls the navigator credentials API. There is no dedicated WebAuthn API; we use a JavaScript API, the Credential Management API, which can be used for other things but these days is mostly used for the WebAuthn protocol. And then we enter CTAP or something else, possibly a proprietary protocol, but here I show CTAP for clarity because it's the one that is most well defined. Again, you send some data about the RP and the user, and the authenticator will check the parameters and see if the crypto operation is supported: you're asking to use a particular type of key, can it create such keys? Then it collects the user gesture, so either a touch or entering the PIN, and then it generates the credential and generates the signature.
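As a rough sketch of what that relying-party reply looks like, here is a minimal PublicKeyCredentialCreationOptions builder. The field names follow the WebAuthn spec, but the helper name, the chosen algorithm (-7, COSE ES256), and the verification policy are illustrative choices, not anything prescribed by the talk:

```python
import os
import base64

def make_creation_options(rp_id: str, rp_name: str, user_name: str) -> dict:
    """Sketch of the options an RP might return at registration time.

    Field names follow the WebAuthn spec; concrete values are
    illustrative (a real RP would persist the challenge and user id)."""
    challenge = os.urandom(32)  # random challenge the authenticator must sign
    return {
        "rp": {"id": rp_id, "name": rp_name},
        "user": {
            "id": base64.urlsafe_b64encode(os.urandom(16)).decode().rstrip("="),
            "name": user_name,
            "displayName": user_name,
        },
        "challenge": base64.urlsafe_b64encode(challenge).decode().rstrip("="),
        # -7 is COSE ES256 (ECDSA P-256 with SHA-256), the most common choice
        "pubKeyCredParams": [{"type": "public-key", "alg": -7}],
        # This is where the RP states how the user must be checked
        "authenticatorSelection": {"userVerification": "required"},
        "timeout": 60000,
    }

options = make_creation_options("example.com", "Example", "alice")
```

The client would pass a structure like this to `navigator.credentials.create()` in the browser.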
It will return the attestation statement and the authenticator data, and the client will send this information over to the relying party. The relying party will check whether the key is valid and will check the signature: is it valid for that particular key? And it will check that the RP ID is also matching, so that a client cannot reuse a request from another website. That's how we keep the property of a domain-bound process. The assertion ceremony I'm not going to detail again, but it's pretty much the same thing, except you're not providing a new public key; you're just signing with a key that is already there on the authenticator. And then, what about account recovery? Obviously, if you lose your device or you lose your passkey, what do you do? There are two sides to account recovery. There is account recovery on the RP side. Basically, the solution to passkeys is more passkeys. This is convenient if you have device-bound passkeys, because then you need to buy more devices; it makes a lot of sense when you're selling devices. And you can also use passwords. Generally a website like Amazon will let you keep a password but will propose passkeys on top, so you default to passkeys but you can still use your password, or a magic link, for account recovery. So passkeys are only as good as the account recovery mechanism; we're kind of back to square one. Unless you get rid of those fallback methods for account recovery, you're not really changing your security posture, in my opinion. On the authenticator side, it's a little more complicated. I think Apple recommends you have several Apple devices, which makes sense. And then you can also set recovery contacts, and there are custom procedures. For example, on iCloud, if you lose all of your devices, it's actually possible to do a recovery without them, and it's quite smart.
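The RP ID check mentioned above is what makes a passkey domain-bound: the first 32 bytes of the authenticator data are the SHA-256 hash of the RP ID, and the relying party compares them against its own ID. A minimal sketch (the helper name is mine; the byte layout is from the WebAuthn spec, and the sample data is simulated, not a real authenticator response):

```python
import hashlib

def rp_id_hash_matches(authenticator_data: bytes, expected_rp_id: str) -> bool:
    """The first 32 bytes of authenticator data are SHA-256(rpId).
    A response produced for another site's rpId will not match, which
    is the mechanism behind phishing resistance."""
    return authenticator_data[:32] == hashlib.sha256(expected_rp_id.encode()).digest()

# Simulated authenticator data for "example.com": rpIdHash + flags + counter
auth_data = hashlib.sha256(b"example.com").digest() + b"\x05" + (0).to_bytes(4, "big")

assert rp_id_hash_matches(auth_data, "example.com")
assert not rp_id_hash_matches(auth_data, "evil.com")
```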
The problem we have in the open source world is that we don't have such a service that is ubiquitous, where everyone has an account. If we are, say, Ubuntu and Firefox, we don't have that kind of infrastructure to exchange such an escrow mechanism, and that's going to be a challenge, I think, moving forward. So how does it look from an authenticator point of view? It's a work in progress and there's a lot of change, so maybe by the time I put this slide together it was already outdated. This is an example on macOS with Chrome. By default, when you click continue, Chrome will prompt you to use the Google authenticator. You have the impression that you're using the OS-level authenticator, but you're actually using the Google authenticator, which is leveraging the OS APIs to provide that experience; it's not the Apple authenticator. You can see it's already kind of sneaky: if you're using Chrome, they will prompt you to use the Google authenticator. You do have other options; for example, you can use a phone or a security key, but you can see it takes more clicks if you want to use something that is not Google. Then you scan a QR code or press your security key and you get the same result. You can even do a two-device ceremony, where you scan the QR code, unlock your phone, the signature happens there, and it's exchanged over Bluetooth Low Energy; there is no pairing between the laptop and the phone, and you are authenticated using that mechanism. So it's possible, for example, to use an authenticator on an Android phone to log in on a Windows device. If you use Firefox, you start directly with the Apple authenticator.
And it's the same on Chrome: if you use that option that was on the previous screen, you switch authenticators. And frankly, I don't expect people who are not knowledgeable to understand what's going on. Likewise, depending on the options provided by the RP, your mileage and user experience may vary, so it will be quite confusing, I think, for the average user. It's the same if you want to manage your passkeys: they are buried. It's really hard to see how many passkeys you have and where they are registered. Same on iOS: if you want to manage your passkeys, you need to go to Passwords, and you need to click on a password to see that it's a passkey. So, passkeys: we've solved the password problem, we can all go home, right? That's it, mission accomplished, as George would put it. But no, we still have a lot of issues to solve. There is what happens when you lose devices, especially when you don't have a sync fabric that is common across different authenticators. And there is no real work being done on passkey management and review, as we have seen. For example, as we saw in the previous talk, with quantum computers coming, we will need to roll out new algorithms, and maybe we will need to change algorithms faster than we have in the past. So we will need to revoke keys. In order to revoke keys, we will need to design an experience where the user understands: okay, this key is using an old algorithm that is no longer supported, you need to create a new passkey. We will need a user experience for managing passkeys that is understandable by the average Joe, and we are very, very far from there. And it's the same for developers: for a developer implementing an RP, understanding all the different options and what they mean is hard.
It's quite tentacular. And you can't, for example, just copy Google's implementation, because Google does not care about user enumeration: you can already send an email to a Gmail address and see whether it bounces, so user enumeration is already possible there, and they simply don't implement the best practices to prevent it. But for you, for your use case, maybe it's important. So you can't even follow what the big players are doing; you need to do your own research and find out. I think this can lead to problems down the line, and we will need to do a lot more education on the security problems around passkeys. There are some other issues: as you've seen, the user experience is quite fragmented, and it will not be the same on different OSes and different authenticators. And there is an entry barrier for authenticators. For one of the few open source projects in the FIDO Alliance, it costs around 50K a year to be in the room when these things are being normed, so it's basically a pay-to-play initiative. In my opinion, for something that is supposed to replace passwords and be that ubiquitous, that's an issue. I think Firefox has a seat at the FIDO Alliance, but last year they didn't have the staff to be there. I think that's an issue. And there's a lot of proprietary protocol and monkey patching happening, and we need to do much more normalization. I invite you to get involved and be interested in this, because if you don't act on it, they will make the decisions for you. That's it. Do you have five minutes? Yes? All right. Yes? Question: the complications you just mentioned, are they also true for, let's say, a software service that wants to offer passkeys to its users? Do they also have to deal with all this complexity? Is there maybe a simpler way? Yes. So, do RPs have to do their homework to understand the issues around passkeys? Yes.
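Returning to the user-enumeration point from a moment ago: one common mitigation is to answer login-start requests identically whether or not the account exists, for example by deriving a stable fake credential ID for unknown usernames. This is a sketch of that idea under my own assumptions (the secret, user store, and field layout are all hypothetical, not from the talk):

```python
import hashlib
import hmac
import os

SERVER_SECRET = os.urandom(32)  # hypothetical per-deployment secret
USERS = {"alice"}               # hypothetical user store

def login_challenge(username: str) -> dict:
    """Return an assertion challenge whether or not the account exists.

    For unknown users we derive a *deterministic* pseudo credential ID
    from the username, so repeated probes see a stable, plausible
    answer and cannot distinguish existing accounts from missing ones
    by the shape of the response."""
    if username in USERS:
        cred_id = b"real-credential-id"  # would come from the user's record
    else:
        cred_id = hmac.new(SERVER_SECRET, username.encode(), hashlib.sha256).digest()
    return {
        "challenge": os.urandom(32).hex(),
        "allowCredentials": [{"type": "public-key", "id": cred_id.hex()}],
    }

known = login_challenge("alice")
unknown = login_challenge("nobody")
assert set(known) == set(unknown)  # identical response shape either way
```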
And the issues are not super easy to grasp. For example, let's say you're building a centralized service for your enterprise that authenticates people using passkeys, doing one passkey authentication on that domain and then switching to another protocol. Maybe you're using an iframe, and in the case of passkeys you may have issues with UI redressing, so you need to take care of these issues, and for that you need to read the specs. I think there will be more education and more accessible resources over time, and I think we need tools for developers so they don't trip over all of this. Same with the question of which algorithms you should support: we've seen there are a lot of legacy websites. Maybe they start creating keys with a certain algorithm, but two, three, four, five years down the line, when quantum-resistant cryptography becomes something you have to do because the state tells you to, then what happens with those keys? Question: you said earlier that the relying party can require, for example, user presence. How does the relying party know that the client actually did that? So: how does the client ensure that the authenticator is doing what the RP is requesting? I think it depends on the option, but for most options the client can, for example, ignore that the user needs to be verified; however, with the data that is sent back as part of the assertion, you can see what the authenticator actually did: did it verify the user or not? So it is up to the RP to verify. You say to the client, I want the user to be verified, and when you get the response, you check: did the user get verified, yes or no? So you say "I want this", but you need to check whether it actually happened when you get the final assertion. Makes sense?
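That final check can be sketched concretely. In the authenticator data, the byte right after the 32-byte rpIdHash carries the flags, including "user present" (0x01) and "user verified" (0x04); the RP inspects that bit rather than trusting that its request was honored. The helper name is mine; the bit layout is from the WebAuthn spec, and the sample data is simulated:

```python
import hashlib

FLAG_UP = 0x01  # user present (touch / gesture)
FLAG_UV = 0x04  # user verified (PIN / biometric)

def user_verified(authenticator_data: bytes) -> bool:
    """Byte 32 of the authenticator data is the flags byte. An RP that
    requested userVerification must check the UV bit in the response
    instead of assuming the client and authenticator complied."""
    return bool(authenticator_data[32] & FLAG_UV)

rp_hash = hashlib.sha256(b"example.com").digest()
with_uv = rp_hash + bytes([FLAG_UP | FLAG_UV]) + (1).to_bytes(4, "big")
without_uv = rp_hash + bytes([FLAG_UP]) + (1).to_bytes(4, "big")

assert user_verified(with_uv)
assert not user_verified(without_uv)
```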
Question: yeah, but say you have a password manager; someone says "please make sure the user is present", and it signs the challenge and claims the user is present without actually checking, because it can do that. Yes. So there's no way for the website to know what the password manager really did. Yeah, that's where the FIDO certification comes into play. Nobody wants to be caught out, so it's also a gentleman's agreement that you're going to respect. But I suspect that some do-it-yourself people can make their own authenticator that does whatever they like. A bank, however, may refuse an authenticator that is not FIDO certified; so it also depends on the RP. Question: because you can assert which authenticator is being used from the attestation? With WebAuthn, in the response, do you know who certified the authenticator? Yes, you have information about the authenticator. At a high level there is what they call the AAGUID, which is basically a globally unique ID that says, for example, you're using a YubiKey 5. But you also have an older, more thorough way, where you have an actual certificate and a signature. This is stored in the MDS (Metadata Service) of the FIDO Alliance: if you are, for example, a bank and you want to make sure it's not somebody pretending to be a YubiKey, you can take the root certificate and check the attestation against it. One minute? So, to summarize: there are two levels. One is the AAGUID, which is like a user-agent kind of thing that most RPs use, and another, more complex one that involves cryptography and signatures. Question: we've seen some examples within the big ecosystems; what about the rest?
Is there anything like that, for example, for Linux? On the Linux side, to my knowledge, not much is happening. I think there is a talk where the GNOME team was going to present what's happening on the Linux side, but it's way behind compared to Microsoft Hello, Apple Keychain, and Google. I've not seen an open source one, but you have credential managers, for example Bitwarden or Dashlane, that can basically bridge the gap in an ecosystem where there is no OS-level support. Thank you very much. Thank you.
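The AAGUID lookup mentioned in the Q&A can be sketched as well. At registration, the attested credential data inside the authenticator data carries a 16-byte AAGUID (after the 32-byte rpIdHash, 1 flags byte, and 4-byte counter) identifying the authenticator model, which an RP can look up in the FIDO MDS. The helper and sample data are mine; the byte offsets follow the WebAuthn spec:

```python
import hashlib
import uuid

def extract_aaguid(authenticator_data: bytes) -> uuid.UUID:
    """During registration, bytes 37..52 of the authenticator data are
    the AAGUID identifying the authenticator model (32-byte rpIdHash +
    1 flags byte + 4-byte counter come first)."""
    return uuid.UUID(bytes=authenticator_data[37:53])

# Simulated registration response for a hypothetical authenticator model
model_id = uuid.uuid4()
auth_data = (hashlib.sha256(b"example.com").digest()
             + b"\x45"                 # flags: UP, UV, AT (attested data present)
             + (0).to_bytes(4, "big")
             + model_id.bytes
             + b"...")                 # credential ID length and ID would follow

assert extract_aaguid(auth_data) == model_id
```

A self-attesting or privacy-preserving authenticator may report an all-zero AAGUID, so RPs cannot rely on it alone; the certificate-based attestation path is the stronger check.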
Automated Integration of FreeIPA with AD and External IdP
So let's start with the next talk. Please welcome Thomas Werner, talking about ansible-freeipa. Hello. So, let's go. This talk will be about using ansible-freeipa for AD integration, that is Microsoft Active Directory, and also to set up external identity providers. The plan was to have a live demo here, but there have been some issues, so there are slides, and there will be an online demo later on the web page. For the automated FreeIPA deployment I have been using the work of my colleague Raphael, and we had to change some things in the inventory to make it work, especially in my environment. If you want to do it on your own, it's used as the base for the whole presentation. These are the steps to do, and one important thing: please fix the time and time zone on all the machines, otherwise you will have fun with Kerberos: invalid tickets, tickets that are in the future or in the past, and so on. Not fun. The first step is to get the Windows machine you want to use. There is nice documentation on this web page from Raphael: the different steps you need to do, where you can get the images, which images work, and so on. The first thing we need to do in the Windows AD setup is to change the Windows AD setup playbook to disable IPv6, because if we do not, we will have lots of fun with DNS later on. This was one of the most important things. Then we come to the setup of the IPA server. For the IPA server, as we wanted to have a replica deployment later on, we also needed to enable DNS and automatic reverse zone creation. Sadly, there is an issue with the automatic reverse zone creation later on, but it can be fixed manually. And there is another issue with DNS and Windows, so you need to disable DNSSEC validation. In the lab; that's a lab setting. You will find out whether it works for you or not. Then you can simply do the steps that are on the web page.
So, first the IPA setup, and then there is a nice test to make sure DNS is really working on both sides. This is the nslookup test: it tries to find the Kerberos TXT records on the Windows side and the Linux side, on both ends, so it verifies that everything is working. The last step is setting up the trust. I'm not adding the details here because they're completely unchanged from the script. And after we've done that, we can log in with the AD administrator on our Linux server. You can see I can log in, I have a ticket, I can get my AD information, and then I try to make a change in IPA. And it says: invalid credentials. Okay. But we have a solution for that, a feature in ansible-freeipa that was added recently: we can grant the AD administrator the right to act as an IPA administrator. The first step is adding an ID override, which is needed to be able to use the AD administrator, and the second step is adding this ID override for the administrator to the admins group, to make sure this user has admin rights in IPA. After we've done this, we can directly add a user, remove a user, everything: hosts, users, whatever it may be. The AD administrator is an IPA administrator. The next part was: okay, let's try to do a client deployment using this AD administrator. The inventory file needed to be changed a little bit. There's a client setup, and there is also a setting, I don't know if you know about it, to configure the DNS resolver. This is a feature of the ipaclient role to set up the client so that it uses the DNS server you configure here; this is the IP address of the DNS server that was used. So the first step of the ipaclient role is to set up NetworkManager or systemd-resolved or resolv.conf so that you are able to use that DNS server directly; it's not needed to do this manually.
And also, if you do an unconfigure, it will remove the setting again if you set the variable; it does this automatically. The next two lines force it to use the AD administrator. And there is one important thing here: you need to write it correctly, meaning a capital A in Administrator and the domain capitalized too. Otherwise, it will not work. If you log in interactively it works without, because there is a rule for that, but here there is nothing, so you need to write it correctly. This is the first issue: why isn't it able to find the AD administrator otherwise? But okay. With this, we are able to deploy the client, and it works afterwards. The next thing you see here is the playbook to deploy the client. It's the normal thing you see in ansible-freeipa; there is a playbook for this, so you can simply consume it. So, the client was easy. What's next? The replica. But for the replica, we ran into an issue with both the command line and ansible-freeipa. There is currently an issue in the replica connection check: it tries to use the admin account, and of course the password is not valid, so it fails. We will find out what exactly the issue is so we can solve it. It affects the command line, so the FreeIPA package itself, and also ansible-freeipa; it doesn't matter which one you use to deploy, they will both fail. But there is a temporary workaround: disable the replica connection check. But then make sure everything is working: DNS needs to be working, and the reverse lookup needs to be working too, otherwise it will also fail. The next step is simply to deploy the replica, and then we are there: we have a working replica. We can use it also to deploy clients, also using the AD administrator account and so on. So, we have some issues and we will work on them in the near future, but they are relatively small in my opinion; it could have been worse. And now we come to the second part.
A colleague of mine wanted to present this part, but he was not able to come. The second thing we added in ansible-freeipa was the possibility to configure and use an external IdP. There will be another talk later on about external IdPs that goes into much more detail, so any open questions here might be answered by that one. FreeIPA has modules for external IdPs; a new module was added to ansible-freeipa, and support for external IdPs was also added to the user module and so on. So we can configure FreeIPA as an OAuth application on GitHub; this is the example I will show here. Let's go directly to it. We are creating a GitHub OAuth application as the first step, because this is needed to be able to configure an external IdP with IPA. The steps: simply go to your GitHub, go to Developer settings, OAuth Apps, register a new application, and read the docs. It will ask for several things: the application name, the homepage URL, which here is also the authorization callback URL, so you should put the same value in both; this is the IPA server URL. Please also add a description, to be able to find it later on. And enable device flow: this is needed for IPA to be able to handle this at all, so it is very important to enable. Then click register application. Once you have done so, you will get a client ID and also a client secret. It's very important to keep those secret, but note: you need both of them in the next step for the setup of the external IdP, and there is no way to retrieve the client secret again later. So either write it down or make a screenshot, whatever, but keep it in a safe place. Once you have those settings, you can go to ansible-freeipa. Here you see we are simply using them in plain text, but you can also use Ansible Vault for that, so that you don't have the passwords here in the clear. The same goes for the IPA admin password.
It's in plain text here simply to make it easy for us to see what's going on; otherwise it would be a little bit cryptic. So, this simply creates and sets up the external provider. In the next step, we need to retrieve the GitHub user ID. Oh, one thing to note here: the IdP user ID is set to the numeric ID. There is another way, but this one is much better, because with GitHub it's possible to reuse login names. So it's really good to use the numeric ID for authentication, because then you will not run into a possible name clash later on. This is a common problem with many IdPs: if you delete a user and, after some time, another user registers the same visible user name, that new user basically squats the previous one. Many of those providers run something like 90 days of protection on deleted accounts: even if you delete an account, you cannot register that name again. But eventually that expires, so somebody can squat your account this way. If you've configured your systems to trust whatever the user name in the system was: good luck, you will be hacked a year later. So taking these other fields into account is very important, and it is part of the administrator's job to design this. Unfortunately, all these fields are not visible in the UI, so a normal user cannot see this information; it's the admin that needs to discover it. So, use it this way: we retrieve the GitHub user ID and store it here, and in the next step the IdP user ID uses this retrieved user ID. The unfortunate thing is that, sadly, "IdP user ID" here and "IdP user ID" there are not the same: one is a user ID, a number, and the other one is really a user name. So be careful and read carefully. The thing is, ansible-freeipa tries to use the names from IPA itself, so you will see the same naming issues in ansible-freeipa that you see in FreeIPA itself. And after we've done this, the user is able to authenticate.
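The numeric-ID lookup described above can be sketched against the GitHub REST API, which returns the stable numeric account identifier as the `id` field of `GET https://api.github.com/users/<login>`. The helper names are mine, and the sample response body is abbreviated (though `octocat`'s real numeric ID is 583231):

```python
import json
from urllib.request import Request

def user_info_request(login: str) -> Request:
    # GET https://api.github.com/users/<login> returns account metadata
    return Request(f"https://api.github.com/users/{login}",
                   headers={"Accept": "application/vnd.github+json"})

def numeric_id(response_body: str) -> int:
    """Extract the stable numeric `id`. Unlike the login name, it is
    never reassigned, so a deleted-and-resquatted login cannot
    impersonate the original account."""
    return json.loads(response_body)["id"]

# Abbreviated sample of what the API returns for GitHub's demo account
sample = '{"login": "octocat", "id": 583231, "type": "User"}'
assert numeric_id(sample) == 583231
```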
So, the user needs to get the code, and with the code it's possible to log in. And that's it. Thank you. So, we have about six minutes. Do you have questions? Yes, please. Question: please scroll to the beginning of your presentation, where you describe the server setup. This one? Yes. Well, I hate DNSSEC myself, but why do you disable DNSSEC validation? Because if you do not disable it, when the IPA server gets a reply from the Windows DNS server, it ignores it. Maybe we should have said that this is about the lab setup. If you have a lab where you haven't configured DNSSEC in the Windows DNS server setup, which is not the default (they don't set up DNSSEC), then your lab is basically disconnected from the Internet and doesn't care about DNSSEC. So either you add DNSSEC to the AD DNS in that lab, or you don't. Thomas chose the easiest way to handle DNSSEC in the lab. But IPA configures DNSSEC validation by default if you use the internal DNS server; that's why it was forced to disabled validation here, because it's enabled on the IPA side. Either the Windows setup in the lab needs to gain a DNSSEC configuration, or both sides need to drop DNSSEC validation. Sorry, I don't get it: if you implement DNSSEC validation as in requiring signed DNS records, that's something weird; otherwise, if you just don't get any signatures, you have nothing to disable. Well, the BIND server, which IPA uses as its internal server, has DNSSEC validation enabled by default. You cannot switch it off unless you explicitly say to switch it off. Yes, but if there is no signature, it should not reject anything. It does check, and it does reject the request; it rejects unsigned answers. But this is just a lab story. Please do not disable DNSSEC validation in the wild, unless you know what you are doing and are ready to pay the consequences. Okay, thanks.
Yeah, I think the reality is that a lot of real-world AD setups don't have DNSSEC enabled, because nobody ever turned it on. In many cases, people are using cloud-based DNS services, so it's not a DNS server on your own infrastructure; it's one of the DNS servers provided by, I don't remember the names of those companies, and those typically do not have DNSSEC enabled for the whole zone that the company rents from them. Okay, yes, just a second. Another question: does this also work with Samba? Does the external identity provider work with Samba as well? I think I need to answer this. Thank you. One part of this is using an AD user to manage IPA; that will work with Samba AD, because it is just a normal trust between IPA and Active Directory. The external identity provider only applies to IPA users, because IPA only authenticates users that are in IPA; AD users are authenticated by the Active Directory domain controllers. Microsoft's implementation of Active Directory does not have a Kerberos pre-authentication method that supports anything like this at all. Completely. Same with Samba AD: Samba AD built with Heimdal has no way to handle this; Samba AD built against MIT Kerberos has a theoretical way to handle it, but it's not implemented, and it's in my and Andreas's plans to complete this work on the Samba AD side. There will be more about it in our talk, which will be the last talk of the day. More questions? I hope that was captured by the mic. I hope. Yes? What's the difference between this kind of integration and the one you spoke about in the morning? Wow, good question. Maybe I can answer this one. This here is basically ansible-freeipa. Here we have an AD integration: this ansible-freeipa setup is simply used to create and establish a trust with AD.
The one from my presentation is basically a containerized service that is capable of connecting to the LDAP of AD, to make requests to LDAP with python-ldap, something like that. The one from the morning had, at its center, the client enrolled into AD or IPA and providing services to web applications, which is the key difference in this case. Any more questions? We have time. Thank you.
Connecting IBM AIX to Red Hat Identity Manager (FreeIPA)
Thank you. So, you know, if you work a lot with green screens, after some time you cannot distinguish between green and yellow. That's what happened to me, and we are talking about IBM AIX, which is usually a green screen. So we are here in the devroom, and first: I'm not a developer. I'm a classical system administrator. You can tell me I'm a DevOps engineer, because I can code, in C, in Google Go, in Rust, in Python, in Ruby; I've used everything. And I ported a lot of tools to IBM AIX and to Linux on IBM Power and on OpenPOWER. I'm not an IBMer. Usually IBM gives the talks about IBM Power; I'm not IBM. I'm an IBM Champion. If you know what an AWS Hero or a Microsoft Most Valuable Professional is, this is the similar program from IBM. I'm not a Red Hatter either; I don't have a red hat. But as you see, I'm a Red Hat Certified Engineer and a Red Hat instructor. That's why some of this talk may sound to you like training material; it is not. And we're here at FOSDEM: beer, open source, hackers. What does that have to do with AIX? AIX is a proprietary, closed-source Unix operating system. So this is my attitude to it: real hackers don't need source code. I was born in another time, in another country, and we mostly didn't have access to source code. And a real hacker is someone who can understand how a program works without looking into the source code, and who can change it without looking into the source code. So, who here uses AIX today? Nobody? You are all wrong. Do you have a bank account? Then you're an AIX user. Do you have insurance? You're an AIX user. How did you come here, by car, by train, by flight? Everyone uses AIX. Retailers, manufacturers: if you bought something, it was processed on AIX. Do you have this thing? No, you don't have AIX on it.
But in the back end it is a database on AIX that processes all your orders and sign-ons and so on. So I added a marketing sheet, because nobody knows what IBM Power is. Sheet with a long e, not with a short one. I just want to say that this is my favorite machine, the IBM Power E1080. It has 1,920 logical CPUs and 64 terabytes of memory, and you can have all of that in one partition, in one virtual machine. I did it once; the first time I made a virtual machine with 640 CPUs. I wanted to look at what my CPUs were doing. And you know, even if you have, let's say, 40 lines in your terminal, 640 CPUs is, how many, 16 pages of CPUs. So it takes some time to page through them. But there are some funny facts about Power which are not very well known. Fun fact number one: zero successful data breaches. This is from 2022. I don't think it's because AIX is so secure; AIX is not so secure, it's like any other operating system, and most system administrators cannot use it correctly. But the other side of the story: nobody knows about it, so it's difficult to break into it if you don't know how to use it. Fun fact number two: for 14 years it has been the most reliable major server platform. Somewhere here there's a typo. It's really reliable; to bring it down is next to impossible. It just works. It can run for years; you can forget about it, it will work anyway. And I like this fun fact about performance, because the P in Power is performance. You see here the IBM eServer p5, the fifth generation of Power servers, from 2005. It did an SAP benchmark, around 8,000-something. Eight years later a Fujitsu SPARC could do almost the same. Last year the latest and greatest Dell server with the latest and greatest Intel CPU outperformed it by just 1%. And you know what the most powerful servers in this benchmark are right now: the first three places are Power.
Before going further, we have to talk a little bit about AIX and what we should understand about it: what makes working with AIX so easy and so difficult. It's a really standards-driven operating system. Everything implemented in AIX is standardized, implemented according to standards. But you know, standards can be interpreted a little bit differently; it usually depends on the developer who implements the standard. One of the most important things is binary compatibility. If you ask any AIX admin, they will say binary compatibility is the most important thing, because yes, on my most modern AIX server I can run a binary that was compiled 20 years ago. I've done it with binaries even older than that. The other side of this binary compatibility: you don't innovate. You have it, it works, so why should you do newer things? AIX is not BSD-based and not System V-based; it's OSF/1-based, if someone remembers OSF/1. That was the end of the eighties, beginning of the nineties, when IBM, HP and Digital united to make a new standard in the Unix world, and they made OSF/1. And of course, because not everything can be standardized, it has some unique features. So let's go to authentication. PAM, everyone knows it; everyone uses PAM on Linux. AIX has support for PAM. Everything is good? No. PAM originated in Solaris, yeah, and it can be used on AIX. But AIX uses the old Solaris implementation of PAM from the end of the nineties. And it's a real pain in the ass, sorry, to port a PAM module from Linux to AIX. I tried to port Azure AD authentication to AIX, and I failed; after one week I said no, I will not do it, because of the differences between the APIs on AIX, the old PAM interface versus the newer interface. But AIX has something different, called the Loadable Authentication Module, LAM. This is an original AIX idea for doing almost the same thing, and it was done even before PAM.
Five years before PAM, I think, they did LAM. Almost the same, but a little bit different. It's AIX-only technology and very popular in the AIX world. Again, not because it is the best technology; usually because system administrators don't know anything else, so they use LAM. It is there by default. And its biggest "feature": there is almost no documentation on how to use it or how to develop for it. The first time I developed a LAM module, I used the Samba source code to understand how it works, because Samba had a LAM module for AIX and IBM didn't provide anything. So it's not really PAM versus LAM; they work together. We can have application one using PAM and application two using LAM. It's flexibility. We can have user one using PAM and user two using LAM on the same system; we can do everything we want. Every user has 50 attributes, different attributes we can configure. It's not like on Linux where you have home directory, password, user ID and so on; in AIX you have 50 attributes. You don't have to use every attribute, but you can, and you can configure, for example, a different password policy for different users based on different dictionaries and so on. But even worse, you can configure PAM to use LAM for authentication, and configure LAM to use PAM for authentication. Usually it's good that AIX administrators don't know about this feature, because you can get into a real, I don't know how to say it: you will be waiting 20 years for authentication to complete, because LAM will consult PAM and PAM will consult LAM. So now let's go a little bit into the details. The first thing we can choose in the configuration: do we use standard authentication, which is LAM, the Loadable Authentication Module, or do we use PAM authentication? We configure it in a normal config file, and the standard value is standard authentication. In our case we leave it as standard authentication.
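The switch he is describing lives, to my knowledge, in /etc/security/login.cfg; the stanza below is a sketch, so verify the exact attribute values against the documentation for your AIX level:

```
* /etc/security/login.cfg -- pick LAM ("STD_AUTH") or PAM ("PAM_AUTH")
usw:
        auth_type = STD_AUTH
```

With STD_AUTH, login and friends go through the Loadable Authentication Module path described in the talk; with PAM_AUTH they go through AIX's PAM stack instead.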
Next, in /etc/security/user we can configure different user attributes. One attribute, SYSTEM, tells us which loadable authentication module should be used to authenticate the user. By default in AIX there are two variants, files or compat; they are not really very different, but you can install additional authentication modules and add much more. This works with the AIX-only function authenticate(); it's not POSIX, it's not the Single UNIX Specification, it's just AIX. You get the username from the user and pass it to this function, and the function reads /etc/security/user, works out the SYSTEM value we configured, and says, okay, present this prompt to the user. So it can be that user one uses LDAP and authenticates against FreeIPA, for example; another user uses Kerberos and authenticates against Microsoft Active Directory; and a third user can use multi-factor authentication, say through GitHub. All on the same system. But that's not all, because as I said, the documentation sometimes lacks information. There is another function, authenticatex(), in the newer versions of AIX. Yeah, IBM is like me in this case; they have a very big brain for naming functions. I also name my variables i, j, k, l, m and so on. The difference is the extra state argument. Which of the two should be used, I don't know, but one of these two functions is used by login, and it then goes to the loadable authentication modules. We have the configuration for the loadable authentication modules in this file, and this is the standard way. As you see, we have one module for 32-bit programs and another module for 64-bit programs. And this is again a problem for me personally, because I like to use modern programming languages like Google Go or Rust, and they are 64-bit only on AIX. But in this case you also need a 32-bit program, which means you can use just C and that's it, nothing else.
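Putting the two files he mentions side by side, a minimal sketch looks like this; the user and module names are illustrative, and the module paths are the ones I remember AIX shipping, so double-check them on your system:

```
* /etc/security/user -- per-user choice of authentication module
user1:
        SYSTEM = "LDAP"
        registry = LDAP

* /usr/lib/security/methods.cfg -- 32- and 64-bit module binaries
LDAP:
        program = /usr/lib/security/LDAP
        program_64 = /usr/lib/security/LDAP64
```

This is where the 32-bit requirement bites: a module missing the plain `program` line cannot serve 32-bit callers of authenticate().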
And this is what comes by default: as you see, AIX delivers Kerberos and LDAP as default modules. They are just there; you need to configure them. But there are some pieces of information missing here. If you want to use LDAP, you must install the IBM Directory Server LDAP client filesets. They are delivered with AIX, but they are not installed by default. Similarly, if you want to use Kerberos, you must install the IBM Network Authentication Service filesets; also delivered, but not installed by default. In this particular case I use only LDAP, but you can also use Kerberos for authentication and LDAP as the directory service to enumerate users. So, do I need to do something on the FreeIPA side? No, really, no. You just install and use FreeIPA as you usually do. I have such an installation at a customer site and they didn't do anything special for AIX there. I usually do something: in my tests I usually set "OK as delegate" on the host. It's not for LDAP, indeed; it's more for Kerberos single sign-on, which I don't use here, but it works. And I create a separate ID view for AIX, because there are some gotchas with AIX. Bash is not installed by default on every AIX; only on newer versions of AIX is it there by default, and on older AIX versions we use Korn shell 93. And on AIX the standard user group has GID 1, not 100 as on Linux. That's why I do this using a view. And yes, here's a small Ansible snippet of what I just described. And again, thank you very much for the Ansible FreeIPA modules; magnificent. Now the AIX side. On the AIX side we just create the secldap configuration with this command, and we specify here our IPA server, the bind DN we use to connect to FreeIPA, and the password (you see, I use a very long and cryptic one), and where to start searching. This creates the configuration for the loadable authentication module automatically, creates the LDAP client configuration automatically, and starts the LDAP client.
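The command he is running is mksecldap; a client-side invocation looks roughly like the sketch below. The flags are from memory and the host, bind DN, password and base DN are placeholders, so check the mksecldap man page before using this:

```
mksecldap -c -h ipa.example.com \
    -a uid=aixbind,cn=users,cn=accounts,dc=example,dc=com \
    -p 'very-long-cryptic-password' \
    -d cn=accounts,dc=example,dc=com
```

On success it writes the LDAP stanza into methods.cfg, generates /etc/security/ldap/ldap.cfg and starts the secldapclntd daemon, which is the "automatic" part he describes.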
And as you can see here, it really finds on its own where to find users and groups. No magic, no rocket science. The only thing to note here is RFC 2307: on Linux you usually use RFC 2307bis, and here there is no bis. Why? Because, as I told you, AIX is a standards-based operating system, and RFC 2307bis has been a draft of a standard for 20 years, but not a standard. So sorry guys, it will not be implemented; that is the official answer from IBM, because it's not a standard. Everything else looks good. There are some configuration changes that, it says, must be done; in real life they should be done, not must. You want home directories for your users to be created when they log in. And I found that, okay, by default all the users are in the LDAP directory, and on the newer AIX these two parameters and the password policy don't play very nicely together with LDAP. Another feature is domainlessgroups: by default in AIX, if a user comes from the LDAP directory, they can only be in LDAP groups, not in local groups. Switching on domainlessgroups says, okay, the user can also be a member of a local group. And one more: if you use ID views in FreeIPA like me, just don't forget to add the AIX view here and restart the LDAP client. That's it. So everything is configured and everything is working, which is not so interesting. If something goes wrong, you can use LDAP queries to check what FreeIPA delivers to you, or what AIX sees on the FreeIPA side. And if something goes completely wrong, you can switch on debugging with these magical, almost nowhere documented variables. But be careful with them: first of all, they produce a lot of output; second, you can find even your passwords in clear text in this output.
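A quick sanity check of what the client should see might look like the sketch below; the hostname and DNs are placeholders, and while AIX ships its own LDAP client tools, any standard ldapsearch will do:

```
ldapsearch -x -h ipa.example.com \
    -b cn=users,cn=accounts,dc=example,dc=com \
    '(uid=user1)' uidNumber gidNumber homeDirectory loginShell
```

If the POSIX attributes come back here but AIX still cannot resolve the user, the problem is on the AIX mapping side rather than in FreeIPA.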
Okay, mapping. Sometimes you must change your mapping. This is the standard mapping that AIX uses, but you may find situations, I was at one bank, with a slightly different mapping, because they needed some additional fields in their case. You can change the mapping here, and it's rather easy, no rocket science. So does it work? Yes, but there is always some "but". There are some wishes. First of all, as I told you, standards are implemented however the developer decides, and AIX sees the password-last-change attribute a little differently than FreeIPA does: different formats of dates. It's on the AIX side, and IBM development has been trying to fix the problem for, I think, a year or so. That's what happens when you work with a closed-source operating system: you can't fix it on your own, sorry. I didn't find a way to make HBAC work. I have since got an answer, thank you, but I would like to have it working. And yes, FreeIPA is missing my favorite 50 AIX user attributes, and I would like to have them there; it's really one of the things I love about AIX. We have even more AIX-specific things, like role-based access control and trusted execution: if you switch it on, you cannot execute a binary on AIX without it checking the signature of the binary, and everything can be stored in LDAP. That's why it would be nice to have it in FreeIPA too. And yes, one more thing is missing: I would like to have a native FreeIPA client, which is not there. If you can help me, feel free to ping me and we will talk. Thank you very much for your time. Do we have any questions? One minute. Yeah. Not a question, just a side note about the standards stuff: it's sort of a catch-22. Until the RFC is implemented it can't become a proper standard, yet it is implemented by numerous systems. So probably raise the issue with the IETF? The IETF is not interested in finishing this work.
The people who originally started this work are not interested either, because everything is working for everyone. That's the position. Okay.
Your web app is taking up too much RAM. Let's fix it!
Hello everyone. Can you guys hear me properly? Nice, perfect. Yeah, today I wanted to start my presentation with quite a bold claim: your web app is taking up too much RAM, and we can fix it. This comes from a thing I noticed recently: if you look at your Chrome browser and hover over a tab, you will see that Chrome has started, at least for a while now, telling users the memory usage of your app. For most applications, for example GitHub, even while looking at a pretty big diff, the memory usage is not that bad. I mean, 122 megabytes was a lot in the 2000s, but now it's not that much. But if you look at other websites that are a bit more expensive, such as Airbnb, you can see that if we load a pretty big page, the memory usage goes way up; we're talking about half a gig of RAM being used by the browser. And I was wondering: is it our fault? Is it the browser? What's in that memory that is being used? We can find out how much of that is actually used by the JavaScript virtual machine, by our variables, our functions and our code. The way to do that is by opening the DevTools, where there is a special tab called Memory, and for each JavaScript virtual machine that is running you can see how much memory it is taking up right now. In the case of Airbnb it was about 111 megabytes, which is not much in absolute terms, but it starts to be quite a bit, especially when GitHub was around 10 megabytes in comparison. And then you look at some more extreme examples: here I fully stress-tested Notion by loading a quite big table, and we got to 1.5 gigabytes of RAM used just by JavaScript variables. That was quite wild, because if you think about it, that's a lot. That's a lot for a web page.
There are even worse examples, or I would say more difficult examples, like the product I'm currently building. I'm building this web-based tool called Flux, which is a tool for designing electronics in your browser, and it is quite complicated, because electronics is made up of a lot of different parts. It's built using TypeScript, React, Three.js, React Three Fiber; we use a bunch of technologies, and also a bunch of abstractions to make our life easier, and that had an effect on us. Because we wanted to be able to render very complicated documents with a lot of different shapes and text, and everything has to run at 60 FPS, you can see how holding a big project can take a lot of RAM, and that's something that backfired a bit. Why? Well, because originally we focused a lot on performance: we wanted everything to load very quickly, we wanted scrolling to be fast, so we just optimized for speed. We were like, yeah, memory is cheap, let's just use all the memory we have, so we optimized for what the CPU profiler said, not what the memory profiler said. We actually did this because of an article from a while ago saying: if you're building React apps, just memoize everything, cache everything you can, because it's not going to be an issue in most cases. Little did we know that we were one of those cases. And in fact, you can see how, if you load a pretty big document, at least before this talk, the app would take too much RAM. But I can already hear someone saying: well, okay, I have 16, 32 gigs of RAM on my desktop, why do I care about memory usage? We're not in 1999 anymore. Well, there are still a couple of reasons why we care about this now, and one of them is out-of-memory crashes. If you're not optimizing memory usage, the browser will limit you.
In most cases, for example in Chrome, if you go over four gigabytes you will get this "Aw, Snap!" error code 5, which is an out-of-memory. There is no way to catch it, no way to recover from it; the only thing you can do is prevent it from happening in the first place, because the user will need to refresh the page to fix it. On iOS it's even worse, because there the limit goes down, and no one is really clear about what the limit is. For example, on Safari on iOS the limit can sometimes go as low as 300 megabytes, and this is what you get: your browser loads the page, tries to load the page, goes out of memory, refreshes the page, and enters an infinite refresh loop. Your users will report that, and that's when your product manager comes screaming into your office: why is the application not loading on my phone? Because you're using too much RAM. So yeah, clients might have a lot of RAM, but your browser doesn't care; it will not let you use it. Another thing is that we also care about garbage collection performance: the more you allocate, the more you will need to deallocate later, and that's something you have to care about, because in some cases garbage collection times can really hurt your performance. This is a bit of an extreme case, that's like one minute of garbage collection, but this next one is a bit more realistic. We were debugging an event handler that was supposed to run on mouse move, so something right on the hot path, and the major garbage collector took 0.5 seconds. That means there was a sharp drop in frames per second just because the garbage collector had to kick in. So that's another thing you want to watch if you care about performance.
Also, memory is part of your performance optimization strategy. And another thing, as I showed before: Chrome is now showing the memory usage of your website to your users. So if your users are using, like, 12 tabs, or if you are insane like me and have 10 browser windows open with a thousand tabs each (yeah, I should start closing them, maybe tomorrow), the users will be able to see that it's your website that is taking up their entire RAM, and they will not be happy with you. Now they will know which one it is. So yeah, we're in this situation, for example my situation: how do we solve it? How did we approach this problem at Flux? Well, first of all it's important to figure out what is occupying memory. Once you do that, there are multiple strategies you can use to kill it with fire. And then we also want to make sure we don't make the same mistake again; we can set up some checks in CI, or set up some monitoring, even with remote users, to check that the memory usage stays reasonable. In today's talk I wanted to focus mostly on the first point, because that's already a lot to talk about. Before going into the tooling, I wanted to introduce some ideas about memory usage so that we know what we're talking about. I like to make certain distinctions when talking about memory usage; this is something I made up in my own analysis, and I noticed there is a pattern of having either static or transient memory usage. What are we talking about here?
Static memory usage is when you have variables that are taking up a lot of RAM but are long-lived: global variables, state that stays there and doesn't really change throughout the run of your application. That's basically what you find in a heap snapshot; it's the easy case, for example a document that loads and takes up a lot of RAM. But you don't necessarily have a situation like that. Sometimes you have a transient peak of memory usage: for example, the user clicks a button, and that button triggers a very quick operation which allocates an array with one million elements. You see it as a peak in the memory usage at that point, and that can sometimes be harder to debug, because you won't find it in a heap snapshot: a heap snapshot just takes an image of what's in your RAM at that moment, and a peak of memory that is deallocated immediately won't show up in it. So there are different strategies depending on whether you have the first or the second type of memory problem. Another thing I like to consider is the count and the size of things. Why?
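A minimal Node sketch of the two patterns he distinguishes; the names and sizes are illustrative, and in a browser you would observe this through the DevTools Memory tab rather than in code:

```javascript
// Static usage: a long-lived global that a heap snapshot would show.
const documentState = new Array(1_000_000).fill(0); // stays alive

// Transient peak: allocated inside a function and unreachable right
// after it returns, so a heap snapshot taken later never sees it.
function onButtonClick() {
  const scratch = new Array(1_000_000).fill(0); // short-lived peak
  return scratch.length;
}

console.log(documentState.length, onButtonClick()); // 1000000 1000000
```

The transient case is why he recommends allocation sampling later in the talk: snapshots catch `documentState`, but only a recording over time catches `scratch`.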
Because you might have a very easy situation, analysis-wise, in which you're allocating one 500-megabyte string or one 500-megabyte array. That's very different from allocating millions of small objects, and in each situation you need a completely different approach to analyze it: a giant object will just show up in the memory profiler immediately as one very big object, while millions of small elements are much harder to analyze, because you'll need to check what's inside those four-byte objects. Another thing I like to bring up is the difference between shallow and retained size, two terms you will see in the memory profiler. The reason for the distinction is that in JavaScript everything is a pointer: if you have an array of strings, it's actually an array of pointers to strings. The array itself can be very small, on the order of bytes, but the stuff it points to can be giant; it could be pointing to a lot of one-megabyte strings. The shallow size is the size of the allocation itself, such as the array, which is small; but that array is causing other memory to stay allocated, because it refers to those one-megabyte strings. The retained size is instead the total amount of memory that that object or array is forcing to stay allocated and preventing from being deallocated. And there's one last topic, also quite complicated: allocation types. In JavaScript there are multiple things you can allocate: objects, code, strings, plain arrays, typed arrays, and also closures, and each of those behaves differently in memory. One interesting consequence is that functions, too, can take up memory if you're not careful, because functions need to save
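A tiny sketch of shallow versus retained size; the byte counts are approximate (V8 stores ASCII strings at one byte per character), so treat the numbers in the comments as rough:

```javascript
// Each element is ~1 MB of character data, reachable only via `refs`.
const refs = ["x".repeat(1024 * 1024), "y".repeat(1024 * 1024)];

// Shallow size of `refs`: just two pointers plus array overhead.
// Retained size of `refs`: ~2 MB, because dropping `refs` would let
// the garbage collector free both strings.
console.log(refs.length, refs[0].length); // 2 1048576
```

This is why a profiler sorted by shallow size can make a tiny array look innocent while it is actually pinning megabytes.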
all the variables that are around them. Technically a function is an object as well in JavaScript, which means that even creating functions in a loop can become a memory problem; it's the same thing as creating an array of objects. Sometimes you can just look up the V8 and Chrome documentation and find a lot of interesting things about how memory is used internally, but that's another topic for another talk, I would say. Instead, I wanted to look into tooling: if we are in a situation where we have a lot of memory usage, what are some tools we can use to start analyzing what is going on and how to solve the problem? The most famous one is the Chrome memory profiler, that Memory tab you probably saw next to Performance in the Chrome DevTools. It's quite powerful because it can work in three different modes; I think the most interesting ones are the heap snapshot and allocation sampling, which work in very different ways, for different purposes. With the heap snapshot you can take a big snapshot of everything that's in your RAM, everything JavaScript is working with. Imagine that you created a lot of variables in your code; with this you can save all of them and look at what's inside, which is really cool because you can even see the values you have there. And for each allocation, you can also see the retainer, which means: why is this in memory, who created it, and who is holding references to it? That's useful to determine which function caused that thing to stay in memory. Heap snapshots are very useful for checking things like static memory usage, because they take a snapshot at one point in time. If instead you're more interested in transient memory peaks, as I said before, there is this other tool called allocation sampling, which works by
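The closures point above can be sketched like this: each function created in the loop keeps its own captured array alive, exactly like filling an array with objects (names and sizes are illustrative):

```javascript
// Each closure captures a fresh ~8 MB array of doubles, so `handlers`
// retains roughly ten times that until the array itself is dropped.
function makeHandler() {
  const big = new Array(1_000_000).fill(0); // captured by the closure
  return () => big.length;
}

const handlers = [];
for (let i = 0; i < 10; i++) handlers.push(makeHandler());

console.log(handlers.length, handlers[0]()); // 10 1000000
```

In a heap snapshot these show up as closure allocations whose retainers point back at the loop that created them.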
accumulating every allocation that happens. This means that everything that gets allocated is recorded, but you don't see the deallocations, so you can't really measure how much RAM you're using at a given moment; you can only measure who is creating it. In our case, we had objects, not too many of them, but some of them were taking a lot of RAM, like 89 megabytes. We had one specific object taking a giant amount of memory, around 80 megabytes, and by looking at the retainers we were able to immediately figure out which function was allocating and retaining that stuff. That was one of the very first optimizations we managed to do, because we went into the code, into that function, and realized we were basically creating a bunch of functions (this is React code) and a bunch of string UIDs, and saving all of them in a map. Apparently that's incredibly inefficient. It's probably not code that seems inefficient at first glance, but if you call it thousands of times, it was apparently taking up 80 megs of RAM. So how did we solve it? We refactored it a bit, using a set instead of a map; it's very experiment-based. With this we got something like a 50 percent improvement in memory usage, which was huge. It really made the difference between being able to load some projects at all, versus documents that would just crash your browser. That was one of the first big wins we had, so we were like: yeah, okay, let's continue, eventually we'll reach zero megs of memory used, right? No. Immediately after that we hit pretty much a brick wall: we were taking heap snapshots and seeing two million objects that together took a lot of space. It's not that we had one big object to optimize; each of them was a couple of bytes, and the heap profiler really doesn't help you in those cases. And that's interesting,
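He doesn't show the exact code, but the shape of the map-to-set refactor is roughly this; all names here are hypothetical stand-ins for the real Flux code:

```javascript
// Before: one closure per UID, each kept alive by the Map entry.
const callbacksBefore = new Map();
function registerBefore(uid) {
  callbacksBefore.set(uid, () => console.log("seen", uid));
}

// After: store only the UID in a Set and share a single function,
// so no per-entry closure object is allocated at all.
const registeredAfter = new Set();
function registerAfter(uid) {
  registeredAfter.add(uid);
}
function isRegistered(uid) {
  return registeredAfter.has(uid);
}

registerAfter("abc-123");
console.log(isRegistered("abc-123")); // true
```

The saving comes from dropping the per-entry function objects, not from Set being inherently cheaper than Map.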
because that's pretty much the same situation you will find if you try to profile that same Notion page I tested before, or even Airbnb; it's actually the same problem. And unfortunately the answer is: the problem is React, kind of. We're in the same situation as Notion: is it two gigs of RAM in one object? No, that can't be; it's just being occupied by a lot of small objects. So yeah, we hit a brick wall. But what do we do now? The heap profiler is very bad at analyzing this kind of stuff. Thankfully we can export from it; we can export a giant five-gigabyte JSON from Chrome, and then we look at the JSON and see that it's in a format that is pretty much unreadable. But thankfully someone did the work for us, the folks at Meta, with a beautiful tool called MemLab, which is a toolkit for exploring memory usage. It's very focused on finding memory leaks (it has an entire automation for that), but I think it's even cooler that it provides a very powerful API for opening snapshots from Chrome and analyzing them. What you can do, basically, is read the objects in memory and perform analytics on them. For example, we wanted to answer this question: which type of objects is taking up the most space out of the two million that we found in a snapshot? This is some code that we wrote; I don't think we have time to go too much into it, but I can publish it. The idea is that we load the snapshot, that is, the saved state of memory, find all the object types (what would be, roughly, the TypeScript type of each object), compute the total shallow size for each type, and then sort and print the results. And the results were very cool, because we had, for each object type, even including the keys of the object, how much memory it was occupying. We were able to see that in the top two we have one object called FiberNode, which is
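The analysis he describes can be sketched as below; the `nodes` array is hand-made sample data standing in for what MemLab would read out of a real snapshot, and the field names are assumptions, not MemLab's actual API:

```javascript
// Group heap nodes by type, sum shallow sizes, sort descending.
const nodes = [
  { type: "FiberNode", selfSize: 200 },
  { type: "Object",    selfSize: 48  },
  { type: "FiberNode", selfSize: 200 },
  { type: "String",    selfSize: 16  },
];

const totals = new Map();
for (const n of nodes) {
  totals.set(n.type, (totals.get(n.type) ?? 0) + n.selfSize);
}
const ranked = [...totals.entries()].sort((a, b) => b[1] - a[1]);

console.log(ranked[0]); // [ 'FiberNode', 400 ]
```

On a real five-gigabyte snapshot the aggregation is the same idea, just streamed through MemLab's node iterator instead of an in-memory array.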
from React; and another object that had baseQueue, baseState, memoizedState, next... what is that? That is not something that came from our application. That's React again: it's the data structure used internally for keeping track of hooks. We went into the React source and saw exactly that data structure, which in most websites that use React heavily nowadays is pretty much the thing occupying the most memory. So we figured out that keeping track of hooks is expensive. But are we supposed to just tear down the 400,000 lines of React that we have in our app right now? That's a bit too far into the development. So we wanted to know precisely what we need to optimize, and we used MemLab again, this time digging even deeper into this FiberNode data structure used by React, running a lot of statistics on it to figure out which React component was taking up the most memory, so that we could optimize that specific component first. And we managed to do this: we were able to divide our memory usage by React component, see for each hook how much memory it was using, and with this we found one specific React component that was using a lot of memory and cut the memory usage down again by 60 percent, which was pretty nice. MemLab really saved us here, because we were able to make our app work properly. It also made it possible to answer other questions, like: out of all the strings that we have in our app, how many are UIDs? Should we start optimizing UIDs and make them numbers? Well, no, because we used MemLab to find all the UIDs and found out it was like two megabytes in total, so who cares. It's also nice to know what not to prematurely optimize. So, to sum up everything I said: I think we can all agree that memory analysis is actually difficult, especially because it varies so much
between applications, frameworks, and browsers. But it's important, even in a world like today's where we have a lot of RAM, because for some apps it really makes a difference: the difference between being able to use Notion on your phone or the app constantly crashing and never loading your data. And the thing is, the Chrome profiler is cool, but sometimes it's not enough. Thankfully it can export, so at least you can perform your own analysis externally. So thank you for listening to my presentation. Are there any questions? I see a question here. Yes: you were talking about shallow size versus retained size. When would you ever be interested in looking at the shallow size? The retained size sounds like the more interesting one. So, the question is about when we care about shallow size when we also have retained size. Well, we cared a lot about shallow size. In our case it was all about shallow size, which is why we wrote our own custom plugin for memlab to analyze just shallow size. Why? Because if you are rendering very big objects with thousands of lines, you have to use tricks like virtual scrolling: instead of allocating all the DOM elements, you keep reusing the same ones. And if you think about it, that's like ejecting from React, because you are creating something just with JavaScript and the DOM and then creating a React wrapper for it. So that's another thing that shows React is good at orchestrating stuff, but for the performance-critical things inside your application you need to start optimizing differently. Just a small remark before we continue with the questions: if there are spaces, please try to squeeze in and not leave gaps in the middle. As you can see, we have hundreds of people waiting outside and here as well, and we cannot have that many people on the sides. So please try to
squeeze and don't leave seats free for your jackets or something; put them on your lap, thank you. And since we're starting to be quite a lot, if you're going to go out, please go out from the right side and avoid the left side, so that it's easier for everyone. Thank you. We have a question here. First, it's more of a comment than a question. The thing is, this limitation of four gigabytes of memory comes from the fact that Chrome compresses pointers so that small objects take less space; that's one thing. The second thing is that it's a security mitigation, so that when there is some bug in V8, it's harder to exploit. But I've also read on the Chromium bug tracker that there is, for example, a 16-gigabyte limit for fixed arrays, so there may be different limitations for different things. WebAssembly also has a different limitation, and supposedly Electron apps don't have the limit. Yeah, that's very cool, thank you. I think Firefox has pretty much the same limitations. Oh, you're asking whether we also tried other browsers? I'm mostly working in Firefox, actually, and Firefox has very similar limitations. Sometimes it's even worse, because we notice the app randomly takes more memory in Firefox for some reason; some things are more optimized in Firefox, other things in Chrome. So that's very complicated to answer, unfortunately, because it seems the answer is either you look deeply into the source code of the browsers (I still haven't reached that point, unfortunately) or you do trial and error. Ah, the tooling: Firefox also has tooling around this which, if I remember correctly, is more focused on analyzing the memory usage of DOM elements, and it also has some facilities for analyzing heap snapshots. But since memlab works with Chrome heap snapshots, we went with that immediately.
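As a sketch of the analysis described a moment ago (load the snapshot, total the shallow size per object type, sort, print), here is a minimal, framework-free version. In the real pipeline the nodes would come from a Chrome heap snapshot, for example via memlab's heap-analysis package; the node shape and the mock data below are simplified stand-ins for illustration.

```javascript
// Minimal sketch of "total shallow size per object type".
// In the real pipeline the nodes come from a Chrome heap snapshot,
// e.g. loaded with memlab's @memlab/heap-analysis package; here each
// node just carries a type name and a self (shallow) size in bytes.
function shallowSizeByType(nodes) {
  const totals = new Map();
  for (const node of nodes) {
    totals.set(node.name, (totals.get(node.name) ?? 0) + node.self_size);
  }
  // Sort descending, so the biggest offenders (FiberNode, in the talk) come first.
  return [...totals.entries()].sort((a, b) => b[1] - a[1]);
}

// Tiny mock snapshot standing in for the two million real objects.
const mockNodes = [
  { name: 'FiberNode', self_size: 200 },
  { name: 'FiberNode', self_size: 300 },
  { name: 'Object', self_size: 100 },
];
console.log(shallowSizeByType(mockNodes));
// [ [ 'FiberNode', 500 ], [ 'Object', 100 ] ]
```

The same grouping idea extends to the per-component breakdown mentioned in the talk: group by whatever key you extract from each node instead of its type name.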
And how do you go about running this in CI? Oh, that's a complicated thing, because running it in CI is pure pain. You can use memlab and run it in CI, because it uses Puppeteer (I don't remember if it's Playwright or Puppeteer; I think Puppeteer), and with it you can orchestrate tests that open a page. It can even use a machine-learning algorithm to find memory leaks. The problem with doing that is that it's fine if your app is small; if your app starts to get bigger, you will need a CI machine powerful enough to run your app and the profiler on top of it, which for us meant the CI time went up by something like 30 minutes, which was unacceptable. So eventually we removed it, but you can do it. Are there other questions? There's a question there: can you read memory usage from the browser itself, or something like that? Yeah, that's another complicated thing, because if you are using Chrome (I don't think Firefox allows it, but Chrome does), you have a specific performance.memory API that you can use, and you can check both the maximum allowed heap size and read an estimate of the current memory usage. In our case, once we read that, we are constantly sending the data to Segment and then analyzing it in Amplitude, with which we can keep track of memory usage; we also do that for performance timings. The problem is that we noticed this data very quickly becomes bogus, because it depends a lot on what the user is doing and on when the garbage collector kicks in. Sometimes memory goes up to four gigabytes and then, no problem, goes down to 500 megabytes. So it's extremely difficult to capture memory usage, because you don't have a precise measure of how much of the total retained memory is active and how much is actually inactive and going to be garbage collected soon. We tried to do it, and we have some charts showing how much memory is being used, but it's
very hard to make sense of them, unfortunately. Any other questions? We still have around five minutes for questions.
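The sampling approach described in that last answer (read Chrome's non-standard performance.memory object and ship samples to analytics) can be sketched roughly like this. The field names are Chrome's; the wrapper function and its return shape are invented for illustration.

```javascript
// Hedged sketch of sampling Chrome's non-standard memory API.
// performance.memory exists in Chrome but not in Firefox or Node,
// so the reader has to tolerate its absence.
function sampleMemory(perf = globalThis.performance) {
  const mem = perf && perf.memory;
  if (!mem) return null; // not Chrome: nothing to report
  return {
    heapLimit: mem.jsHeapSizeLimit, // maximum heap size allowed
    heapUsed: mem.usedJSHeapSize,   // rough estimate; GC makes it jumpy
  };
}

// In a non-Chrome environment there is simply no data:
console.log(sampleMemory({})); // null
```

A real telemetry pipeline would call something like this on a timer and forward each non-null sample, keeping in mind the caveat from the talk: the numbers swing wildly with garbage collection, so individual samples are noisy.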
Codebase Conquest: How Nx Turbocharged Our React Workflow
Thank you all for being here and for waiting, sorry for that. Our next speaker is Nicolas, a staff engineer with a lot of experience, and he is here to talk about Nx and an actual use case he encountered during his time at Hasura. Thank you, Nicolas; a round of applause. So, does your build time keep getting longer? Well, maybe you extract some packages into separate packages. But then, once the packages are extracted, the dev time to work on them and integrate them into your app starts to explode. And then it's hard to keep the two versions in sync. Yeah, at Hasura it was the same. The build time was about 15 minutes for the frontend. The dev reload time was about 5 minutes: you make a change, you wait 5 minutes, and then it's actually done. And the tooling wasn't proper everywhere. So we had to make a change, and this is the story of that change. First, who am I? I'm Nicolas, a staff engineer at Payfit. You can find my Twitter and my blog; this talk also exists in written form on my blog if you want to dig further. So let's get back to the topic. What was the setup? We had two code bases: the open-source one and the enterprise version. We extracted some of the code from the open-source code base into a bundle, through extra layers of webpack, and then installed this into the enterprise application. Seems pretty standard, right? But the tooling wasn't the same everywhere. On one side we had TypeScript, Jest, Storybook, Chromatic, Cypress: a very good developer experience and everything. On the other side, which, let's remember, enterprise clients pay for, we had JavaScript, no TypeScript, Jest, and that's it. No Storybook, no end-to-end tests, nothing else, because it was so complex to work in this second part of the application. That was the setup we ended up with. But that's not all; it gets worse. We had a thousand lines of custom webpack config just to bundle one part of the application into the other.
Lock-file management was hell. When you changed one thing in one place, you had to make sure the lock file, not the package version, the lock file, was the same in the other place. Otherwise, things would crash in production, and without end-to-end tests you only found out in production or when you tested your dev environment. CI was very slow because of this whole system. So we wanted a monorepo tool: let's have everything inside a single monorepo and have the parts work better in union instead of in isolation. We made a wish list of what we wanted from the monorepo tool. We wanted task orchestration: saying "build this app before this one". We wanted dependency-graph visualization, because right now we had two packages, but in the future we'd have more, and we wanted to see what the hell is going on without having to guess, dig through packages, and read code. We wanted consistent tooling: say we have Jest, with the same Jest config and the same Jest version everywhere. Because yes, it wasn't the same version of Jest before; fun to debug. We wanted project constraints: for example, the open-source edition shouldn't be able to import the pro edition, because you don't want to give away for free the things companies get paid for. We wanted distributed task execution, so that we could scale CI by adding more runners and say "run those jobs in parallel and deal with it however you want". And as a bonus point, we wanted code generation, so that scaffolding was baked into the tool and everything was done for us. So with this wish list we went into the ecosystem, looked at every tool that existed, and checked every one of them. First, a small disclaimer: this work happened about a year ago, and new tools have appeared since. moonrepo didn't exist back then, so if you want, you can also look into moonrepo. And I also want to shout out to all the engineers working on those monorepo tools.
They are amazing; if you ask anything, they are always willing to help, so kudos to them. So what did we look into? First one: Bazel. Bazel is made by Google to handle Google's monorepos. It's huge and you can do a lot of things with it, but it's also very complex to use. We looked at Gradle, because yes, Gradle can do other things than just Java: it's tailored to Java, but you can do JavaScript, you can do Go, you can do whatever you want in it. We looked at Lerna, which is the historical, classical tool to manage a monorepo from the old days of JavaScript. We looked at Nx, because I had used it in the past, in the Angular days, when Nx was only an Angular plugin; and yes, it is a real monorepo tool now. We looked at Pants, which is mainly used at IBM but also in other places; it turns out it's pretty good if you want to experiment and give it a try. And we looked at Turborepo, because of all the hype: Turborepo was supposed to solve everything, so it was on the list. So let's see. We wanted task orchestration: well, they could all do it, so that's good. We wanted dependency-graph visualization, and Gradle and Pants didn't support it, so those two were out. Then we wanted consistent ecosystem tooling: Turborepo didn't support it, Lerna neither. So we ended up with either Bazel or Nx. Project constraints: they both support it, amazing. Distributed task execution: they both support it, cool. And code generation: well, Bazel didn't support it. While we could have added code-generation utilities to Bazel with extra code, Bazel was also way more complex to set up than Nx. So in the end, Nx was the tool that met the needs we had at Hasura. If you want to learn more about those tools, this is a great resource.
It's open source and contributed to by many of the maintainers of these monorepo tools: you have a graph of all the main features that make up a monorepo tool, and each project is listed with what it can or cannot do. So we had our tool, Nx. But it turns out there are two flavors of Nx: integrated or package-based. First, package-based. Package-based behaves like a pnpm, Yarn, or npm workspace: you have many packages, they all link together, and it works pretty well. But it doesn't give you consistent tooling; you can do whatever you want in your projects. The migration path is way easier, because you basically just drop an extra JSON config at the root and it's done. But there are still build steps between the libraries. Let's remember why we are doing this: we want the builds between libraries to be way faster, so that we don't rebuild everything every time. So what is integrated? Integrated means that every tool in the workspace is unified and the monorepo is considered one unit. Every tool is consistent, because every tool has the same version and the same configuration everywhere; you can tweak it in a specific project, but the base is the same. The migration needs more thought, though, because you need to decide how you want to migrate: do you want to align with Nx conventions, or do you want to bend Nx to your will? You can do both. But thanks to this, we get optional build steps between libraries, which means we could solve our speed issues. And there is one more thing: plugins. What the hell is a plugin? A plugin can do three things. It provides generators, which let you scaffold the basics: Nx, new library, done; Nx, new application, done; Nx, new Storybook, done. It provides executors, which wrap a tool to make it simpler for you to consume. And the best part is automatic migrations: for example, a new version of Jest comes out and you need to update your tests to have a new configuration for the timers.
Nx will migrate your code for you automatically, and it works 95% of the time; you won't have to do anything. This was really helpful for us, because the code base was huge, like a million lines of code, and it was hard to maintain. So that's all good, but we're engineers, right? Trade-offs: not everything is green. There are two big ones. The first is the single-version policy. It states that there may only be one version of a dependency or package inside the monorepo. While it adds an extra constraint, it's also the recommendation within any monorepo, because if you have a library built with React 16 and another one with React 18, you cannot import the React 16 one into the React 18 one. The way I see the single-version policy is a bit like buying versus a loan with interest. When you want to migrate React, if you buy, you just bite the bullet: you maybe spend a bit more time, but you do everything at once and everything is aligned. Whereas if you take the loan, the migration means spending time on many packages, one by one, over time, and every time you have to regain context: how do I migrate this again? Every time you want to migrate to a new system, it takes way longer in the end; buying, though, is the bigger investment up front. You pick. Depending on the tools is another constraint. You have to wait for the tools, meaning that, for example, when this new Jest version came out, you had to wait for Nx to update their setup so that it would automatically migrate it for you. In enterprise software, waiting a day or a week for a new Jest version is not that big of a deal, to be honest. And it's way better now, because they work hand in hand with the actual engineers building those tools, and some of them actually work at Nx now, so that helps a lot. And if you need it, there are plenty of escape hatches, so you can do whatever you want in the cases where you need to. So we know what we want: we want a monorepo, we want Nx, we want integrated.
How do we proceed? Because we're not going to say "we're freezing production for six months until we've migrated everything"; that's never going to work. So the goal was to migrate incrementally, without stopping the day-to-day work. And we had some requirements for this migration. First of all, we wanted no code freeze during the migration. We had many engineers working on the code base, and we never wanted to say "stop working for half a day every week so that we can migrate stuff"; that's not feasible. We wanted as few regressions as possible: nobody likes bugs, and neither do our customers. We wanted to adhere to Nx conventions, so that automatic migrations were as easy as possible, which meant less maintenance in the end. Furthermore, with standard tools you get reusable skills: you can switch teams and everything is the same, which is nice. For companies that do lots of re-orgs, that's a big seller. And we wanted to keep our seven years of Git history. Git history is sometimes the only way we can debug something, given the legacy JavaScript and such, so we wanted to keep it. So here was the situation. We had our current code base. We created a new Nx workspace, just created a new workspace, imported the code into it, and built it. Is it working? Yeah, everything is done! Except not. Things broke, obviously, because our code had many issues. So the next step was to identify what was broken, fix it in the current application, and then start over again. The good thing about this migration path is that at every step of the way we provided value to the developers working on the old system while preparing the new one. And at some point we identified some tweaks we needed to make to Nx, so every time we created a new workspace, we applied those tweaks beforehand.
And we went through this cycle many times to make sure it worked at every step of the way; we even had a cron job doing it on a weekly basis to make sure everything was good. I mentioned we had to make tweaks to Nx. One thing we had to tweak was the JavaScript import paths, because we had `@/` imports, and in a monorepo `@/` means nothing, because there is no root, there are only packages. We tweaked it so the migration wouldn't be blocked and wouldn't require a lot of work on the previous code base. We had to include Node.js polyfills, because even though no Node.js code should end up in the browser, we all have Node.js code in the browser, things like the http module. We had to make some specific changes to the webpack config, like SVG handling. And we had to disable some ESLint rules because, well, our code wasn't up to standard, obviously. So that's what we needed to do to Nx. What about our code? First of all, we had CSS modules without the `.module.css` extension. They behaved like CSS modules, but we didn't have the extension, so we had to fix that. We relied on the ability to import CSS in TypeScript; it shouldn't have worked, but somehow it did. Thanks, webpack 3, I guess. We had to change this so that it worked with webpack 5. Path imports relied heavily on the webpack config, so we had to change that too. We had to update Jest and TypeScript to versions compliant with Nx. We had to update the entry points so that they only export a component and don't mount the application. And this was the kicker: it turns out the build compiled with a lot of circular dependencies. Like, a lot: 150 loops of circular dependencies within the code base, and that was just one of the libraries, not even the bootstrap of it. So we had to dig through and fix our code, basically. We went down through 95 of them, and then webpack was able to compile the application and the browser was able to load it. So that was good. What did it look like in the end?
We had our pro application, which loads the pro library, which imports the OSS library; the OSS application, which loads the OSS library; and the end-to-end tests, which import both the libraries and the applications. This diagram, by the way, was generated by the Nx graph of the workspace; we didn't have to do anything. So, all good, right? Everything is nearly ready; we just need to switch. And switching meant keeping the Git history. To keep it, we first made a commit to clean up the old workspace. Then we made a second commit that just `git mv`-ed everything to the new place. Then we made an archive of the OSS repo because, being an open-source product, we wanted to make sure contributions wouldn't end up broken because of this. On both commits we applied the known tricks, and then we were in Nx land. This way, the second commit could be flagged for git blame, to make sure git blame doesn't pick up that commit, so we still kept our Git history for whatever we needed. In the end, the total freeze time for this migration was three hours. From the beginning to the actual end of the migration: three hours total. It wasn't a freeze lasting a few months. And the three hours were because CI was slow to run on the four commits I mentioned before. So, all good; what about the results? We want numbers, for users and for developers. First, for users: zero bugs in production. That was great. Because of the incremental approach we took, we could verify at every step of the way that we hadn't broken something; otherwise we would have spotted it in the app. The other surprise was that, because everything was unified, the bundle size decreased quite a lot: from 43 megabytes to 13 megabytes. And the funny thing is when you get a call from a sales representative: "thank you, Nico, I can finally use the app locally without it being too slow to load." Thanks, I guess. It's a bit weird; you couldn't before, but still. So this helped us with the load time.
The application now loads about five seconds faster thanks to this. Okay, that's good for users; what about devs? Well: 30x faster local dev. Because we no longer had a build step at every step of the way, we went from five minutes to ten seconds. This was life-changing. Imagine debugging something where you make a change and wait five minutes to see the console.log you added show up; now it's ten seconds, an instant compared to what we were used to. And CI was about 60% faster in the worst-case scenario; in the best case it's about 80% faster, thanks to caching and things like that. All right, good. Is it the end? Are we done? We are now in Nx land, we have the packages; are we good? It could be. It could be the step where you say this is good enough, we don't want to go further. But you could go further. One of those areas is architectural decoupling, where you say: I want to make sure my open-source code doesn't import my enterprise code. And you can enforce that thanks to a lint rule in Nx. You have a lint rule, based on tags, which basically says that pro code can import shared, OSS, and pro, and that's about it; shared can import shared. Visually, it looks like this: you ensure that libraries in a scope can only import within their scope, or the scopes they are allowed to reach. This helped us heavily to ensure that open-source code stayed open source and enterprise code stayed enterprise, and that open source couldn't import, through the tooling, production things like the cloud enterprise code. The other thing we did to go further was to unify our tooling. During the migration we had just added Nx; then we generated new end-to-end tests. We added end-to-end tests for our pro version, and it cost us like 20 minutes to do. We now have Vitest in some of the new projects. And we also made our own custom plugin, because you can make your own plugin; it's relatively easy. And thanks to the plugin, we can create a new library.
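The scope constraints described a moment ago (pro may import shared, OSS, and pro; shared and OSS must never reach into pro) map onto Nx's `@nx/enforce-module-boundaries` ESLint rule. A hypothetical configuration, with tag names invented for illustration; each project would declare its own tags (such as `"tags": ["scope:oss"]`) in its project configuration:

```json
{
  "rules": {
    "@nx/enforce-module-boundaries": [
      "error",
      {
        "depConstraints": [
          {
            "sourceTag": "scope:pro",
            "onlyDependOnLibsWithTags": ["scope:pro", "scope:shared", "scope:oss"]
          },
          {
            "sourceTag": "scope:oss",
            "onlyDependOnLibsWithTags": ["scope:oss", "scope:shared"]
          },
          {
            "sourceTag": "scope:shared",
            "onlyDependOnLibsWithTags": ["scope:shared"]
          }
        ]
      }
    ]
  }
}
```

With constraints like these, an import from OSS code into a pro library fails lint, which is exactly the "don't give away what companies pay for" guarantee from the wish list.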
I want a library with this scope and this type; put it in the right folder for me, I don't care, do it for me. And the naming is automatic; everything is automatic. In those cases you can even generate the code owners automatically, or update the CI if need be. Because in the end, thanks to the plugin, you get the specifics of your tooling out of the developers' and engineers' minds and into automation. We all know the documentation that is never updated; tooling, on the other hand, is always updated, because we use it regularly, so if it holds all of that knowledge, we can rely on it. So, in the end, what I wanted to say is that working on a large code base shouldn't feel like this: you're not sure you're going to break something, you're not sure what your change affects, you have no idea what is going on. Instead it should feel like this: a happy dance. We just pass the ball around and have things moving in the right direction. Thank you for your attention. Are there any questions? So, in this case, we didn't use npm to share to the outside. However, Nx supports releasing packages, and thanks to the Nx plugin it can understand your workspace and create a package for your library to be published publicly on npm. Next week there is a launch event for Nx, and they are going to announce something that may be related to your question. Are there any other questions? Yes. Can you hear me well? Yes. My question is: what was the main reason for such a decrease in the bundle size? Is it because of all those cycles you removed in the code? So, the question was why we ended up with such a large reduction in the bundle size. What happened before, as I showed at the beginning of the talk, is that we had one part of the application that we bundled into a package. Sorry, there are a lot of slides; anyhow, I think you remember it close enough.
So, what we did before was export a large part of the application into a package and then import this package into the pro code base. The first change now is that webpack has a unified view of the whole system and can do much better tree shaking. With that middle package in between, webpack couldn't understand what was actually imported into the final application and wasn't able to tree-shake as aggressively as it can now. So that was one huge step that helped us. The second step was having an updated webpack configuration and tooling, which meant we didn't need to target IE anymore; that alone shaved about 5 megabytes off the bundle. Both things combined, plus better CSS processing, again with a unified view of the whole system, produced that decrease in bundle size. Yeah. So today, if I were doing a similar migration, I would do it using Nx too. There is one new tool I would investigate, called moonrepo, which is similar in some respects to Nx. But to this day, for an enterprise-ready product, I would still use Nx, because the thing they are moving towards is a much smarter CI: if your CI can understand your workspace, it can also understand better what to run and what not to run. So to this day, Nx would still be my choice; in the future I would investigate moonrepo to see if it could make sense. And unless you have a huge scale, like 10,000 engineers, Bazel wouldn't make sense, because you'd need a team of like 20 engineers just working on Bazel. So yeah, that's my answer. Yeah. So, just to make sure: when you started with Nx, you imported it package by package, but you threw away the results in the end? Yeah. And you redid it at the end, in those three hours? Yeah. This way, we made sure the old system was being updated with the changes we needed to make, so that if we had to stop for whatever reason, we would still have provided value to the existing code base.
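The Git-history trick described earlier (one commit that only moves files, plus an ignore-revs file so git blame skips the move) can be sketched on a toy repository. The repo layout and file names below are invented for illustration.

```shell
set -e
# Toy repo standing in for the real code base.
git init -q demo && cd demo
echo 'export {}' > app.ts
git add app.ts
git -c user.name=t -c user.email=t@e commit -qm 'old workspace'

# One commit that ONLY moves files: Git records it as a rename.
mkdir -p apps/pro/src
git mv app.ts apps/pro/src/app.ts
git -c user.name=t -c user.email=t@e commit -qm 'move into Nx workspace'

# Let git blame skip the move commit when attributing lines.
git rev-parse HEAD > .git-blame-ignore-revs

# --follow walks through the rename, so the pre-move history survives.
git log --follow --oneline -- apps/pro/src/app.ts
```

Note that `git blame` only honors `.git-blame-ignore-revs` when `blame.ignoreRevsFile` is configured to point at it (GitHub's blame view reads the file automatically).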
So, on the question from before: what do you think of Turborepo? Yeah. Turborepo has some features that reach parity with what's integrated into Nx. However, it lacks some of the larger system that is required for an enterprise project: you don't have distributed task execution, for example; you don't have unified tooling; you don't have generators. For me, that makes Turborepo a middle ground between Lerna and Nx: a bit better than Lerna, because you get things like task caching in the cloud thanks to Vercel, but you don't have the full power of something like Nx. So yeah. Yeah. If you compare Turborepo with the other flavor of adopting Nx, the package-based one, how would you compare them? I'm going to give two answers to that: one related to next week's announcement and one for today. For today, Nx requires a bit more consideration and tooling when you set it up. But stay tuned, because it will become even easier to adopt Nx in an existing workspace: they are working on making Nx smart about understanding what your project is, so there is less friction to adopt it. Yeah. Did you have any non-Node.js applications or services that you needed to integrate in this migration, or is Nx only for Node.js-related code? Great question. By default, Nx is agnostic. There is an ecosystem of plugins officially supported by Nx that is very frontend- and JavaScript-oriented, but you can do whatever you want. There are community plugins for Go, for .NET, for Java inside Nx; for a Java project, for example, it will read the pom.xml and understand whatever it can automatically. And one great thing about a polyglot repo like this is that you can say: when my backend changes, rerun the end-to-end tests for the frontend, because they are related.
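That cross-project link can be sketched with a hypothetical Nx project configuration. The project names and target are invented; the `implicitDependencies` field is Nx's way to declare a relationship the import graph can't see (such as a frontend SDK generated from a backend's OpenAPI spec):

```json
{
  "name": "frontend-sdk",
  "implicitDependencies": ["backend-api"],
  "targets": {
    "e2e": {
      "executor": "@nx/cypress:cypress"
    }
  }
}
```

With that declared, a command like `nx affected -t e2e` after a backend change would consider `frontend-sdk` affected and rerun its end-to-end tests.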
Because you can declare that your frontend, say your SDK imports, is related to the backend, since it is linked to the OpenAPI spec. Thanks to this, we trigger everything needed on the frontend. And this is where Nx, or a monorepo in general, shines: it's one context, even if it's polyglot. Unfortunately we don't have more time for questions, so let's close with a big round of applause.
Can we simplify charting libraries?
Alexander has been a React developer since 2018 and he likes creating UIs that are nice. He's going to talk about how we can simplify charting libraries, so a big round of applause for Alexander. Okay, thank you very much everyone for joining. To give you a bit of context about what we'll talk about today: I'm currently working at MUI, which, if you don't know it, provides user-interface components. You might know us because of the Material UI library. We have a kind of tradition: each year we ask users what we can do for them, what we can improve. And the community is quite creative, which has led to other libraries, for example Base UI, which is a headless library. They're very creative: for example, Toolpad is a no-code application we are trying to build. And then there is the team I'm working in, MUI X. We create the most complex components, for example a data grid or a date-time picker, which are a bit more complex than a button or a select. A year ago, we decided to start the charts effort, and this talk is about how we proceeded, what we found and explored, and our current conclusions. From the questions we asked users, what they wanted was nice documentation; that's the main thing they complain about in a chart library. And a developer experience that matches what we usually provide, for example for the data grid. We'll see together whether this is possible. Okay, so I started by just thinking, having a dream: what would be the perfect developer experience I would want? For me, the best one is this: you have a wrapper, you give it the information it needs to know, like what is my size, and each time you want to add an element, you just add a React element inside it. Seems pretty basic; it should be okay. Until you add more data. When you add more data, it overflows, and that totally makes sense, just because the x-axis needs to communicate with the plotting to say: hey, stop after 10.
But if you put larger data, you have another overflow issue, just because your line plot needs to communicate with the y-axis. So I started my journey with a dream and ended up with an issue: components need to communicate in all directions. That's just one example, but it's the main issue of charts: data management is a pain. There is a second one, which is customization. For a button, we more or less all agree about what it can be: you can customize the color a bit, whether the background has a color or not; the most complex thing you can do is add icons, and most of the time they go at the beginning or at the end. But charts have many more elements, and the creativity of designers and mathematicians is endless when it comes to adding annotations. So we need much more flexibility, and currently none of our developer-experience strategies allows for that. So we have two main issues. Time to have a look at the past. These libraries have existed for more than 10 years, so they have a lot of experience to share with us, and it's a pleasure to work in open source, where you can look at why they made a decision and how the code works. Let's start with Recharts. As you can see, it's composition-based; and let's just say up front that composition is a pain. So how did they solve the data-management issue? Basically, you have a wrapper, the line chart, and it looks at its children. The children are just an array of components, and it asks: okay, which one is an axis? Then it extracts all the data from that axis's props to know from which point to which point it can display stuff. It does the same with all the elements that plot data: lines, marks, areas, and things like that. Then it does a kind of aggregation to render the components with the correct properties. The file that does that is a thousand lines long. It's very hard to read; I assume it might be hard to maintain, too.
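As a toy illustration of that scan-the-children aggregation: the wrapper collects every series child, computes the shared extent, and injects it into the axes. All component descriptors and field names here are invented, and the real Recharts logic is far more involved.

```javascript
// Toy version of "the wrapper looks at its children": collect every
// series child, compute the shared y-extent, and inject it into the axes.
function aggregateChildren(children) {
  const series = children.filter((c) => c.type === 'line' || c.type === 'bar');
  const values = series.flatMap((s) => s.data);
  const domain = [Math.min(...values), Math.max(...values)];
  // Axis children receive the aggregated domain as an extra prop.
  return children.map((c) => (c.type === 'yAxis' ? { ...c, domain } : c));
}

const rendered = aggregateChildren([
  { type: 'yAxis' },
  { type: 'line', data: [2, 8, 5] },
  { type: 'bar', data: [1, 4] },
]);
console.log(rendered[0].domain); // [ 1, 8 ]
```

Even in this toy, the coupling problem is visible: the axis cannot render without knowing about every series, which is exactly the all-directions communication the talk describes.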
And when you want to add your custom components, you don't really know where the information comes from, because there is this black-magic aggregation that will provide you some data. And to debug, it's a bit of a mess. But it allows a lot of flexibility. On the other side, with Nivo, you have a much simpler approach. It's a single component. So, for example, if you want a line, you use ResponsiveLine. And you provide data. You can configure what all the axes look like, configure the tooltip, et cetera. Each element has its props and a lot of options. So as I said, it's very straightforward: one chart equals one dataset, which will change according to your user, plus a set of options. But you get two main issues. For example, mixing charts does not really make sense, because you have two single components; you cannot overlap them in an easy way. And you cannot modify the features, because it's a single component with a finite set of options. If the option is not available, you have to go inside the source code to change it. For example, supporting different axes for the left and the right, so having multiple axes for line charts, is not supported. And except by modifying the source code, you cannot do that in Nivo. So it's very nice if you want a simple chart, but once you hit a wall, there is no option. Then there is ECharts, which is pure JavaScript. As you can see, you select an HTML element, for example main, and you run the code. Of course, all the complexity is hidden here. And to give you a bit of a flavor, they kind of fixed the issue we've seen just before: the series can be of multiple types. So you can mix a line chart and a bar chart. You can even put a pie chart in the middle of a line chart. It does not make sense, but for the software, it's okay. And it's an old piece of software, so there are a lot of options, and you can do most of the customization you want. Due to time, I will just skip this. So basically, this is the whole pipeline for rendering a chart.
And the main issue I see with ECharts is this one: the only thing you have access to is still the options object. So basically, you can provide the data and you can customize the options. But as soon as you want to render a custom element, well, if you've ever tried to render SVG just using strings, you know it does not make a lot of sense; you need to have the components. So, now, let's save time. Nice. Just to sum up, we have these two solutions, basically: single components or composition. And as we've seen, data sharing with composition is a nightmare. You can work around it, but you get into the black-magic stuff, and for the developer experience, it's not good. And for adding elements, you need composition, because as soon as you are reduced to options, you don't know how to insert something. For example, Nivo has an array that allows you to reorder the grid, the axis, the plotting. But you know that when you reach the stage where you need to pass an array to order your elements, you will quickly be limited. So, it's time to go to the proposal. Basically, we started with a single component, so it looks a bit like Nivo. You want a line chart, you say LineChart, and you provide data and options. But under the hood, it's composition. So, like for Recharts, you have a wrapper and all the rendering components. If you look closely, you might see that the way props are passed is not exactly the same as for Recharts, and there is a reason. Basically, all the data that needs to be shared and aggregated, so the axes, the series, and so on, is passed to the container. The reason is basically that we want to do this aggregation stuff in a neat way, to say: okay, you're using our components, trust us about how the axes and the series need to interact. You don't need to take care of that stuff. We'll do it for you. And then it's passed to providers, for example a series provider.
It takes care of knowing what a bar series is, what a line series is, what a pie series is. Same for the axis provider and the interaction provider, which will, for example, tell you: the series with this ID is currently highlighted by the mouse, so display it accordingly. So now we can create the rendering part. For example, the bar plot will call the series provider and say, okay, give me the data about the series. If there is none, it renders nothing. If there is some, it asks the axis provider: okay, I have this bar with a value of 24, can you tell me which coordinate I should associate with this value? Then it renders the rectangle, and it communicates with the interaction provider to know if the bar needs to be faded out, highlighted, or just in a normal state. With the same logic, you can create whatever you want: other kinds of series, other kinds of components. So, for example, we created the axes, the legend, the tooltip, the basic ones. As a little story, the reference line was created by a user, just by using the providers. And of course, you can create your own ones, and that's the main success of this approach. So, as a conclusion: a single component, for us, was a need, because most of the time, for example, you just want to put a sparkline or a bar chart in your application very quickly. So you say BarChart, you get a few options, just what you need to get the correct bar chart, and you don't have to care about all this internal stuff, about how all the components communicate together. But as soon as you want to do something very custom, and the charts are part of the heart of your business model, you want it to be exactly as the designer implemented it, or to display very specific stuff. So, you need composition. The main failure of this experiment was the configuration feeling. I absolutely wanted to avoid this aspect of: I give you a bunch of options, deal with it.
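The provider round-trip just described (series provider for data, axis provider for coordinates, interaction provider for hover state) can be sketched as follows. This is a hypothetical illustration with invented interface names, not the MUI X internals:

```typescript
// Hypothetical sketch of the provider layering: the bar plot asks each
// provider for exactly the information it needs.
interface SeriesProvider { bars(): number[] }
interface AxisProvider { toCoordinate(value: number): number }
interface InteractionProvider { isHighlighted(index: number): boolean }

function renderBarPlot(
  series: SeriesProvider,
  axis: AxisProvider,
  hover: InteractionProvider
) {
  const bars = series.bars();
  if (bars.length === 0) return null; // no series: render nothing
  return bars.map((value, i) => ({
    height: axis.toCoordinate(value), // ask the axis for the coordinate
    state: hover.isHighlighted(i) ? "highlighted" : "normal",
  }));
}

// A bar plot rendered against mock providers:
const plotted = renderBarPlot(
  { bars: () => [24, 10] },
  { toCoordinate: (v) => v * 2 },
  { isHighlighted: (i) => i === 0 }
);
// [{ height: 48, state: "highlighted" }, { height: 20, state: "normal" }]
```

The point of the design is that any custom component can consume the same providers, which is what made the user-contributed reference line possible.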
It's not possible, because there is so much interaction between the axes and the series that you cannot split the options up to where they are needed, for example, axis options in the axis and series options in the series. You need to get them all together to do the computation. So you get this configuration feeling, but okay. And the success is that we empower developers to create their own subcomponents. And that is something I've never seen before, except if you go very low level on how to make charts. And to give you a flavor of how easy it is: this is a line chart, and there is a custom component in the middle, this horizontal line, that shows you, for your mouse position, what the value is on the left and on the right. So this component is not very useful, but it demonstrates interaction and axis management. To create it, you need two things. First, the bounding box in red and the mouse position. That's the easy stuff. And then you want what we call a scale. If you use D3, it's the same object: it allows you to convert a value to a coordinate, and what will interest us is going from the coordinate to the value. So, let's start coding it. I promise it will be very quick. useDrawingArea calls the provider that knows where you plot the data, so you get just the bounding box. And useYScale, given the ID of your scale, returns you the D3 scale. Very easy. And that's all you need. After that, it's boring stuff. You save a state, and you do your useEffect for mouse move, stuff like that. You store null if you are outside of the SVG drawing area, so you render nothing. Otherwise, you render a path. So, quickly: you go from the left edge at the mouse position, a single point, and you draw a line across the width. That comes from the drawing area. And then you just have to use the axis scale's invert to get the value from the coordinate. You display it, and that's all. So, you've created a component that is completely custom.
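The scale object at the heart of that custom component can be sketched in plain TypeScript. A hypothetical, D3-like sketch (not the actual hook return value): calling the scale maps a value to a coordinate, and `invert` maps a coordinate back to a value, which is exactly what the mouse line needs.

```typescript
// Hypothetical sketch of a D3-like linear scale with invert, the object
// that useYScale would hand back for a given axis ID.
function makeScale(domain: [number, number], range: [number, number]) {
  const [d0, d1] = domain;
  const [r0, r1] = range;
  const toCoord = (v: number) => r0 + ((v - d0) / (d1 - d0)) * (r1 - r0);
  const invert = (c: number) => d0 + ((c - r0) / (r1 - r0)) * (d1 - d0);
  return Object.assign(toCoord, { invert });
}

// Drawing area: y pixels 300..0 for data values 0..100
// (SVG y grows downward, so the range is reversed).
const yScale = makeScale([0, 100], [300, 0]);
yScale(50);         // 150: value to coordinate, used to draw
yScale.invert(150); // 50: coordinate to value, used to display the label
```

With the bounding box from `useDrawingArea` and a scale like this from `useYScale`, the custom mouse line is just a path plus one `invert` call.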
And it interacts with your chart, and you can reuse it in any other kind of chart you build with us. Thank you very much for your attention. Thank you. Most of the time, people don't know, but there is an option to send feedback about talks. If you have some, please don't hesitate. Otherwise, here are my contacts for later. Are there any questions? We have a few minutes for some questions. Yes. You mean rendering a custom sub-element? Can we use render props to render custom sub-elements? The issue is, for example with SVG, that the order of your components impacts which one overlaps which one. And so the question is: where do you render this element? For example, this line: you can imagine that you put it on top of a line chart and below the mark plot of the line chart. So you need to get access at the JSX level. How do you go from simple mode to complex? You can go from one component, and if you need more advanced stuff, you can compose. There is a single component for all the basic charts, so line, bar, pie, and scatter. And if you want, for example, to compose a bar chart with a line chart, you need to recreate it. But we provide all the basic stuff. Basically, if you open LineChart.tsx on GitHub, you will see a chart container, the different plotting components, the axes; basically, you get between five and ten components to create your own one. How does it use the rest of MUI? It's kind of standalone. We reuse the theme, mostly to be linked with, for example, the tooltip, so that it gets the same color as the background of your application. But otherwise, it's SVG, so there is not that much in common. There is no button, for example, there is no select; we don't really need those user interface elements. It's more the theming and the way components are styled, so that it follows the same developer experience, and you can override the styling.
How is the performance? Have you checked how it behaves when you have a lot of points? No, we did not try, mostly because we are currently using SVG. And so we know that there is a wall waiting for us at a certain scale, just in the time it takes to render the SVG. So we did not care that much. It's part of next year's roadmap. Thank you all.
Building your own JavaScript runtime with Rust
So our next speaker is Leo, who is a developer at Deno, and he's going to talk about how to create a JavaScript runtime with Rust. Big round of applause for Leo. Hello, I'm Leo. As I was just introduced, I work at Deno, and I do various Rust things. At Deno we do a lot of Rust, and we create a JavaScript runtime, but we want other people to be able to use it as well and make their own stuff with it. So we will explain the internals and how you can make a small JavaScript runtime by yourself. But first, what is Deno? Many people still don't know, so better to explain. It's a JavaScript runtime similar to Node, maybe similar to Bun if you've heard of Bun. It focuses on security, web compatibility, out-of-the-box TypeScript support, and just a lot of built-in tools like a formatter, a linter, doc generation. We also have compiling to a single executable, and a bunch of other tools. We are also not 100% fully Node compatible, but we're getting closer and closer by the day, and it's going quite well. And what matters to this presentation is the modular code base. We have a lot of building blocks that can be used individually to build your own JavaScript runtime, just with these Rust crates, or Rust libraries, to make your own one. Without too much effort, actually; we simplified this a lot. Yes, so first off we need to explain the internal structure of Deno, which is that everything is built on Deno Core. Deno Core is a layer above V8, which is the JavaScript engine that powers Chrome and other browsers. And Deno Core is just a small wrapper around it that simplifies a lot of the utilities around it and makes it a bit more friendly to use. It's not always easy to use V8 directly by itself. And on top of that, we have various other functionality that's built on top of that. That's extensions. Extensions are individual libraries that can be used by themselves to implement individual APIs and functionality.
For example, a specific web API, let's say fetch, or a set of related web APIs, is an individual extension. We have an HTTP server, KV, and so on. Basically everything is individual building blocks that can be, not copied and pasted, but imported and just used without too much hassle. Usually adding an extension is like three lines of code, and then suddenly you have a massive amount of extra APIs that you can just use. Then we have Deno Runtime, which is a library built on top of a bunch of extensions that adds a bit more capability, including the permission system, which relates back to us being a secure runtime; we have various permission-based functionality and flags. Another additional feature would be the fact that we define various global scopes and the Deno namespace itself. Also, web workers are only implemented in the deno_runtime crate, because it's just not possible to have them as an extension, since they need to interact with the extensions themselves. And then we have the CLI, which is what we compile and what people use. And the CLI includes the TypeScript support and all the tools of the CLI, like the linter, the formatter, et cetera. And then we have the compile subcommand, as I mentioned before, which compiles to a single executable, plus testing infrastructure, benchmarking infrastructure, and doc generation. We have a fully static HTML doc generator that you can just use, and it will always give a relatively clean output. But what will we build today? We will build a JavaScript runtime that can compile TypeScript, has functionality to make an HTTP request, a console.log, some file system operations like read and write, and deleting a file, I think, as well. And it's all in less than 20 or 30 lines of Rust and JavaScript. We will call it runjs. This will be a relatively technical topic, so there's going to be a lot of code. So be warned.
First, let's explain extensions more in depth. Extensions have various fields and options that can be set. Ops, which I will explain in a moment, are basically declared Rust functions that can be used from JavaScript, so you can just write a Rust function and it will then be callable from JavaScript. ESM is for ES modules, so you can use ES modules with static imports; and dynamic imports work as well, I believe. Maybe not. JS files are just scripts, so not ESM. To include them, it all works differently under the hood, so we have these two separate options. And then deps is a declaration of the other extensions this extension depends on. This is not necessarily needed; it's more of a safety harness. It just makes sure that you actually initialized the extensions in the right order, so you don't forget to initialize an extension that another extension relies on, and then everything breaks and you don't know what's happening. And there are some other, less relevant options like config. js, as I mentioned above, is rarely used nowadays. And then there are lazy_loaded_esm and state, which I'm not going to go into in depth. config just lets you pass some options to a specific extension. If you want to have some special state, you can use the state option. lazy_loaded_esm lets you lazy-load extension code, but that's nothing we're going to look at in depth in this talk. And then ops. So ops are these functions that you can declare in Rust that are then usable in JavaScript. You can just call them like a normal function in JavaScript. And it uses this op2 macro. I hope that's not too problematic of an explanation, what a macro is. I hope everyone knows here. And then basically you define arguments and return types with these special macro attributes, like #[string], and it infers the right type to map it from JavaScript to Rust, and vice versa, depending on the attributes.
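Coming back to the `deps` field for a moment, the safety harness described above can be sketched like this. A hypothetical illustration in TypeScript (Deno's actual check lives in Rust inside deno_core): initialization fails fast if a dependency was not initialized first, instead of failing mysteriously later.

```typescript
// Hypothetical sketch of the `deps` safety harness: extensions must be
// initialized after everything they declare a dependency on.
interface Extension { name: string; deps: string[] }

function initExtensions(extensions: Extension[]): string[] {
  const ready = new Set<string>();
  for (const ext of extensions) {
    for (const dep of ext.deps) {
      if (!ready.has(dep)) {
        throw new Error(`${ext.name} initialized before its dependency ${dep}`);
      }
    }
    ready.add(ext.name); // initialized in declared order
  }
  return [...ready];
}

// Correct order works; reversing it would throw at startup, not later.
initExtensions([
  { name: "webidl", deps: [] },
  { name: "fetch", deps: ["webidl"] },
]);
```

The extension names here are only illustrative; the point is the fail-fast ordering check.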
And yeah, you just write a normal Rust function. For example, here we just use Tokio; Tokio is the async executor that we use in Deno, and most of the Rust ecosystem uses it. And then we just read the contents of the file at the specified path and we just return it. And we return everything in ops as a Result, which is either an error or an acceptable value, because you might want to throw an error, for example, and that is handled under the hood too; you just return an error. There are various other types that can be specified in ops. We have some more ambiguous types like a V8 value, which is just a generic JavaScript value. You can pass that in, and you can manually match and do some more specific handling if you need some weird function that behaves differently based on different types, something which we usually try to avoid; we'd rather have separate functions that do more specific things. But we also have Booleans, all the supported numbers, strings, and array buffers. You can return and accept array buffers, and it's all handled under the hood without issue. It has all been simplified as much as possible to make it as user-friendly, or developer-friendly, as possible, so it's really easy to create your own functionality without too much difficulty. There is also this async defined up top. It makes sure that the function is actually async and that it actually needs to use async functionality when you define it as async. If you don't do anything async, it will usually error out at compile time, because async is just more complication under the hood that makes it less performant to some degree. Then here comes the code. For this example, we're going to define a few ops, Rust function declarations that we can call from JavaScript. We have read file, write file, fetch, set timeout, and remove file.
So in the read file op, as we just saw, we read the file from the given path and return it. With write file, we specify a path and the content, both as strings, and write that to a file on disk. And we return nothing, as per the unit type. And then the fetch one, which might be the most interesting of all of these, basically uses reqwest, which is a Rust crate for doing HTTP requests. If we want to compare it to something in the JavaScript ecosystem, it would be similar to Axios, I think. Maybe not similar in API, but similar in functionality and simplicity. And yeah, we just do a fetch request, get the content of the body via the text method, and then just return the content. Then we have set timeout, which just puts the current thread to sleep for the duration specified by the user via this function. And remove file just removes the file at the given path. However, we use a whole system called V8 snapshots, and I apologize, because it's a very complex topic. Not many people really know what it is or even how it works. But to simplify a lot: you take the current state of the JavaScript execution, and you can store it in a file and resume it later. That's the simplest way to explain it. It's not exactly like that, but for simplicity's sake, let's stay with that. So we need a build script, because we first need to do some setup. So first we initialize our extension. We call it runjs, as we said earlier. And we have this esm_entry_point; I did not mention that earlier, but basically it lets you specify the entry point that the runtime will use when starting up. And we specify our files: we have this esm option, and we have this JavaScript file, which we'll see in just a second. And we have a path defined. We want to get the path of the current build script location, some more specific Rust shenanigans, and we join it with this runjs snapshot file.
It could be any path; we just need a common location where this build script outputs something that we can then retrieve at runtime. Now comes the fun part, which is this create_snapshot utility function that we have made, which does all the snapshotting logic under the hood and tries to simplify it as much as possible. And you have a few options, most of which can be completely ignored. The only three important ones are these: the manifest dir, which we cannot infer automatically, so we ask users to always set this value with the macro call for the Cargo manifest directory. The snapshot path is the variable we defined earlier, for where the output of the snapshot will be. And then we have extensions, which is the extension we created earlier above. And we just want to initialize the JavaScript code. Extensions don't just initialize an ESM file; they initialize ops and ESM. Here we have not defined ops, because this is just the build script. We do not care about ops at this point in time. They will come into play in a moment. First, we also want to support TypeScript. And this is just a small snippet of the code. There is some more boilerplate that is not necessarily interesting; it's just getting the path of the file and the media type of the current file, to be sure that we actually transpile TypeScript to JavaScript and that the file types are all correct. For that we use the deno_ast crate, which is basically a wrapper around SWC. SWC is a Rust library that basically implements TypeScript transpiling as per TypeScript's wants and needs, since there's no real specification, because it's TypeScript. It takes some options: the specifier, which is basically the path or the name of the file that we want to transpile; the source code, so the text info, this structure we create from the code that we read earlier with the read-to-string at the top, from the path that was passed to this function.
And some boilerplate, this media type that I just talked about. And then we just call transpile, and magically we get the transpiled TypeScript as JavaScript, and we can just use it. And then we have the code, and we just create a module source structure, which is how it is represented internally, and we just return it. And this is all in a trait; I guess the best comparison, if you're familiar with TypeScript, is an interface. We implement this trait, and it has a few methods, but only one method is really necessary, and that is the load method, which is what this is. There are a few more lines above, but again, that's just for the media type and some smaller error handling that is not of much interest for this scenario, for simplicity. Then, in the actual main script, we get the snapshot that we created earlier during the build script and include it in the binary itself. And then we have access to this runtime snapshot, and we will use it later on. And then we have the extension, which we initialize again, but this time just with the ops that we defined earlier. And this time we don't need the ES modules, because we defined them earlier and snapshotted them, so they're part of the runtime snapshot from above. And it seems I forgot a slide. I can quickly, hopefully, fix it. This is not well prepared, and I apologize. Let's do it the easy way. This is the JavaScript file with the internals defined. And basically, we import Deno core as a JavaScript module; the core has some utility functionality, again just like the Rust version. This is for interop between Rust and JavaScript. And we destructure the core into ops. ops, again, is an object that can be used to access the functions that we defined earlier, as we see over here, which I hope is big enough, actually. Can the people in the back read it? Wonderful. Is this big enough?
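The JavaScript glue file just described can be sketched like this. A hypothetical, self-contained sketch: the real file destructures Deno core's `ops` object and uses `core.print`; here a mocked `ops` and `print` stand in so the wrapper and console logic can run anywhere.

```typescript
// Hypothetical sketch of the runjs glue file, with mocks in place of
// Deno core. op_read_file normally calls into Rust; here it is faked.
const ops = {
  op_read_file: (path: string) => Promise.resolve(`contents of ${path}`), // mock
};

// Mocked core.print: collect output instead of writing to stdout/stderr.
let output = "";
const print = (msg: string, _isError: boolean) => { output += msg; };

// Console formatting as in the talk: stringify and join all arguments.
const argsToMessage = (...args: unknown[]) =>
  args.map((a) => JSON.stringify(a)).join(" ");

const runjsConsole = {
  log: (...args: unknown[]) => print(`[out]: ${argsToMessage(...args)}\n`, false),
  error: (...args: unknown[]) => print(`[err]: ${argsToMessage(...args)}\n`, true),
};

// Thin public wrappers around the ops, exposed as the runjs namespace.
const runjs = {
  readFile: (path: string) => ops.op_read_file(path),
};

runjsConsole.log("hello", 42); // output is now '[out]: "hello" 42\n'
```

In the real runtime, `runjs` and the console are then assigned onto `globalThis` so user scripts can call them.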
Wait, then let's, okay. Just to quickly reiterate: we have this import of the core and the destructuring into this ops object, which is just used down below here to call this op_read_file, which is the one we defined in the Rust file earlier. And all under the hood it converts the values to the correct types matching the Rust side. And then whatever is returned from this op_read_file, which will be the file content, we just return from this function that we defined in this object constant. Over here above, we also have the console definition, which uses core.print, a utility defined in the core; again, a few more helpful tools. And we just take all the arguments and build the output, which just stringifies and joins all the values. We don't need anything too complex for this example, and then it just prints to the console. And we have the same for error, which just passes true as the last argument, which indicates whether it's an error or not. So above it's false, and below it's true. Then further down we have the other function definitions, which are read file, write file, remove file, fetch; basically just wrapper functions around these ops. Technically this async was not needed, but that's a small side note. And then we have setTimeout, which calls the op and then calls the callback. So it's relatively identical to the web API that we know. And we assign this to globalThis, the global namespace; we also assign the console to globalThis, and we define a runjs object, which is the object we defined above with all these extra small functionalities. To go back to here: we defined this extension again, and the runtime snapshot, and now we have basically all the building blocks ready. We just need to actually use them. And for that we need the runtime. This is again a bit more complex, but basically we define a function that takes a file path.
The file path is the JavaScript file we want to execute, with the user's code that they pass. We have some utilities in Deno core that resolve the path against the current directory and give you a module specifier, because that's what's used internally. The module loader is what is used to resolve a module, and any imports in it, starting from this user-specified file, and we pass our TS module loader. This is the TypeScript transpiler that we built earlier, the structure that we defined, though I did not show all of it because of boilerplate. The startup snapshot is the snapshot that we got earlier from the setup, and then the extension: we need to initialize the ops that are defined, so that Deno Core and the JavaScript file that we wrote can actually access these functions and load them up. And we don't care about any of the other options. And then we have the actual usage, which is this load_main_module. It loads the main module, the entry point. Let's say if you run deno run test.ts, that would be the main module, and then it works through the entire module graph, which is basically all the imports, one by one, recursively. And this is async; a lot of these operations are async, because ES modules are inherently async. And yeah, we evaluate the module, so we basically run it and get whatever output there was, and then we want to run the event loop, because there are going to be multiple polls; with async functions, you may have to do multiple async calls, or just stuff. We have some options that are not of interest: Deno Core includes inspector utilities and pumping the V8 message loop, which again is not of much interest here. At some point or another, we just await this event loop running, and return the value of the result that we were collecting earlier. So out of this run_js function we get the result, which hopefully will be okay, and there are not going to be any errors; but there might always be some error.
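The module-graph walk just described (load the main module, then recursively load every import before evaluating) can be sketched like this. A hypothetical illustration: a real loader resolves specifiers and fetches source; here a plain map stands in for the file system.

```typescript
// Hypothetical sketch of walking a module graph: each module is loaded
// once, and its imports are loaded recursively before evaluation.
const modules: Record<string, { imports: string[] }> = {
  "main.js": { imports: ["a.js", "b.js"] },
  "a.js": { imports: ["b.js"] },
  "b.js": { imports: [] },
};

function loadGraph(entry: string, loaded = new Set<string>()): string[] {
  if (loaded.has(entry)) return [...loaded]; // already loaded: skip
  loaded.add(entry);
  for (const dep of modules[entry].imports) {
    loadGraph(dep, loaded); // recurse into every import
  }
  return [...loaded];
}

loadGraph("main.js"); // ["main.js", "a.js", "b.js"]
```

Note that b.js appears only once even though two modules import it, which is why the real loading is async and deduplicated.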
For example, a user might have used an incorrect variable name or have invalid syntax or something like that. And then we can do a small demo where we, I hope this is going to be big enough again. That's definitely not. We have this example.js file. Here we just call the setTimeout that we defined earlier in the global scope, and then a console.log. Let me make this bigger. No. So, live demos never go perfectly well, but hopefully this should be working. So we then just do cargo run, and we want to specify this input file, which we called example.js. And hopefully this will work. It first needs to compile, and yep, it prints the wait and then the hello world that we call here. Now, this is just a setTimeout; that's not as interesting as, for example, fetch. So I mean, we could just console.log the fetch output. It would be runjs, because we defined this global variable earlier as runjs, down here, and then we want to call fetch. I think we could fetch http://example.com, and since this is async, we want to await it. And again, let's run this, and hopefully we'll get an unreadable wall of HTML output from example.com. It's usually not that long, and yes, we did a fetch request to a remote server. And we had the file system operations, so I could just call await runjs.readFile, and let's read, for example, this file itself. Let me get my terminal quickly. And hopefully it should just print the same output, because we're reading the file itself. Yep, it reads it. And the deleting and writing of files works as well. We're not going to go too deep into that; it's relatively self-explanatory. And yeah, that's pretty much it. I know I went a bit fast. I hope people don't have questions.
There's a QR code for the actual repository where we have this, if people are interested in checking it out. But also, we are always trying to improve the ecosystem and the common problems of the JavaScript ecosystem, and we have actually had problems with the dependency ecosystem of JavaScript and npm, and we decided that someone needs to solve this. As such, we also created a new general-purpose JavaScript registry that will work in any runtime. This was announced a few days ago by Ryan Dahl, and you can join the waitlist at the QR code or the URL. That's it. Are there any questions? Time for one or two questions. Yeah. So: I run this inside a Docker container. I have this input queue of jobs where I send the script that I want to run, and then I just execute it and get the output from it. Is there any downside, as long as I am only sending one single script that it needs to execute? No, I don't see any issue with that whatsoever. It should just work. Again, I'm not too familiar with Docker, though, but that seems like a relatively normal thing to do. Any other questions? What have been your biggest challenges in writing this project? This project has been going on since it was announced in 2018, and we have rewritten our internals many times. For example, extensions were called other things multiple times in the past. We renamed and restructured, not the entire structure of the code base, but there were multiple rewrites, just to be able to have more capability but also performance-wise improvements. Overall, it has been a challenge, but it was something we could always figure out. Rust itself has never been an issue. It's always been relatively good to use. It's not perfect, no programming language is perfect, but Deno was initially started as a Go project, and we switched quickly to Rust, for performance benefits as well. I hope that answers the question. Anything else? Yes? On this one? Yes. Okay.
This could technically have just accepted a u64 directly. It should actually have been a u64 directly, just passed through and not cast, but that was probably just some oversight while writing this code. We cast it because from_millis accepts only a u64, but this is just an oversight. I have one more question. Yes? How does the performance of these custom runtimes or extensions compare to foreign function interfaces? I'm not too familiar with FFI, but we have optimized both FFI and these extensions a lot. Extensions are inherently going to be more performant, because it's not a foreign function. These ops, I guess, if you really look at them, are foreign functions, since you're calling Rust functions out of JavaScript, so there is some plumbing; but these have been optimized so much over multiple years that I would say sync ops are basically, maybe not no cost, but close to no cost. Async functions have overhead due to...
All Things Astro
Hello everyone, so our next speaker is one of my very good friends, a BeJS team member and an Astro core team member as well. His name is Elian and he's going to talk about, guess what: Astro. Hey, wasn't that a surprise? Alright, let me check that I'm not in front of the screen. Okay, hello everyone. Hope you're doing good. I'm doing good, I'm just a little bit tired. I just flew in from Poland via Zurich because I had a conference yesterday as well. So if I sometimes struggle with words, I'm sorry, I'm tired. As said, I'm in Astro core. Astro is this framework; I'll talk about that in a minute. But I'm also in the React Brussels and the BeJS team. I don't know if you've ever been to our conferences; those are here in Belgium. I was actually born here in Brussels and I'm now living in Ghent. Those guys are also the same ones that actually organized this dev room, so maybe let's give them a quick round of applause as well. Yes. And they actually both left, so they have no idea. But that's good. I also do my own meetups in Ghent, so if you live in Ghent or in Belgium overall, you're always welcome at our meetups. They're free. If you want to follow me after this, or want to ask some question that you didn't get time for, feel free to follow me online. It's @ElianCodes on all platforms, so that should be easy. Okay, let's address the elephant in the room. What is Astro? Who has heard of Astro? Oh, wow. That is a lot. I asked the same question yesterday and there were like three hands. Who has actually used Astro? Okay, that's also a lot. Who is on the latest release of Astro? Okay, still good. And who is using Astro professionally? Nice. Okay. No, that's what I was expecting, that's fine. Cool. Okay, so it's personal experience mostly. That's good. Okay, cool. So we call Astro the framework for content-driven development. There are a couple of reasons we say that, and I hope they will be clear to you after the talk.
See it as a framework comparable to Next.js or Nuxt. It's a meta-framework, as we sometimes call them. There is a lot of discussion over whether we should call them meta-frameworks, but let's call it that for now; we can later discuss on Twitter, well, on X, whether it's actually called a meta-framework or not. This is what it looks like. This is the Astro syntax. Basically, everything that you want to write in JavaScript or in TypeScript (we support TypeScript) goes in between the dashes at the top, the frontmatter. That's always server-side; I'll explain that a little bit later. But it's a very familiar syntax. It's basically JavaScript at the top, or TypeScript if you prefer, and below it's just JSX-like syntax. It's not really JSX, it looks like JSX, but you can use class. So it's an improved JSX. Why is it ideal for content-driven sites? Well, it is because it's better for SEO and for meta tags and all of that stuff, because we ship zero kilobytes of JavaScript by default. There are a few catches with that. We were one of the first frameworks to take this approach, but by now we're surely not the only one, and sometimes a different tool fits a given use case better; that's totally fine. If you want to discuss that, we can totally do that after this talk. Think of your traditional framework application approach. You write something in, let's say, Next.js or in Nuxt. It typically looks like this. It doesn't always; we now have React Server Components and stuff, but I'm not going to account for that. All of these components require JavaScript, or TypeScript that compiles to JavaScript. And that is actually really weird, because there is some stuff here that is completely static and doesn't need JavaScript. For instance, the footer: it's just basic a tags, whatever. The header: maybe it's just an a tag that refers to your home page, or an image. Why do I need JavaScript to render an image? That doesn't make sense.
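For readers who have not seen it, here is a minimal sketch of the component syntax being described (the variable name and markup are invented for illustration):

```astro
---
// Everything between the dashes (the frontmatter) is JavaScript or
// TypeScript and only ever runs on the server / at build time.
const title = "Hello FOSDEM";
---
<!-- Below the frontmatter: JSX-like templating, but `class` is allowed -->
<h1 class="title">{title}</h1>
```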
So what we do with Astro is basically compile it all down to static HTML, CSS, and JavaScript if you want to; more on that later. Basically, you have to remember: HTML first. So what if you need JavaScript? You probably want some interactivity, right? You probably want to add a button, a hamburger menu, dropdowns, all of that stuff. What if you need interactivity? Well, of course, that is possible. We have a directive for that, called client:*, and that gives you a few options to control interactivity and tell the compiler when and how to hydrate components. I listed a couple; there are more, but I'm going to quickly go over these. client:only is very easy: it just skips our compiler completely and ships JavaScript, as you would in React. client:media will only hydrate a component when a given media query is met; think of mobile-only hamburger buttons, which don't require JavaScript on desktop because you don't even see them there. We have client:idle, which will only hydrate components when the main thread is idle, when it's doing nothing, so basically free for your CPU. client:load will just say: hey, I need JavaScript, send it to me right away. Then we also have a couple of others, like client:visible, which will only hydrate when a component is actually in the viewport. That makes sense. So what we actually can do in Astro, think of this as the basic HTML page I was talking about earlier: we can ship JavaScript to just a couple of components. Maybe an image slider needs some things, maybe we need some dynamic header links, whatever. We can do that. Of course, we are an open source project, so you can build your own stuff and put that into Astro. And you all know, as developers: if you let them loose, they will come up with weird shit. One of those is the Astro "client when it's raining in New York" directive.
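As a rough illustration of the directives just listed (the `Counter` component is a hypothetical island, not something from the talk):

```astro
---
import Counter from "../components/Counter.jsx"; // hypothetical React island
---
<Counter client:load />    <!-- hydrate as soon as the page loads -->
<Counter client:idle />    <!-- hydrate once the main thread is idle -->
<Counter client:visible /> <!-- hydrate when scrolled into the viewport -->
<Counter client:media="(max-width: 640px)" /> <!-- e.g. mobile-only UI -->
<Counter client:only="react" /> <!-- skip the compiler, ship JS as-is -->
```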
This will, like it says in the name, hydrate your component, but only when it's raining in New York. Cool stuff. Ben of the Astro core team built this; he has an implementation to show off how it works. It's possible, it's fun, it's cool. There is a lot of creativity to be explored here. We call that concept islands, islands basically referring to a component that's completely isolated from your other components. But we come with one twist. We have seen the Astro syntax, but the components that you want on your client side, you can actually build in other frameworks. You say: add React to my Astro website, and then you can use React components inside of your Astro website. Or you want to use Vue, or you want to use Svelte, or maybe both of them together. That is possible. I won't say that it's a recommended thing to do. My thing disconnected here. Okay. It's not a recommended thing to do, but it is possible. But by default, without the client hydration, if you use a React component in Astro, it will still compile down to static HTML at build time. That's basically what makes Astro fast. There is, of course, a lot more. What I showed you now is basically only the static generation side of things. That's the default, but we have so much more. And 2023, that was a crazy year for us. We did a lot of stuff; we shipped three major versions, and we have reasons for that. I'll go over them very quickly. I'll show you what we did and how we improved the life of Astro developers. So in January, I did my first real international Astro talk at JSWorld. Amy, you were there, right? With Omar? Yes. We had just shipped Astro 2. Astro looked completely different from the Astro that it is now. We shipped more than just the features that I'm going to share, but basically these are the important ones. We shipped the new CLI. Our CLI, I think, is crazy. It's crazy good, it's super clear, it's really easy.
It just asks you a couple of questions, and based on those questions it sets up a template for you. A couple of the questions are, of course: do you plan to use TypeScript? Yes. What kind of tsconfig do you want: strict, strictest, loose default, whatever you call it? You can do all of that. And since we are so open source minded, we have also released that as a library, a CLI library on its own, called Clack. That's built by Nate, one of our core members, built in a weekend, and now it's used in different projects and it's actually amazing. Cool to see that there are a couple of different projects that came out of Astro. We shipped content collections. That was actually one of the biggest ones. Content collections give you a type-safe way of working with Markdown, MDX and all the other Markdown flavors, even Markdoc, for instance. This is probably very familiar to you. This is Zod, and Zod is this library that basically validates your data against a schema. That's what you do here. And because that's type-safe, we can also do error checking way better, which I'm going to show you in a minute. This is how it looks. You get all the IntelliSense goodies, all the autocompletion and all of that good stuff. We also added hybrid rendering. And as I was saying, it's super clear: you can instantly see what's wrong. In your blog, the astro-tutorial.mdx frontmatter does not match the collection schema. You instantly know what's wrong. What file is it? Oh, its title is required in astro-tutorial.mdx. You instantly know what's wrong and where it's wrong. You fix it, done. Then we launched Astro 3. I think that was in August, if I remember correctly. We shipped view transitions. View transitions are a super, super cool thing. Who has ever used view transitions? A couple of people, not too many. Who knows about view transitions? Okay, that's a couple more.
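A sketch of what such a content collections schema looks like (a config fragment; the collection and field names are invented for illustration):

```ts
// src/content/config.ts
import { defineCollection, z } from "astro:content";

const blog = defineCollection({
  // Every entry's frontmatter is validated against this Zod schema,
  // so a missing `title` fails with a precise, file-level error.
  schema: z.object({
    title: z.string(),
    pubDate: z.date(),
    draft: z.boolean().default(false),
  }),
});

export const collections = { blog };
```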
What's the reason that you didn't use them? Yell something. Time. Okay. Okay, yes, browser support. I was expecting that one. Yes, it's not supported by all browsers yet, but what we do in Astro is polyfill a little and then it works; at least the basics work. And view transitions, for the ones that didn't put their hand up, actually look like this. So Astro basically does this SSG, MPA-style page. But with view transitions, you can make an MPA, with basically all static HTML files, feel like an SPA with client-side navigation, even though you're not shipping that to the browser. The browser does this all on its own. Really simply explained: it takes a screenshot of your current page and a screenshot of your next page and transitions between both of them. But you can do crazy shit with that, and I brought the demo with me. It's not built by me, but I have it with me. Can I do it like that? Okay, give me a second here. You can all see this? Okay. Switch page. Yes. So as I was saying, browser support is a hard thing, but you can do stuff like this. So this is a multi-page application. Still, when I press North, look what happens. Okay, let me fix that. I wasn't expecting that to happen, actually. Will it work? Yes. Okay, now it's there. So if I go back to the South page, it's basically south.html. Look what happens. All of that animation is coming from the browser. There's no client-side hydration happening here. This is insane. I don't know if you're as excited as I am. Yes, some people. Okay, not too many, it's fine. But still, it also works with the navigation API. So at the top, I don't know how well you know Arc, but at the top I have just the basic forward and backward buttons. That also should work. Yes. That's amazing. Okay, now let me go back to the presentation if I can get that back here. Okay, there we are. And connecting. Yes.
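Enabling this in Astro (the API of the Astro 3/4 era) is essentially an import plus a component in the head; a minimal layout sketch:

```astro
---
// One import, one component in <head> — that's the whole opt-in.
import { ViewTransitions } from "astro:transitions";
---
<html>
  <head>
    <ViewTransitions />
  </head>
  <body>
    <slot />
  </body>
</html>
```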
The craziest thing about all of that is that for you as an end user, well, end users are typically the clients that use the website; I mean for you as a developer using that feature, it's only two lines of code. It's really easy to implement, and we make it so easy to ensure that you have the best developer experience possible. A couple of other things: of course, if you render statically, you don't have middleware, you don't have all of this edge stuff. We added that as well. And the good thing is you can always create faster responses for your users anywhere in the world, wherever they are. But those are always the catchwords with edge stuff, right? It's also a somewhat smaller runtime, so it's a little bit more difficult than that, but you get the point. Image optimization. Images are hard. Can be hard. Can be really hard in the browser sometimes. What we did is release a virtual module, astro:assets. You basically just import your image, just like you would a component, then use it as a source, and it will automatically output an optimized WebP image. But of course, a lot of people came complaining: where is picture? We need picture. We brought Picture, and with it you can specify formats. So if you want to use AVIF, because that's even smaller but not supported in all browsers, you can have a fallback to WebP, which is supported in all browsers, and we'll take care of that for you. So it's really easy to define and optimize the small bits of your website that are lagging behind. Also, we did a major refactoring of our internals, the JSX internals, and because of that we got another 75% performance improvement, which is great. We also brought this. I don't know how many of you are familiar with fast refresh. It's amazing. If you don't see what's happening here, that's good, because then you're living a good life.
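The astro:assets usage just described looks roughly like this (the image path and alt text are invented for illustration):

```astro
---
import { Image, Picture } from "astro:assets";
import hero from "../assets/hero.png"; // imported like a component
---
<!-- Outputs an optimized image (WebP by default) -->
<Image src={hero} alt="Conference hall" />
<!-- AVIF where the browser supports it, WebP as the fallback -->
<Picture src={hero} formats={["avif", "webp"]} alt="Conference hall" />
```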
What actually happens: has anyone ever built a dialog, for instance? You click on it, the dialog opens, then you change some text, and suddenly it's gone again and you have to go through the whole flow again. That's the problem with state. What fast refresh does for all JSX, in our case, is actually retain the state. So while you're typing, the state will update, and you won't have to go through the flow all over again. It's basically a quality-of-life upgrade for you as a developer. Page partials. It wasn't intentionally built for it, but of course we have all the htmx hype, and this is now possible with Astro because of page partials. You ship just one thing: no html tag, no head tag, no body tag, just what you wrote in HTML, and that makes using htmx in Astro possible. Then we have Starlight. Who has heard of Starlight? Fewer people than Astro. Okay. But there were a lot of people who knew about Astro. What is one thing that you can name about Astro that is good? Documentation. I knew you were going to say that; I just said it for you. Starlight is, I want to say, a theme slash library slash framework. It's basically a great theme for Astro. But one important thing is that it ships everything that we have learned from writing the docs for Astro and brings that to a framework for other people. I was actually talking backstage a little bit earlier with Nicholas, and he's using Starlight at work a lot and says it's amazing. You have all these built-in features that are taken care of for you, like search; you can swap that with Pagefind or Algolia or anything you want. Really, it's very pluggable, it's really good. And of course you have all the Astro goodies: you can use React, you can use Svelte, and everything compiles down to static HTML. You can do anything you want. But then we launched Astro 4. Astro 4 is cool. Why?
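A page partial is a one-line opt-out in the frontmatter; a minimal sketch:

```astro
---
// With this export, Astro emits only the fragment below:
// no <html>, <head> or <body> wrapper — which is exactly
// what an htmx swap target wants.
export const partial = true;
---
<p>Rendered fragment only.</p>
```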
We have a dev toolbar now, and dev toolbars are somewhat underrated. In our case, you can see your islands. You can see where your JSX is located; you click on the file, it will open. You can see whether a component is hydrated or not. What are the props? How does it work? You can see all of that right in the browser, without leaving the browser. We also shipped accessibility tools. Accessibility is getting more and more important, and it is important, and that's why we integrated this. Basically you click on the audit tool and it will tell you: oh, an image alt attribute is missing; oh, these ARIA roles are misconfigured. All of that it will just show you. Really easy. It's also super pluggable, open source first: you can just write your own dev toolbar plugin and build it. For instance, we have the Astro Tailwind Config Viewer, where you can see your whole Tailwind configuration inside of your Astro website, inside the dev toolbar. So if you do this well, and there are a lot more features, you can actually do everything inside the browser and never leave it, except for writing code. Then we built incremental content caching. A question I got yesterday, for instance, was: what if I want to use Astro with thousands of pages? Where are the pain points? And there are some, of course. If you use SSG and you're constantly pushing new files, then your build pipeline will be very slow, because it's always rebuilding all of those pages, even though sometimes they never change. If you change one file, why rebuild all the others? That's what incremental content caching addresses: it sees that one file has changed and will only rebuild that file. That makes sense, right? It's still experimental, but of course we tested it, on our own documentation, which is about 3,000 pages, and had a performance gain of 80%. That's a lot.
The improvement is insanely good. And then we also redesigned our documentation with Starlight. Now it looks like this. I don't know if you've ever seen the previous one. It was also good, but it was built in a kind of hacky way. We didn't have internationalization support before and such; we have all that now with Starlight in the Astro docs. It's really great dogfooding for both projects. Then we announced the ecosystem fund. That's a really cool thing that I'm very proud of. What we do is we have dedicated the funding that we get, from GitHub Sponsors and things like that; we dedicated a hundred thousand dollars of it to give to other open source projects that are empowering Astro users. For instance, one of those that got a grant was Lucia Auth. If you've ever used Lucia Auth, it's basically an authentication library. It's framework-agnostic, but they enable a lot of Astro users to build cool websites with authentication, and for that they deserve an award. Well, they deserve at least some money to keep working on it. For instance, we also gave 10,000 dollars to a theme builder. They create themes for Astro and they put out like one theme per month or something. That means that a lot of users get drawn to Astro because there are so many themes, so that really makes it work. Of course, that's not all of it. This was basically a ramble of features and how they work. There is more, and there is more to come. And the question I always get is: but what is next? What is the next thing that we are going to ship? Well, I don't know. We have an open roadmap, so basically you decide. Our users decide. We have an open GitHub repository which is just the roadmap, and you can make an issue there. We'll comment on it, we'll discuss it, then it gets into an RFC, it's accepted, and then we'll actually build the feature. And if you can help with that, that's awesome. Cool.
If you want to stay updated, you can go to astro.build, which is the website. If you want to join our Discord, where we are very active both in development and in support, for questions you can't pose here today: go there. There is probably someone super eager to help you out. That's astro.build/chat. And we also launched a newsletter, actually this week or last week; that's astro.build/newsletter. Cool. Thank you. Questions, or is that not a thing here? If there are none, I did a good job. Did you try creating... a directive that hydrates only when it's raining in Brussels? Yeah, because then it always hydrates. That would just be client-side. I didn't, but I should. You should. It would be easy. It's just an equals true. Big round of applause for Elian. Thank you.
How to Win 1st Place in the Kernel Patch Statistics - Tools and Workflows
The first talk is by Uwe: how to win first place in the kernel patch statistics. Good morning. The sound check still seems good. I'll talk to you about how to get many patches into the kernel. The starter for the talk is the LWN patch statistics that are presented after each kernel release. But actually this shouldn't be your motivation to get patches into the kernel; this is just a nice side effect. It was a good starter for the talk, though. First, about me and my employer. I'm Uwe Kleine-König. I have worked at Pengutronix as a kernel engineer since 2008. I have several jobs in the kernel; I'm the PWM maintainer, but I have contributed patches all over the kernel subsystems. You can reach me via IRC and email (PGP) if you have questions after the talk. If you are interested in the tools I present: I didn't create a repository for them, so if you have questions or want to use the tools, just contact me. My email address isn't listed here, but you should be able to Google it. Pengutronix is a company that has existed a bit longer than I've been with them. We're doing embedded Linux consulting, mostly for German industrial customers. In the kernel, my colleagues and I are listed several times in the MAINTAINERS file. So we're working with our customers in the mainlining business too; we're selling them on the idea that mainlining is a good thing. Yeah. So if you have a good idea of what to change in the kernel, this is the process you have to work with. You put your changes in the end into a mail and send them to the subsystem-specific mailing list. Then, ideally, you get prompt review by the maintainers who are responsible for the code. Then the patches are picked up and end up with Linus Torvalds, who in the end creates a release from them. If you have a big series, you have to apply the same things you have to do for single patches too. This is the usual, or a short, list of things you have to care about.
These are not very hard rules, but this is what I think is a sensible set. Use linux-next as a base. linux-next is the integration tree for the upcoming kernel release. This is a good idea because if you send patches based on what is in Linus Torvalds' tree, you often get feedback that there is already some development happening and that your patch doesn't apply, so you have to rebase. If you use linux-next, this is minimized. Even if you think you are a good kernel developer and don't make beginner's mistakes: use checkpatch. This is a small Perl tool that catches the obvious errors you can make in your patches: you forgot your Signed-off-by, or there are spelling mistakes and such. It's much nicer to have these things pointed out by checkpatch than to send the patches out and have people tell you. The same applies to build testing. Do build tests, ideally on several architectures, because even for trivial patches it's quite easy to break the build. Same reasoning as with checkpatch. For single patches it's already good practice to describe the change well. The idea is that you want the maintainer to understand your motivation and the things you are changing; you want to make it easy for them to apply the patches and to understand the benefit. This is still more important if you do massive patch sending, because you are adding much more burden to the maintainers. Also, address the right people. You don't want to miss the important people, obviously, but you also don't want to annoy the others. I once sent a 600-patch series to the kernel mailing list and several people were annoyed. Don't repeat that. To get a big project, you have to pick something that applies to many drivers.
What I did in the past: the remove callback for SPI drivers returned an integer, but that value is ignored by the core, which resulted in many drivers returning an error code in the expectation that there is some error handling in the upper layers. That is wrong, and it resulted in several resource leaks. The same applies to platform devices. This is my current quest, which is a bit more massive because there are more than 2,000 platform drivers that I have to touch. I am approximately in the middle, so there are still a few more patches to come. I have a few further ideas, but I will come to those when I am done with this quest, because doing more than one such quest at a time is really hard. Usually it is not hard to find something new to patch. If you have touched all platform device drivers, you have seen quite some stuff, and there is always something you can fix. What is very helpful for generating the patches is the tool Coccinelle. It allows you to describe a patch in a very high-level form. For example, this is a small version of a semantic patch where I first try to identify platform drivers that have a remove function that does not return zero, which is the first step before converting them to return void. The syntax is just that you say: okay, I have any expression that is not zero, and I want to patch every remove function of a platform driver, changing the return value from that non-zero value to zero. This is just to find the drivers that are affected by the quest. It is very hard to create a Coccinelle patch that does the right thing for all drivers; there is always some handwork left, for example for indentation, which Coccinelle usually gets wrong. With Coccinelle you then have a tree where all drivers are adapted, but with 2,000 affected files you don't want to commit by hand. You have to apply some shell scripting to make a commit for each file, which I think is the right thing.
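A semantic patch along the lines described might look like this. This is an untested sketch, not the speaker's actual .cocci file, and the SmPL details (initializer matching in particular) may need adjusting:

```cocci
// Rule 1: bind the remove callback of each platform_driver.
@r@
identifier drv, removefn;
@@
struct platform_driver drv = {
	.remove = removefn,
};

// Rule 2: inside that callback, rewrite every returned value to 0,
// a first step before changing the prototype to return void.
@@
identifier r.removefn;
expression e;
@@
removefn(...)
{
	<...
-	return e;
+	return 0;
	...>
}
```

Run with `spatch --sp-file fix-remove.cocci --in-place --dir drivers/` over the tree, then inspect and fix up the result by hand (indentation in particular, as noted above).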
Some maintainers prefer to convert all the drivers in their subsystem in a single patch, but at least for sending out and for review, it's easier to have one patch per driver. So what I do is iterate over all changed files and commit each one. The challenge here is to pick the right subject prefix. In a first approach, I just put the file name there. Then I go over my branch several times and use git filter-branch to adapt the subject prefixes. This depends on the subsystem and how they want it: whether they want a capital or a small letter here, and whether the separator is a colon or a hyphen; you have to check the existing commits for the subsystem to get this right. I have a script that I keep in a scratch file; you see a short part of it, where for some common drivers I can adapt the subject prefix accordingly. This is much quicker than doing it by hand. Then here comes my usual workflow for formatting the patches into mails and sending them out. This is the usual git format-patch call. I always put the patches in a sub-directory that I call w; I don't know what it stands for. Then I have a script that I pass all my patches to; I'll come to that in a moment. I edit the cover letter, which is a quite important part of a patch series, where you have to describe the overall idea of what you want to do and show the benefit of the series. This is, I think, or I hope, the first thing that people will read about my patch series, so it has to be a good description to, again, make it easy for the maintainers to pick it up. Then I edit the list of recipients, add the recipients to the individual patches, and send it out. A thing that is critical for tracking later: for every patch that I send out, I note the message ID I used in the commit. This is important later: if a patch doesn't get applied, I can quickly find the conversation in my mail client to send a ping or ask what's up.
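The per-file commit loop can be sketched like this. The throwaway demo repo below only stands in for a kernel tree that spatch has just modified, and the commit message is a first-guess prefix that would later be fixed up (e.g. with git filter-branch) to match each subsystem's conventions:

```shell
#!/bin/sh
set -e

# Demo stand-in for "a tree with two already-modified drivers".
repo=$(mktemp -d) && cd "$repo"
git init -q
g() { git -c user.email=u@example.com -c user.name=uwe "$@"; }
mkdir -p drivers/pwm drivers/spi
printf 'old\n' > drivers/pwm/core.c
printf 'old\n' > drivers/spi/spi.c
git add -A && g commit -q -m "baseline"
printf 'new\n' > drivers/pwm/core.c
printf 'new\n' > drivers/spi/spi.c

# The actual step: iterate over the changed files, one commit each,
# using the file's directory as a first-guess subject prefix.
for f in $(git diff --name-only); do
    prefix=${f%/*}              # e.g. drivers/pwm
    prefix=${prefix#drivers/}   # e.g. pwm
    git add "$f"
    g commit -q -m "$prefix: convert to a void remove callback" -- "$f"
done

git log --format=%s
```

After this, `git format-patch -o w/ baseline..` (or the real base) turns the per-file commits into one mail each.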
Then I put the commits I sent out in a dedicated branch, near the top, to track all the patches I have already sent. This is the L file. It is generated using the get_maintainer script, which helps you identify the interested persons for a given patch. In the end it's a shell script. Usually you really have to adapt the list of people. For example, if I send out a patch series adapting several SPI drivers, I usually want the SPI maintainer to take the patch series as a whole. What get_maintainer gives you, however, is that, in this case for the Atmel PWM driver, the Atmel maintainers are listed as contacts. One step back: this address-append is another script that takes a list of patches and adds the people passed with -t to the To header of these patches, and the persons passed with option -c to Cc. So what I usually do, this is a PWM series now, is replace, with editor magic, all -t by -c to first have them all on the Cc list, and then I change individual lines back to -t to address just the maintainers. And then I have a longer vim command here to fix the syntax, because as generated it doesn't work: a quote starts here, and the role descriptions from get_maintainer are at the end, and this command just throws away the parenthesized expression and adds the closing quote. Then I can execute it and have all the people on the right mails. What I also do here is add the addresses to the cover letter, to ensure that each person or each list that gets a patch also receives the cover letter, to give the right context. This is also important if you have a patch series with dependencies, where I introduce a helper in the first patch which is then used in the second.
It's a good idea to at least carbon-copy the recipients of the second patch on the patch that introduces the helper, such that they can easily understand the second patch. Here is a short snippet of my git config which is important for sending out, or which I rely on. One: I blind-carbon-copy myself on all patches, to make sure that I have all patches I send out in my index, to be able to reply to them later. The next one is a good idea if you use git send-email: it makes git send-email ask before sending out each mail. If you have a big folder of patches, you don't want to accidentally send them all out; this gives you a chance to look over the list of recipients again and maybe abort if there is a problem. The last setting is important for the notes I added to the commits: if I rebase them to be included in my tracking branch, the information doesn't get lost, because the notes are copied on rebase. For sending patch series out and addressing the right people, it's beneficial to send one series per subsystem. That means: not less; don't mix several subsystems in a single series; and also don't send several series with a similar or the same topic to the same subsystem. This is maybe a bit subjective. Some people, NetDev for example, say: don't send big series; if you have, say, 30 patches, better use two or three series. It's a bit of experience to know this, but in general it's a good idea to do one series per subsystem. To save time and communication overhead, it's a good idea to be explicit about your expectations of how the series should be merged. For example, you can write: I expect this series to be taken by the SPI maintainer as a whole, even if there are maybe one or two patches that don't fit this topic. This doesn't have to be fixed, and people can disagree, but it's better than getting no feedback, not getting your series applied, and then having to ask who will apply it.
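The three settings described might look like this in ~/.gitconfig (a sketch; the address is a placeholder, not the speaker's):

```ini
[sendemail]
	# Bcc yourself so every outgoing patch ends up in your own mailbox
	bcc = me@example.com
	# Ask for confirmation before each individual mail goes out
	confirm = always
[notes]
	# Carry commit notes (e.g. the Message-Id a patch was sent with)
	# along when commits are rewritten by rebase
	rewriteRef = refs/notes/commits
```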
So state your idea, such that people know what you think the best path would be. Another good idea is a slow start. What I mean by that: if you have a patch quest and have to address drivers in 50 subsystems, don't send them all out at once. Start with the first one, pick something actively maintained, and then take the feedback to improve what you send to the other subsystems. So first send out one series, and then you can slowly increase your speed. The effect is that you get better descriptions: people ask questions about what they don't understand, and you can improve what you write to the next maintainers. Good. As I already presented, I have a branch for all patches in my quest. I base it on the latest -rc1 release. This is a bit smoother than basing it on linux-next, where there's much more movement, and it's easier to rebase from one -rc1 to the next -rc1, because it's all linear and you know which patches are really in. Occasionally it happens that a patch gets into next and is dropped again, and in such cases you would lose patches, because they fall out of your tracking branch as soon as they appear below, and if you rebase the next time, they are just missing. My tracking branch looks as follows: somewhere down below there is the -rc1 release, then I have all the patches I sent out, and the few top commits are a collection of the remaining drivers that I have to adapt. That is one commit for all remaining drivers; in this case two such commits, because some drivers are a bit more complicated and are not correctly adapted by Coccinelle, so I track them separately to be able to take the necessary care. The top commit is where I rely on all platform drivers being converted, where I change the remove callback to actually return void, which is only possible after all the changes are made. So it's the top commit, to keep the series bisectable.
What is really useful is the --cherry command-line parameter to git log, which marks all patches with a plus or an equal sign; the difference is that the patches marked with an equal sign are already included in the left-hand side of the expression. Here, the mailbox patches were already applied in next, but not yet in the last rc1, so they are still included in my branch, and the macintosh patches are not yet included, so they get a plus. The work-in-progress patches obviously also have a plus, but that's less important. There is a similar option, --cherry-pick, which lists only the patches marked with a plus in this syntax. This is the one I usually go through when I want to track which patches need more care, which need a ping to make the maintainer act on them. Below each patch I ideally have the marking I added, which I talked about on an earlier slide, and with notmuch, which is a full-text mail indexer, it's quite easy to open a mailbox that contains the mail with the given message ID and all the mails in the same thread. So if I open the virtual mailbox — the threading is broken in a strange way here — I see the patches I sent out, and in this case I can see there was no reply: maybe it fell through the cracks at the maintainer's end, or I addressed the wrong person. I also see it's nearly a month old, so maybe it's time to send a ping and ask whether there are any problems or what the state of the series is. It is very useful to have an easy connection from the git commit to your mail, and the notmuch integration really helps here. Occasionally it happens that you get feedback and have to adapt things that are not yet optimal. In this case b4 is a great tool that I really recommend, even if you're not a maintainer; it's quite handy for collecting the Reviewed-by and Acked-by tags.
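The --cherry marking described above can be tried out in a throwaway repository. A minimal sketch, with made-up branch, file and commit names:

```shell
#!/bin/sh
# Demonstrate `git log --cherry` markers: '=' for a patch already applied
# on the other side, '+' for one still pending.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
echo base > file
git add file
g commit -qm base
git branch -q next              # plays the role of linux-next
git checkout -q -b tracking     # plays the role of the tracking branch
echo one > one.c
git add one.c
g commit -qm "applied patch"
echo two > two.c
git add two.c
g commit -qm "pending patch"
git checkout -q next
g cherry-pick -x tracking~1 >/dev/null   # "applied patch" lands in next
git checkout -q tracking
# '=' : patch already present in next; '+' : still waiting to be applied
git log --cherry --oneline next...tracking
```

The equal sign relies on patch-id equivalence, so the cherry-picked copy is recognized even though its commit hash and message differ.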
Occasionally it happens that you have already restructured your branch; then git range-diff is very useful: you can compare the two different histories, the one you already adapted and the one recreated by b4 from your previous submission, and see the differences — where there are tags and where there are none, and where you changed the code. This really helps to create a single series that has all the improvements you created on both sides. That is what I wanted to present to you. If you have questions, either here in the room or later, don't hesitate to ask me, or after FOSDEM contact me by email or IRC and send me your questions. I'm happy to help you with your next quest. Thank you. We have time for questions. I don't know who was first. — Looking at my sent emails, I have seen a lot of my patches sent to you, because get_maintainer.pl often collects your address. Is it a challenge for you to deal with all these emails, given that get_maintainer.pl, due to your many commits everywhere, often collects your email address? — This is indeed an effect I wasn't aware of. If you touched all 2,500 platform drivers, you get a massive amount of patches in the next few releases. It's not very helpful to send patches to a person who just did cleanup on a driver and has no real interest in it, which also applies to me: I have no interest in some obscure IDE driver that I only touched because it happens to be a platform driver and I changed the remove callback. On the other hand, it's also really hard to trim the list of people you get from get_maintainer.pl. Don't hesitate to keep me on the list; I'm very good at ignoring emails. I just archive some, and it's quite usual, if you send patches to, say, 10 people, that you get no feedback from at least 9 of them. So that's life; I have a very big mailbox, but I can usually handle it. Thank you for your presentation.
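The range-diff comparison described at the start of this answer can also be sketched in a throwaway repository, assuming two versions of the same one-patch series (names are made up):

```shell
#!/bin/sh
# Demonstrate `git range-diff` on two versions of the same patch:
# v2 carries a review fixup, so range-diff marks the pair with '!'.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
echo base > f
git add f
g commit -qm base
git tag base
git checkout -q -b v1
printf 'fix\n' > f
git add f
g commit -qm "spi: fix frobnication"
git checkout -q -b v2 base
printf 'fix\nplus review fixup\n' > f
git add f
g commit -qm "spi: fix frobnication"
# Pairs up the commits of both ranges and shows a diff of the diffs
git range-diff base..v1 base..v2
```

In the real workflow the two ranges would be your hand-edited branch and the series b4 reconstructed from the mailing list.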
— You have described your send workflow, and you're not using b4. Have you talked to Konstantin, who develops b4? Because you have some special needs about Cc handling, tooling, cover letters and so on. — No, I didn't; Mark already knows. I don't use b4 because with b4 you cannot individually change the recipients for the patches in a series. What I like to do, if I have a series that touches, again, SPI drivers, is not send the patch touching the i.MX SPI driver to the Atmel SPI driver maintainer. So the list of persons is really hand-picked: which patch is sent to which parties. And with b4, at least the last time I checked, you can only define the recipients globally, so you have to send all patches to the same set of people. No, I didn't talk to Konstantin. I have little motivation to do that, because my workflow works, and I think my situation is a bit special with these big series. I'm not sure there is a big benefit in extending b4 for it, because for most people what b4 does is the right thing, and the added flexibility for my use case would result in a complication of tracking and usage for everyone, which is questionable, I think.
Streamlining kernel hacking with mkosi-kernel
I'm very excited about this, because this is actually the tool that I've been using to build kernels for a while now, and it has made my life a lot easier. So thank you for that. Daan? — Thank you. So let's talk about kernel hacking. First a little bit about me: I'm Daan, I work at Meta on the Linux user-space team, I'm a systemd maintainer, and I also maintain the tool that I'll be talking about today, which is mkosi. Quick motivation for this talk: a little while ago I started looking into running systemd-journald, which I work on, for individual users instead of just on a per-system basis. But to make this work I actually needed a BPF feature for unix sockets that wasn't available yet. So I looked at the kernel source code and figured this was probably doable myself, and that's how I got into kernel hacking. I eventually figured out the code and wrote up my first patch. I of course had to test it, but there wasn't really a clear, canonical way to test a Linux kernel patch, so I started looking into what I could do. The first thing to fix: if you have a patch, you can't test your compiled kernel on your host machine, because if it's broken you suddenly lose your system. So you need a virtual machine or something similar to avoid breaking your machine. I also wanted to make sure the setup is quickly replicable to any different machine. I started on my laptop, because that's what I do for systemd-journald and it works great, but the kernel is quite a bit bigger than systemd and also compiles a lot slower. So I was quickly looking for a bigger machine with a lot more cores so that my kernels could compile quicker, and it would be very nice if I could replicate the setup very quickly to another machine.
And ideally I'm not too reliant on whatever the host distribution of that machine is, because, well, I work at Meta and we can get very big, beefy servers with a lot of cores to work on, but they might also be running some old version of CentOS without all the latest tools available. So ideally I still get those tools, but on the big beefy server with the old Linux distribution. Of course I want it all to be fast, so that I have a quick turnaround time: notice bugs, fix bugs, recompile everything and boot again without waiting too long. Everyone knows the xkcd with "compiling" and the two guys sword-fighting; I wanted to avoid that. And of course, when you hack on the kernel these days, it's not just the kernel you're working on; very often some user-space projects are involved as well. A good example for file system work is xfstests, which is a separate project. I also wanted to be able to compile all those things and have them available in the virtual machine so that I can run them. Because I work on systemd, and we use mkosi to do all of this for systemd — systemd suffers from the same problems; you can't really test systemd on your own system either, because if it's broken you can't use your system anymore — mkosi is basically my hammer, and kernel hacking is just another nail that I wanted to hit with it. So what is mkosi exactly? It's a tool that Lennart Poettering developed to simplify hacking on systemd: he had all the same issues, so he developed mkosi to fix them. What mkosi does is build you a Linux image: it invokes a package manager and then installs packages.
It packages that up in one of various formats and then allows you to boot it either in a virtual machine or in a container; you do whatever testing you want, and when you're done you just exit the virtual machine and it's like nothing ever happened. mkosi has a general execution flow. Of course we have CLI options, configuration, and all that. We install packages for the distribution, invoking dnf, apt, zypper or pacman for all the distributions that we support. Optionally we set up a boot loader and the like if you're building a bootable disk image. We run various systemd tools that are helpful for configuring an image. If needed we build an initramfs — again, when you're doing bootable stuff. We generate what's called a unified kernel image; this is the new systemd thing that allows you to combine the kernel command line, kernel image, and initramfs in a single file and then boot that directly from UEFI. Then we package up the entire thing as a disk image, and optionally you can boot it in qemu or in a container with systemd-nspawn. So how do you get started with mkosi? This is not the kernel-hacking-specific part yet, but just how to build an mkosi image: you specify which distribution you want and which packages — in this case systemd and util-linux, running on Arch. We have an autologin option to automatically get a root shell in the virtual machine, and then you say: I want to boot this in qemu. That gives you something like this. We support this for Debian, CentOS, openSUSE, Arch, Fedora and Ubuntu, plus a few other distributions that are all derivatives of these. Everything can be specified via CLI settings, as you can see here, but of course we also have configuration files — the systemd-style INI files that we all know and love.
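Such a quick-start configuration might look roughly like this. This is a hedged sketch: the package list is an assumption, since the exact example on the slide is not legible in the transcript:

```ini
# mkosi.conf — minimal image: pick a distribution, a few packages,
# and auto-login as root on the serial console.
[Distribution]
Distribution=arch

[Content]
Packages=systemd,util-linux
Autologin=yes
```

Running `mkosi qemu` in the same directory would then build the image and boot it in a VM.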
So we more or less do the same stuff in both, and you can also specify everything in the configuration file. Now, using mkosi for kernel development, and development in general: what I showed previously just installs packages from the distribution, which of course doesn't really help us. We want to build stuff from source, either systemd or, in this case, the kernel. So you can specify a build script. The build script is responsible for building your software; canonically we call it mkosi.build. When you define it, mkosi picks it up, and it just contains the instructions to build your software — make for the kernel, or meson for systemd. You can specify build packages, which are just the packages needed to run the build script: compiler, build system and so on. You can specify a build directory so that everything is cached; this is important so that your incremental builds are fast. With the build directory we have the build cached, but not the image yet, so we have the Incremental setting for that, which installs all the packages once, caches the result, and reuses it on the next builds, so that our image builds are fast as well. And then we have various settings that you can use to configure the image without invalidating the cache: you can add extra files for testing, configure your shell in the image, or basically anything that tailors the environment to your liking. You can do that with the extra trees and the post-installation script, so that the testing environment is the way you want it. Whatever customization you want, you can pretty much do it. And then we have runtime trees, which use virtiofsd to mount extra directories into the virtual machine: you can make the xfstests source code available for running xfstests, for example, or make your home directory available in the VM if you want that. Whatever you want, with runtime trees.
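A kernel build script in this scheme might look roughly like this. This is a hedged sketch: $SRCDIR, $BUILDDIR and $DESTDIR are the variables mkosi exports to build scripts, but the exact make targets and install paths are illustrative, not the project's actual script:

```sh
#!/bin/sh
# mkosi.build — invoked by mkosi inside its sandbox, not standalone.
set -e
cd "$SRCDIR/kernel"
# compile out-of-tree into the cached build directory
make O="$BUILDDIR" -j"$(nproc)"
# stage the freshly built modules into the image staging area
make O="$BUILDDIR" INSTALL_MOD_PATH="$DESTDIR/usr" modules_install
```

Because $BUILDDIR persists between runs, rebuilding after a small patch only recompiles what changed.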
You can modify the kernel command line in whatever way you want. And we specify the output format as a directory, so that we don't have to build a disk image but can just boot from the directory itself, also using virtiofsd. Why do we want to do that? Because it's faster: building a disk image takes time, and we're after a quick turnaround, so we try to make everything go as fast as possible. So mkosi-kernel is really nothing more than an mkosi configuration in a separate repository that's specific to hacking on the kernel. We have a build script for the kernel, and then various other modules that are all just build scripts for user-space projects related to kernel development. As of this moment we have, of course, a module for the kernel, and then modules for btrfs-progs — because, well, I work at Meta and Meta works on btrfs — the Linux Test Project, which I added for Christian, and some other testing projects like blktests, and bpfilter, which is Quentin's project for hacking on firewalls, so I added that as well. You basically specify which modules you want and then all of those get included. Getting started with mkosi-kernel more or less looks like this. You clone the repository; mkosi is pretty easy to install. You can also install it from your package manager, of course, but it's a pretty fast-moving project, so in this case we install it from source: you just clone the repository, symlink the script to somewhere in your PATH, and that's all you need. You can then run it. By default, mkosi-kernel downloads all the other tools we need on demand, so the only things it needs are Python and bubblewrap and of course the package manager, and that's enough to get started.
Then we clone the mkosi-kernel repository, which contains the kernel configuration, a specific configuration, and then you write a local configuration file that basically says which distribution you want to use with mkosi-kernel. We support Fedora, CentOS and Debian at this point, but it's easy to add more: the only distribution-specific thing is basically which packages you need for kernel development. So you just define the list of packages needed to build a kernel and to boot the system, and that's sufficient to add a new distribution. It would be very easy to add Arch Linux here as well. And then finally we specify the modules and where our kernel sources live; this is what the BuildSources setting does. So your kernel can be checked out anywhere on your system, and then you use the BuildSources setting to specify the source location and the target directory where it should be mounted when we run the build script. The target directory should always be "kernel"; the source directory can be anything, and it will be mounted in the right place. Then we run mkosi and it will do its thing. So I hope this works with the internet here, but I made a video. This is with everything cached, otherwise it would take a little bit too long for this talk. When we run mkosi qemu, we see the image is cached, and then make starts running. The kernel build is of course cached as well, otherwise it would take forever. So not too much happens, but we get a new kernel image packaged up, then mkosi does its thing, then we boot, and then you're running in a VM with the kernel compiled from source; you can do whatever testing you want and then shut down again. So of course, to build the kernel we need a kernel configuration. We ship a default kernel config in mkosi-kernel itself.
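A local configuration along these lines might look as follows. This is a hypothetical sketch: the section placement of BuildSources and the mechanism for selecting modules vary between mkosi versions, so treat the keys as illustrative:

```ini
# mkosi.local.conf — pick a distribution and point at the kernel checkout.
[Distribution]
Distribution=fedora

[Content]
# "<path-on-host>:<target>" — the target must be "kernel" so the
# build script finds the sources in the expected place.
BuildSources=../linux:kernel
# (which extra modules — btrfs-progs, blktests, … — to enable is also
# configured here; the exact key depends on the mkosi-kernel version)
```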
This is just with the minimal amount of stuff enabled to test various things, plus the necessary drivers to be able to boot in a virtual machine. So we keep the drivers to a minimum and the features to a maximum: anything related to kernel development can be enabled, so that it's available and you can use it for testing. We also enable a few debugging options so that it's easier to figure out what's going on; for example, we also configure the kernel command line to panic on oops and the like, so that when something goes wrong during testing you see it immediately, and you don't have to dig through the kernel messages to figure out whether something went wrong. We also allow building the kernel selftests if you want that, and specifically which selftests: you can specify targets, or targets to skip — for example the BPF selftests, because those take absolutely forever to build. You can specify your own kernel config if you want, so you don't have to use mkosi-kernel's default one. And the interesting way we use this minimal config file is the alldefconfig make target, which basically says: take the config file that we specify with KCONFIG_ALLCONFIG, use everything from it, and set every other Kconfig option to its default value. So we specify what we want and give everything else a default value. And finally, while I said that mkosi can build an initramfs for you, building an initramfs is again more work, which means slower builds, which means slower turnaround time. So in this case, because we're building our own kernel anyway, we simply build the virtiofs driver right into the kernel, and that removes the entire need for an initramfs, so we just skip that step completely. As I already mentioned, there are a few useful settings, like runtime trees and extra trees, to customize the image.
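The "defaults for everything else" trick described above boils down to a single make invocation, run inside a kernel source tree (the fragment file name is an example):

```sh
# Everything listed in the fragment file is taken as-is; every other
# Kconfig option falls back to its default value.
make KCONFIG_ALLCONFIG=kernel.config alldefconfig
```

KCONFIG_ALLCONFIG and the alldefconfig target are part of the kernel's documented Kconfig interface.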
Another thing that's useful for file system development is the QemuDrives and QemuArgs settings. To add extra block devices to a VM with qemu, you need both a drive, which is the host-facing side of it, and of course a device, which is the guest-facing side of it. mkosi can allocate the drive for you, backed by a file that it creates itself on the file system and removes when the VM shuts down; that's what you can do with QemuDrives. You can specify the serial or the drive ID, you can specify the size, and you can specify any extra qemu options you might want — in this case we specify that asynchronous I/O should be done using io_uring. And then of course you need to attach the drive to an actual qemu device: in this case we specify an NVMe device, give it a btrfs serial, and point it at the drive whose ID we named btrfs, the same ID we gave the drive. Like I said, we can configure the kernel command line, and if you want to do boot-loader work — hacking on the EFI stub code or anything related to that — you can also specify that we should boot in a UEFI environment. All the stuff I mentioned still works: usually with qemu you have your -kernel argument, your -append and your -initrd, which you use for kernel development, but when you start doing UEFI you might not have all of that available anymore. What mkosi does is set things up so that even if you're booting in a UEFI environment, everything works the same: we might not use -kernel directly anymore, we might be booting from a disk image, but you can still append to the kernel command line and all that — it's all still supported. You can get extra shells in the image as well: of course you get the serial console, but if you want extra shells you can do that with mkosi ssh.
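The drive setup described above might be written like this. A hedged sketch: the drive ID, the 8G size and the option strings are examples, and the exact QemuDrives syntax differs between mkosi versions:

```ini
# Host-side VM settings: allocate a scratch drive backed by a temporary
# file, and attach it to the guest as an NVMe device.
[Host]
QemuDrives=btrfs:8G::aio=io_uring
QemuArgs=-device nvme,serial=btrfs,drive=btrfs
```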
You also have to enable the SSH option to make sure the image gets configured for this, but we do that by default in mkosi-kernel. There's a very complicated diagram here that basically shows how we implement this in systemd, but the interesting thing about mkosi ssh is that you don't need your VM to have a configured network to be able to do this. For VMs there's an alternative socket family called AF_VSOCK, which allows host-guest communication that doesn't rely on a network interface being up, running and configured. So using a bunch of new systemd features, what we're able to do is provision the virtual machine at runtime with your SSH public key, so we can put it in the authorized_keys file for the root user. And if a vsock device is attached to the VM, then in the next systemd release, systemd will basically be able to automatically detect that a vsock device is attached, and if so it will generate a socket unit that runs sshd on port 22 of the AF_VSOCK family, and this allows you to connect to the VM over vsock from the host without needing a network. We also install a drop-in file for the host's SSH configuration — SSH now supports drop-in configuration — and we use the SSH proxy support to basically take possession of the "unix/" and "vsock/" host-name prefixes, so that you can use those to connect to vsock-enabled VMs. So with all this set up, you can basically do ssh vsock/<vsock connection ID> to connect to that specific virtual machine, all without going over the network. We don't use this stuff in mkosi yet — we have our own version, because the systemd implementation is very recent — but we'll be moving to it in the future once it's available everywhere.
Running tests manually is all good and fine, but of course you want to move from manual testing to automatic testing, so we also support this. When you want to do automatic testing, you want to run the test and get an exit status. Usually this is very simple with a process: you run the process in your shell and get the exit status from the kernel. When you run the test in a VM, this gets a bit harder: there's not really an easy way to get the exit status of a process that runs in the VM and transfer it back to the host. If you're booting from a directory with virtiofs, you can just write some files to the directory and retrieve all the information that way if you want, but if you're doing testing from a disk image, then you have to mount the disk image once the VM shuts down to access the information — and of course, to mount a disk image on Linux you need root privileges, so you have to start entering your password and such, and it all becomes a bit more complicated. So what we added instead, again using the vsock stuff, is a way to have the VM report its result when it shuts down. You use the two unit settings SuccessAction=exit and FailureAction=exit in a systemd unit; when that unit exits, systemd will also shut down the VM, but it will use the sd_notify protocol — the systemd mechanism for sending notifications — to send the exit status over vsock from the VM to the host, and mkosi can pick up on this and exit with that exit status. So it seems pretty trivial to get the exit status, but there's a bit of work involved to get it out of the VM. And then of course, what we also want is the logs. This isn't actually upstream yet, but we're looking to add another forwarding mode to systemd-journald so that, again using vsock, it can forward logs over an AF_VSOCK socket, and then we can listen on the host, receive those forwarded logs with systemd-journal-remote, and write all the logs to a local
directory. That means we can access the logs on the host without needing root privileges: we don't have to mount the image, we just have the logs locally, we run journalctl on them, and we can see what went wrong with the test and debug further. Of course, I'm not the only project in this space; we do have some competition. The latest tool in this space is virtme-ng, so I thought I'd mention it as well, because I don't want to claim everything for myself — there are more tools than just mkosi-kernel, so definitely take a look at virtme-ng as well. virtme-ng is very focused on kernel development, so it has a lot more options to, for example, use the kernel from the host, and various other options, but it's very specific to kernel development. It also has its own init system that runs in the VM, which allows it to boot very fast, but you don't get a regular Linux system like you would otherwise — I mean, I don't want to claim that systemd is "regular", but you don't get systemd. So if you wanted to start doing stuff with devices or something like that, that definitely won't be running, so it gives you a somewhat more limited environment. Depending on what you're doing, one or the other might be more useful. Yeah, that's more or less it on the comparison; if you want to know more about this, come talk to me afterwards or something, and I can say a bit more about the differences between the two. Of course, I'll end with some reactions from users: Christian already said he was using it, so his reaction to it is very nice as well, and Josef from Meta, the btrfs file system maintainer, is also using it and is also very happy with it. So I hope it can be useful for more people than just them, so please give it a try, and I'm happy to answer any questions or implement more features if needed. Thanks for listening. — Hello, thanks for the talk. Two quick questions. First: what about cross-compiling? So
that works. We don't have a specific environment variable in the build script yet that allows you to specify a cross-compile, but we can simply add that. I already tried it by hacking the build script and changing the architecture to compile for arm64, and that works. Christian also — or I'm not sure who added it — we also added support for compiling with LLVM if you want to. — And a second small question, maybe I missed it because I was late for the talk: what gets into the initramfs — modules, firmware and all that? — mkosi-kernel by default doesn't boot with an initramfs when we do the virtiofs stuff. If you build a disk image, then the initramfs is built with mkosi itself — I actually have another talk about this in the distributions devroom — but yeah, we just install regular RPM packages or whatever into the initramfs, and by default we just copy all the kernel modules and firmware from the host. But we have a suite of settings to basically include and exclude whatever you want, and we also have the kind of thing initramfs generators do, to include everything that's loaded on the host if you want that. So you can configure a bit which firmware and drivers you want; we'll also make sure, if you specify modules to be included, that we pick up all their dependencies as well, so that all of that is set up correctly and included. — I'm using the initramfs stuff: I'm building full images and I'm not using the qemu part — I'm using a different virtual machine manager for this — and it works really nicely, because that was the biggest thing for me: it wasn't easy to build an initramfs, especially distribution-independent, which was really annoying. — Is this also useful if you want to run a mainline kernel on a new device where only some heavily patched vendor kernel exists? So you want to test if your drivers work, and so you
want to test it, but you don't want to touch any non-volatile memory — just start it somehow, with fastboot boot or something like that. — Sorry, I don't think I completely heard the question. — This was all about mainline: if you want to test whether the kernel works on a new device where only vendor kernels are known to boot, you don't want to destroy the user space there; you want to test first before you touch the storage, and boot only from RAM. Can it also be used in that way? — It's very focused on virtual machines at the moment — well, mkosi-kernel specifically is, but mkosi can build images that you can then deploy on another device. So you can run the stuff that is produced by mkosi on your laptop, or you can flash it to your disk and it will boot. But specifically booting without destroying the user space — we don't have anything specifically to make that work. You could take the kernel produced and keep the user space the same, but it's not something I've really looked at before, so it probably won't work out of the box. — All right, I think if there are no more questions: thanks for your talk, and thanks for the tool.
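As a closing illustration of the exit-status mechanism described earlier in this talk, the two unit settings might be used like this. A sketch only: the unit and script names are made up, while SuccessAction= and FailureAction= are the settings the talk names:

```ini
# run-tests.service — when this unit finishes, systemd shuts the VM
# down and the exit status travels back to the host over vsock via
# the sd_notify protocol.
[Unit]
Description=Run the test suite, then take the VM down with its result
SuccessAction=exit
FailureAction=exit

[Service]
Type=oneshot
ExecStart=/usr/local/bin/run-tests.sh
```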
Converting filesystems to support idmapped mounts
Hello, my name is Alex, I work for Canonical. I have the pleasure of working on the LXC project and doing a lot of container stuff in the kernel and user space. We have been working on idmapped mounts and support for them in some file systems, together with Stéphane and with Christian. So today I'm going to talk about the problems that we faced when we started to actually look into the network-based file systems and how to support idmapped mounts for them, because it's kind of hard sometimes. First of all, I'm not sure that everyone knows everything about this stuff, so I want to give some intro about how it works currently. And if anybody here was listening to our previous talk about the isolated user namespace stuff, please forget that for the next 30 minutes, because that was a new feature; this talk is about stable API that we have had in the kernel since, I guess, 5.11 or something. So this is more about supporting more file systems; we don't do the isolated user namespace stuff here. First of all, we need to understand that we have three types of ID mappings in the kernel. The first one is the caller's ID mapping, which is effectively taken from the current user namespace. The user namespace is attached to the credentials: you can get the pointer to the user namespace from the struct cred, and you can get the pointer to struct cred from the task_struct. So if you're calling any kind of syscall in the Linux kernel, you have a current task and so you can get the current user namespace; we have a macro in the kernel to get that. And even if you're not doing any kind of container stuff, even if you're not using user namespaces, you're always invisibly using this, because you're using the default mapping, which looks like "0 0 <this big number>", where the big number is effectively the largest unsigned integer. And what does this mean? The first number is the user ID inside the user namespace.
The second number is the user ID outside of the user namespace, and the third is effectively the length of this mapping. So this default mapping is the identity mapping, which means that we effectively map zero to zero, one to one, and so forth. The next thing that we have when working with any kind of VFS stuff is the file system's ID mapping. It's also represented as a user namespace, because it's the thing that we attach to the super block of the file system. So when you create a new mount — let's say, for example, for an ext4 file system — you have a block device, you create a new mount, and if it is the first mount for this file system (not a bind mount, I mean), then the super block gets allocated, and on the super block structure we have a field called s_user_ns, and this field gets filled with the current user namespace. So when you do a mount, it takes the current user namespace from your current task and puts it into the super block. And that's the file system's ID mapping, which means that if you are, let's say, inside a container with some user namespace and you do a mount, your super block will get this user namespace — effectively your container's user namespace. And that's pretty old stuff, actually, because I believe it dates from the beginning, when user namespaces were introduced many years ago. And the third thing we are talking about today is the mount's ID mapping. The mount's ID mapping is a slightly more high-level concept, because instead of being attached to the super block, the ID mapping is attached to the mount. So it means that you can, for example, create an ext4 file system on top of some block device, then do a bind mount, and you can do this bind mount with some ID mapping attached to it. And once any kind of I/O goes through this idmapped mount, you get an extra UID/GID translation layer inside the VFS, inside the generic VFS code.
And then all of that goes through the file system's ID mapping, and then it gets written to the disk. So that's how it works. It's important to mention that whenever you interact with the kernel from user space, if you use any syscalls like stat, getuid, or getsockopt (for instance with the SO_PEERCRED option, which allows you to get the PID and UID/GID of the peer socket), you will get these values mapped in accordance with your current user namespace. So the caller's ID mapping is always taken into account everywhere in the kernel. Those are effectively all the examples; we also have the same thing in the /proc/<pid>/status file and all that stuff. So let's take a look at what happens when you call, for example, the getuid() syscall, which is probably the simplest one. Inside the kernel we have a few helpers to convert between the user ID that we work with in user space and the internal representation of the user ID inside the kernel, because inside the kernel we have two types: uid_t and kuid_t. uid_t is effectively the user-space one, because it's just a 32-bit thing. And kuid_t is also a 32-bit thing, the same size, and usually they contain the same value, but kuid_t is the value that represents the user ID always in the initial user namespace. Which means, for example, that if you are inside a container with a user namespace, and you have, let's say, user ID zero inside the container, and the corresponding user ID on the host is, let's say, 1000, then the kuid will always have the value 1000. But once you call the getuid() syscall from the context of a task that runs inside the container, inside this user namespace, the function from_kuid_munged() will be called. And the first argument of this function is the current user namespace, which is effectively the thing that represents the UID mapping.
And the second argument is the current UID, which will be the kuid_t value equal to 1000. And this from_kuid_munged() function will effectively try to remap this host-visible value 1000 to the appropriate value inside this specific user namespace. It will be zero in our case, because, as I have explained, in this case we have a mapping of zero inside the container to 1000 on the host. And so you finally get zero, yeah? And this function has a pair function called from_kuid(). The difference between these two functions is that from_kuid() is more of an internal one. If we fail to represent the internal kuid in terms of some user namespace's UID range, the from_kuid() function returns minus one, which means that something is terribly wrong: we can't really represent that ID inside this user namespace, which is possible. For example, if you have a user namespace that maps only, say, host 1000 to zero, and you have a user ID of, let's say, 2000 on the host, you can't really represent that as any reasonable value inside, right? And if you call from_kuid(), it will return minus one. But the function from_kuid_munged() does a trick: if from_kuid() returns minus one, it takes the overflow UID and returns that. That explains this interesting behavior where, if you try to access, for example, a container's file system from the host, or from anywhere with another ID mapping, you see this strange "nobody" user. That's because this function is used everywhere, because we can't really give user space this minus one; user space always expects us to give a normal, reasonable user ID. And we also have a helper called make_kuid(), which effectively does the opposite thing. It takes a user-space UID and creates the internal kernel representation of it. In the same way, we need to plug the current user namespace, the current ID mapping, into this helper, and give it the user-space value.
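The from_kuid() versus from_kuid_munged() behavior described above can be sketched like this (an illustrative Python model, not the kernel implementation; mappings are modelled as `(inside_first, outside_first, count)` triples, and 65534 is the kernel's default overflow UID, shown to users as "nobody"):

```python
# Sketch of the semantics described in the talk (not kernel code).
OVERFLOW_UID = 65534  # default /proc/sys/kernel/overflowuid, shown as "nobody"

def from_kuid(ns_map, kuid):
    """Translate a kernel-internal kuid into this namespace's view.
    Returns -1 if the kuid is not representable, like the kernel helper."""
    inside_first, outside_first, count = ns_map
    if outside_first <= kuid < outside_first + count:
        return kuid - outside_first + inside_first
    return -1

def from_kuid_munged(ns_map, kuid):
    """Like from_kuid(), but never fails: fall back to the overflow UID."""
    uid = from_kuid(ns_map, kuid)
    return OVERFLOW_UID if uid == -1 else uid

container_map = (0, 1000, 1)      # inside 0 <-> host 1000, length 1
assert from_kuid(container_map, 1000) == 0                    # representable
assert from_kuid(container_map, 2000) == -1                   # not representable
assert from_kuid_munged(container_map, 2000) == OVERFLOW_UID  # "nobody"
```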
And that's what happens inside the setuid() syscall. If you pass, let's say, the value one to that syscall inside the container, it will do make_kuid(current_user_ns(), 1). It will go through the UID map and try to find what this one maps to. And if it fails to do that, then we get EINVAL, and so setuid() will not allow us to set this UID, because it's not mapped. But if you have a mapping like "0 1000 2", which means that you have mapped zero and one, then it succeeds, because the kuid for that thing will be 1001 in the kernel, and it will be represented like that everywhere — until you do getuid() or something like that. Now, what do we have for file systems? For file systems it's about the super block ID mapping, right? We have two important helpers. One helper effectively takes the inode and gets the user-space visible UID, the normal UID. This function is called i_uid_read(), but in fact it is called on the write path. There is no mistake, that's perfectly fine, because we are reading the i_uid value from the inode — that's why it's "read", because we read this value from the inode. But of course it's called on the write path, because when the file system driver wants to write the UID to disk, or let's say send it over the wire (for network file systems it works like this), we need to call this to get the properly remapped user ID that we can then send over the wire, put on the disk, and forget. And we have a second helper called i_uid_write(), which does the opposite. It takes the inode, it takes the user-space visible, normal, classical UID that we are supposed to work with, and does the same thing we have seen in the setuid() syscall: it calls the helper make_kuid(), but instead of taking the current user namespace, it takes the user namespace from the super block. And the second argument is the value.
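The make_kuid()/setuid() behavior just described can be sketched as follows (an illustrative Python model, not kernel code; the mapping "0 1000 2" and the resulting kuid 1001 are taken from the example in the talk):

```python
# Sketch of make_kuid()/setuid() semantics from the talk (not kernel code).
INVALID_KUID = -1  # stands in for the kernel's INVALID_UID

def make_kuid(ns_map, uid):
    """User-space UID -> kernel-internal kuid, or invalid if unmapped."""
    inside_first, outside_first, count = ns_map
    if inside_first <= uid < inside_first + count:
        return uid - inside_first + outside_first
    return INVALID_KUID

def setuid(ns_map, uid):
    """setuid() refuses UIDs that are not mapped in the caller's namespace."""
    kuid = make_kuid(ns_map, uid)
    if kuid == INVALID_KUID:
        raise PermissionError("EINVAL: uid not mapped in this user namespace")
    return kuid  # what would end up in struct cred

# Mapping "0 1000 2": inside 0 and 1 are mapped to host 1000 and 1001.
assert setuid((0, 1000, 2), 1) == 1001
try:
    setuid((0, 1000, 2), 5)     # not mapped -> rejected
    assert False, "should have failed"
except PermissionError:
    pass
```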
So let's say you create a file on the file system, at first, from user ID one: it will take the value one and plug it in there, and this kuid will be written into the inode's i_uid field. And finally we're getting to the point where we can look at the whole picture, how it all works together with the mount's ID mapping. Okay, imagine that we have a caller with UID 1000, and this caller wants to create a file on an ID-mapped mount. And we have these three ID mappings in place. We have the caller's ID mapping, which is what we have been discussing just now. We have the file system's ID mapping, which in this specific example is the identity ID mapping that maps zero to zero, one to one, two to two, and so on. And we have the new thing, the mount's ID mapping, which effectively maps zero to 10,000 and has length 10,000. So we have 10,000 UIDs mapped with this shift — the second number is effectively the shift value. So zero goes to 10,000, one goes to 10,001, and so on. And what will happen in the kernel in this case, once we try to create the file? First of all, we will create the internal representation for the user ID 1000, which will be 11,000, right? A small remark: in the kernel, to be honest, we work with this kuid thing all the time. It means that technically, when you call file system syscalls like, let's say, open() with the O_CREAT flag, this first step is not going to happen, because we already have these values in struct cred. But it's easier to think about it like that, just to understand how many different mappings we have in play here, right? And the second thing is that we need to apply this new concept, the mount's ID mapping, right? We need to take the mount's ID mapping and effectively perform the reverse operation.
We call from_kuid(): we take the value that we got from the caller's ID mapping, and then we do the mapping in accordance with the definition that we have. In this case, we remap the kuid 11,000, and what we get is 1000, right? Which is obvious. And then, once we want to create the file on the disk, we need to get the i_uid back, right? So we need to go through the file system's ID mapping, which is attached to the super block, to get the i_uid that will be written to the disk. And in our case, fortunately, we have the identity file system ID mapping, which means that, okay, we have user ID 1000, it goes to 1000, that's all. But let's think about another example. If the file system's ID mapping is, for example, u0:k1000 with length one, we can't remap that value, because u1000 is not in the range of this mapping. But for a second example, u1000:k0, we can remap, because the corresponding kernel ID will be zero. So in the first case we can't. And what happens if the generic VFS code realizes that it cannot remap the value? It gives you the EOVERFLOW error. So that's the reason why you can get an EOVERFLOW error when you're working with ID mappings — and not only then. Even if you're not using ID-mapped mounts, if you're using just normal mounts and you try, for example, to write to a mount from another user namespace, with another caller ID mapping that is incompatible, in terms of user ID ranges, with this mount's file system ID mapping, you can get this EOVERFLOW error. So that's really complicated behavior, but that's how it works; we have no alternatives, actually, right? So, you can create ID-mapped mounts using effectively these two options.
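The whole create path from the example above can be sketched end to end (an illustrative Python model; it assumes the caller's user namespace also maps 0 to 10,000, which matches the 1000 → 11,000 step in the example, and the helper names only mirror the kernel's):

```python
# End-to-end sketch of the create path described in the talk (not kernel code).
# Mappings are (inside_first, outside_first, count) triples.

def to_kernel(m, uid):          # like make_kuid(): ns view -> kernel view
    i, o, c = m
    return uid - i + o if i <= uid < i + c else None

def to_ns(m, kuid):             # like from_kuid(): kernel view -> ns view
    i, o, c = m
    return kuid - o + i if o <= kuid < o + c else None

def create_file_uid(caller_map, mount_map, fs_map, caller_uid):
    kuid = to_kernel(caller_map, caller_uid)   # caller's ID mapping
    shifted = to_ns(mount_map, kuid)           # reverse through mount idmap
    if shifted is None:
        raise OSError("EOVERFLOW")
    on_disk = to_kernel(fs_map, shifted)       # filesystem (sb) ID mapping
    if on_disk is None:
        raise OSError("EOVERFLOW")
    return on_disk                             # the i_uid written to disk

caller = (0, 10000, 10000)   # caller's userns: 0..9999 -> 10000..19999
mount  = (0, 10000, 10000)   # mount idmapping: zero shifted to 10000
fs     = (0, 0, 4294967295)  # identity filesystem mapping

# UID 1000 -> kuid 11000 -> back through the mount idmap -> 1000 on disk.
assert create_file_uid(caller, mount, fs, 1000) == 1000
```

Swapping the identity filesystem mapping for the u0:k1000 example from the talk makes the last step fail, which is exactly where EOVERFLOW comes from.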
We already have a new feature that allows you to use the classical util-linux mount utility to create an ID-mapped mount, but in most distros I don't think it actually works right now, because it's too recent — it's like one year old or something like that. So I'm always using Christian's mount-idmapped utility to create ID-mapped mounts. And internally, it just uses the syscall called mount_setattr() to set the ID mapping on the mount. You always need to specify this attribute with a user namespace file descriptor. So, at least these days, we're always getting the UID mappings and GID mappings from a user namespace, because for user namespaces we already have a way to set user ID and GID mappings from user space, using the proc files, right? That's the reason. So currently we have support for all of these file systems, but if you look at the list closer, you will notice that most of them are local ones: ext4, btrfs, XFS and so on. And recently we have been working with Christian and Stefan on the CephFS support. Christian did the major work a few years ago and created the first implementation of it, but unfortunately it got lost in discussions and wasn't merged, so I asked for permission to continue the work on it, because it was quite important for our container applications. I did some rebasing, and we also decided to use a slightly different approach to make it work; I will explain that a little later. So starting from 6.7 you can use ID-mapped mounts with CephFS, and yeah, CephFS is the only network file system in this list, so. How do you port a file system? The very naive way to do it is to just go through the file system's code and find all the places where we have nop_mnt_idmap, which means that no mount ID mapping is defined, so there is no ID mapping.
You replace it with the idmap argument, which is passed to almost all the VFS API functions from the generic VFS code. And then you also replace current_fsuid(), which gives you the kuid of the current user, with mapped_fsuid(), which does the same but takes the ID mapping into account. And you also set the FS_ALLOW_IDMAP flag in the file system's definition. But no, it's not that simple, because you need to be really, really careful with this stuff, otherwise you can really break things, or even open up some vulnerabilities or something like that. For that reason, I would suggest that if you want to try porting some file system to support ID mapping, especially a network one, you should go through the code of ext4 as a really, really good example, because the ext4 file system is a very complex one. It has many features. For example, you can put overlayfs on top of it and use ext4 as one of the layers for overlayfs. And, for example, the rename callback in ext4 supports a really interesting rename mode called RENAME_WHITEOUT. Usually, when you rename a file, it disappears from the old place and appears in the new place, right? But in this mode, in the old place where the file is supposed to disappear, it creates a so-called whiteout, which is effectively a character device with major and minor numbers zero. And that mode is enabled only when the rename is called from overlayfs. And I guess that it is only for this reason that this rename callback in the VFS takes the ID mapping as an argument, because in all the other file systems, where we have no support for that, we can't really use this ID mapping in any case, because we don't need one.
Yeah, you also need to pay attention to getattr, because getattr is effectively what gets called in the file system driver when you call the stat() syscall: getattr reads the attributes and fills the kstat structure in the kernel with all the data, like the size, the user ID, the GID stuff and all that. And you will definitely need to take the ID mapping into account in this place, to get the proper user IDs and GIDs reported to user space, right? There is also the permission callback, which effectively does all the permission checking in the kernel, so you need to properly pass the ID mapping there too. If the file system that you want to convert uses the generic_permission() helper, then you just need to pass the ID mapping, check that everything really works, and that's pretty much all. But sometimes it's not the case, because some file systems — we'll see that later — use really, really weird machinery to check permissions. And there is also the get ACL stuff; that's pretty much all for the read code path. For the write path, the most important pieces are, obviously, the places where we are creating new inodes, right? So that's mknod, symlink, mkdir, atomic_open and create. We need to take the ID mapping into account in all of these places, because we actually write the UIDs and GIDs there. And setattr, which gets called from, for example, the chown() syscall, right? As the chown() syscall takes user IDs and GIDs from user space, you need to properly remap them and write them to the attributes. So, for local file systems, as I said, you really need to take ext4 or btrfs or something, just carefully read the code, be absolutely sure that you understand how it works, and then go to the other file system that you want to support. Now, which problems can we have — and do we really have?
First of all, some file systems, especially network ones, obviously do the permission checking on the server side, which is really bad, because what we want with ID-mapped mounts is a local feature of the Linux kernel. We don't want to tell the file system's remote server to be aware of this crazy, interesting Linux-specific stuff, because theoretically the user may be on another operating system, right? So if the file system does some UID/GID-based permission checks on the server side, it means that we would need to extend the on-wire protocol, pass all of this ID mapping stuff over the network, and write some logic there, so that usually doesn't work. It's effectively the same for the FUSE file system, which is not a network one, but is almost the same as the network ones, right? Because you have a user-space daemon and you have the kernel: the kernel is effectively the client, and the user-space daemon is effectively the file system. The client, the kernel, just takes the information from the syscall, does something with that information, produces a request and sends it over the /dev/fuse device, and user space reads that. And so, if we want to do all the permission checks on the user-space side, and we want to support ID-mapped mounts, we need to pass these ID mappings over the wire — so we need to extend the protocol that we use between user space and kernel space for FUSE, right?
Also, some file systems — this is also about FUSE, effectively — can allow you to completely disable the standard permission hook, so it's effectively implemented as an almost empty thing that just allows everything, and then all the permission checks are done at the level of the inode operations. And the problem — I can remember that I saw this while I was working on Ceph — is that in Ceph it's possible to set a configuration based on the path to the file, and specify the user IDs and GIDs that are actually allowed to read a subdirectory. It means that you have a combination of permission checking on the Linux kernel side, and then some permission checking on the kernel side — on the server side, I'm sorry — which is a remote server with another kernel, which does not know anything about this stuff, right? And they do checks almost everywhere, even for lookup. And why is that bad for lookup? First of all, because the lookup inode operation does not have an idmap argument, and it's not obvious why it doesn't. The reason for that is that the lookup operation usually gets called from the slow lookup path in the kernel, right? If you have cached dentries for some path, then we won't go into this lookup callback; instead we will just take the dentry. And it means that if you have permission checks inside lookup, then everything will depend on whether you already have this dentry or not. If you don't, you go into lookup and the permission checks are done. If this dentry is already cached for some reason — for example, if this dentry was accessed from another mount by another user — then these permission checks won't really happen. That's bad, right? That's why, ideally, we want to have all the checks in one place for this stuff.
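The lookup problem just described can be shown with a toy illustration (all names here are invented for the sketch; this is not how any real file system is written): if a permission check lives inside lookup(), a dentry-cache hit silently skips it.

```python
# Toy illustration of the anti-pattern: permission checks inside lookup()
# disappear whenever the dentry is already cached.

class ToyFS:
    def __init__(self):
        self.dcache = {}                 # path -> dentry
        self.checks_performed = []

    def lookup(self, path, uid):
        # Slow path only: the permission check lives here (the anti-pattern).
        self.checks_performed.append((path, uid))
        if uid != 0:
            raise PermissionError(path)
        dentry = object()
        self.dcache[path] = dentry
        return dentry

    def resolve(self, path, uid):
        # Fast path: a cache hit bypasses lookup() entirely.
        if path in self.dcache:
            return self.dcache[path]
        return self.lookup(path, uid)

fs = ToyFS()
fs.resolve("/secret", uid=0)             # root populates the dcache...
fs.resolve("/secret", uid=1000)          # ...so this user is never checked!
assert fs.checks_performed == [("/secret", 0)]
```

Whether the unprivileged access succeeds depends only on cache state, which is exactly why the checks belong in the permission callback instead.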
And of course, some of you may say that, okay, in this case we can do some permission checks in the d_revalidate helper, which always gets called, yeah, to revalidate — but no, we don't want to do that, I guess. So, yeah. Also, a third case that I had almost forgotten about is that some file systems have a local feature, ideologically really close to what we have in Linux, that does some UID/GID mapping at the level of the file system itself. And that's also a problem, because I personally don't understand how to combine all of that together to make it work properly. Yeah, so for Ceph, what I found is that we effectively have a combination of the classical permission checks and the server-side checks. Speaking honestly, we decided to forget about that, because we just decided that if someone uses ID-mapped mounts, we clearly say: okay, you don't want to use the server-side permission checks in this case, just disable them, just trust the kernel, just trust the client — because Ceph really trusts the client anyway. If you have the key to interact with the MDS server, you can do anything. So there is no real reason to do some additional checks, because if you have the user ID checks on the server side, the client can still give you any UID, right? So it makes no sense to check that, because this information is not trustworthy, so. Then we have this lookup problem, which is okay, because it only matters in the case when you have some additional setup, some additional configuration. And the last one is that, for some reason — I guess historically — Ceph uses current_fsuid() everywhere to get the current user ID. Yeah, thanks.
To get the current user ID — but what we usually want is to take the credential structure from the file, because when you are opening a file descriptor, the credential structure from your current task gets stashed into the struct file. And then we expect that if you do, for example, a write() syscall on this file descriptor, all the permission checks will be done relative to the credential structure that we have on the file. And you may ask me why this is so important. It's important if you want to pass the file descriptor over a Unix socket, or if, for example, you open the file descriptor while you are privileged, but then you drop some capabilities, or call setuid or something, and you effectively lose your privileges — and so that can be a problem. But, to be honest, I decided not to send fixes for that, because I don't want to break any real user-space application. I don't know, maybe someone relies on that. So that's technically not ideally correct, but you see. So, yeah, I've effectively covered that. Yeah, what we decided to do: we just ignored these problems with the server-side permission checks, because we can't really do anything about them. And we were asked to by the CephFS folks, the CephFS maintainers — thanks, by the way, thanks to them, to Venky Shankar and Xiubo Li, for the help and the reviews — because they were reviewing this stuff, especially the user-space part, because I had to extend the on-wire CephFS protocol and add some extra UID and GID fields for the inode creation operations. And of course, all of that was done in a backward and forward compatible way, so as not to break anything. Yep, and what we are doing right now: we're currently working on FUSE, and I have already sent a series of patches that enables support for FUSE.
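The credentials-at-open point made above can be shown with a toy model (all names invented; this only mirrors the idea that `file->f_cred` is captured at open() while current_fsuid() follows the caller's later state):

```python
# Toy model: credentials captured at open() vs. reading the caller's fsuid
# at each operation (the current_fsuid() anti-pattern described in the talk).

class Task:
    def __init__(self, fsuid):
        self.fsuid = fsuid

class File:
    def __init__(self, task):
        self.f_cred_fsuid = task.fsuid   # stashed at open(), like file->f_cred

def check_against_file_cred(file, current_task):
    return file.f_cred_fsuid             # what the checks should use

def check_against_current(file, current_task):
    return current_task.fsuid            # the current_fsuid() anti-pattern

privileged = Task(fsuid=0)
f = File(privileged)                     # opened while privileged...
privileged.fsuid = 1000                  # ...then privileges are dropped

assert check_against_file_cred(f, privileged) == 0
assert check_against_current(f, privileged) == 1000  # the two views diverge
```

The same divergence appears when the descriptor is handed to another task over a Unix socket: the opener's credentials and the current caller's are no longer the same thing.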
Unfortunately, only for the mode where we have default_permissions set, because, as I said, if you have a FUSE mount without this flag called default_permissions, then the permission callback is effectively almost empty — it just allows everything. And in this case the FUSE file system expects that user space will do all the permission checks, which is a problem, because we can't handle that properly. And also, obviously, the FUSE protocol between user space and the kernel was extended to send these UIDs and GIDs over the wire, let's say. Yep, and in addition to this series, I wanted to be absolutely sure that this really works properly, so I took three not random, really not random, file systems. fuse-overlayfs, just as a good and relatively simple example — well, for this specific case; it's not simple at all. So, fuse-overlayfs; ceph-fuse, because I was already familiar with Ceph a little bit while I was working on it; and GlusterFS, which is a new one for me. For GlusterFS it's not an ideal implementation, because I unexpectedly found that GlusterFS also likes to do all the permission checks in user space by default. That was a bit painful, but I found a special configuration option that allows you to disable that and enable the default_permissions thing for that file system, and that allows us to make it work. So, to do: we plan to go further with the FUSE series, to make it fully tested and covered, to be absolutely sure that everything is fine, and then we want to convert 9p and virtiofs, which can be useful if you do some nesting stuff, like a virtual machine with a shared directory from the host and then a container inside, for example, which is not a rare case. And yeah, that's all. Questions? Thank you. Hello, thank you for your talk. Are there any caveats with ID mappings and their interaction with LSMs? Like, if you're doing some checks in LSMs, what kind of UID do we get there?
Because I was confused. That's a good question, to be honest, because all of this ID mappings work was done by Christian — thanks to him, because he did all of this great API in the kernel, all of this preparation stuff. I mean that our isolated user namespaces work, and how we managed to make it work with the file systems, became so small, in terms of lines of code that were modified, just because Christian did all of this crazy, complex, hard stuff in the kernel a few years ago. He effectively provided us with two functions in the kernel that we can patch easily, relatively easily, and so we get ID mappings supported for some crazy new case, right? And, to be honest, I don't know much about LSMs, so I guess that it should be integrated. So, when I did the original work, I went through all of the LSMs. For example, LSMs like SELinux don't deal with UIDs and GIDs, don't care about this at all. So most of these LSM functions don't get passed the idmap or UID and GID values at all. The only relevant hooks are things like security_file_open and so on. And then it's mostly Tomoyo, and possibly some AppArmor stuff, and they are all patched to take the ID mapping into account. Although one caveat is that I once tried to do some additional fixes inside Tomoyo itself, because it does kind of weird stuff, but the maintainer said: no, we don't care. I mostly care about the BPF LSM, because the hook doesn't get the UID, but you can extract it from something. Oh yeah, they are aware of that, I talked to them. So, for example, if you do a BPF LSM, then in hooks like security_file_open you get the relevant ID mapping provided. And in other hooks, where you only have the inode, yeah, then you don't have access, but that's also, for example, not feasible.
Like, no, there is no security hook in lookup, but there are certainly locations where we have security hooks where — for example, in the dentry cache — you don't have any of that information available, and it's impossible to make that work. You mentioned the lookup stuff; for the lookup stuff itself, there were two reasons why we didn't do it this way. First of all, because in lookup you initialize an inode, and that always needs to take the global UID and GID into account, the one that you see everywhere. Otherwise you end up with inode aliases, in a way, because you can't cache an inode per mount — that's the one thing. And the other thing is that lookup is called from deep within the dentry cache, which would have meant that suddenly you would have had to pass mount information through the dentry cache, more or less. It doesn't make any sense. Also, Al would have killed me. But I mean, that's another reason why we don't want to have this in these locations. But, for example, for BPF LSMs, if they need that sort of information in specific hooks and it is doable, then we can easily extend the hooks. Like, I don't have a problem with this; it's more of an LSM question, whether they're ready to do this. I think for most LSM hooks it simply hasn't been done because the LSMs that implemented a specific hook didn't want this information, so it didn't make sense to provide it. If you have an LSM that wants this information, it's easy to extend it. Well, I think the other point is that the LSMs should probably work with the kuid anyway, because it's always tricky when you provide a policy from user space based on a current ID — you then need to translate it in the LSM. Question? Yeah, you mentioned NFS real quick. How does it work with NFS? If I remember correctly, there's an upcall through the Linux keyring, right? So you get the translated...
What is Linux kernel keystore and why you should use it in your next application
All right, so the next talk is going to be about the Linux kernel keystore and why you should be using it in your next application. Thank you. Hello, my name is Ignat. I work for Cloudflare, and today we're going to talk about the Linux keystore. By the way, how many people here know that Linux has a keystore? Cool, many hands. Because, like James showed us earlier, it has a keystore, but probably not everyone knows that Linux actually has one. So, yeah, a little bit about myself. I do Linux at Cloudflare. I'm passionate about systems security and performance. I like low-level programming: Linux, bootloaders, drivers and other stuff written in scary, unsafe languages. And I'm a die-hard Linux fan — that's why I'm presenting from a Mac. And, probably like most of you here, I'm a fugitive programmer, because the NSA banned writing in the C and C++ languages in enterprises. And why is that? There are many reasons, but one of them concerns application keys in memory. And by the way, here is the excerpt where the NSA recommends that organizations use memory-safe languages whenever possible. So what is the problem with application keys? By keys, we're talking about cryptographic keys, right? So, to dig into that, let's review the Linux address space isolation concept. So yeah, you have these many processes running on your system, because Linux is a multi-threaded, multi-process system. But what do these processes have inside, right? Usually it's your code — compiled code, your business logic; some shared libraries, if your application uses shared libraries; and some data, like global data and the stack. And yeah, I have the stack drawn separately, so it's data — heap and global variables — and then the stacks, right? And then you have the kernel, right? Everything runs on the kernel. In the kernel you also have the core code, you have static and dynamic data, and you have the drivers, which you load as modules.
And you also have a stack, or stacks, if you have different threads, right? And the idea of address spaces is that within each process, and even within the kernel, everything can access everything, right? So it's like one global space, whereas you can't access the memory of another process from one process, and you also can't access the memory of the kernel. It's separated. This is Linux address space isolation. If we zoom in on one of the processes, let's actually review what can be in there, and what can be in your data. It can be some internal state: you have global variables, so applications can keep some internal state in the data section. Yeah, your process can have user or customer data, if it processes some external inputs and does stuff with them. Right? And the most important thing is cryptographic keys. If your application does some level of encryption, it probably has some keys in the process address space. And what if your application suddenly becomes compromised, either through your main application logic or through a library? Well, because it's all in the same address space, it means your whole data section is compromised, right? But not all data is created equal. So, well, yeah: if your application's internal state is compromised, well, it can be good or bad, right? It depends — it depends on your logic. Of course, it can be bad if the attacker gets control of some kind of data which can, for example, change the control flow of your application. If you're verifying a password, they can flip true to false, or turn some authenticated flag on, and yeah, this can be bad. Sometimes it's not as bad, if your application is simple, but it can lead to further compromise. Well, if your user or customer data is compromised, then it's much, much worse. And yesterday Equifax, my favorite company, was also mentioned.
Yeah, if your user or customer data leaks, it's a big problem, because it creates a lot of pressure on the company and you have to pay a lot of fines. It's very, very bad, but still more or less recoverable — Equifax is still in business to this day, unfortunately. But what about cryptographic key compromise? This is a total game over, right? If your identity key is leaked, anyone can be you. If your main data encryption key is leaked, everyone knows your data. So it's a data integrity compromise, a full security compromise, and a total identity takeover. So what are the methods, at the thousand-foot view, by which you can leak your application keys, right? Well, first of all, untrusted inputs and out-of-bounds memory access. So imagine you have stuff written somewhere in your memory, right? And it may be that near that stuff you also have a cryptographic key in the same memory. And the normal application logic should allow you to read only the stuff. But, as happened for example in Heartbleed, if you can make the application read past the buffer boundary, you can also read the cryptographic key, right? And this is what happened with Heartbleed — everyone remembers Heartbleed. Well, if your application has arbitrary remote code execution, what else is there to discuss? It's game over, right? The attacker can control the execution of your binary, and, due to everything being in the same process address space, they can read everything and also write everything. Not much to discuss there, but a recent example was Log4Shell. Everyone remembers Log4Shell — who patched Log4Shell? I should have asked yesterday, in the Java room, right? Well, buffer reuse can also be a problem for leaking a key. For example — of course, this is a very simplified program, specifically tailored to leak the key — but it illustrates the point.
So for example, it has two functions, encrypt and log. And oh no, we forgot to initialize the logging message buffer in the log function. If you actually execute it, you will see that it leaks the cryptographic key. What happens is: you have the process or thread stack, and you have your main logic. For example, you call the decrypt or encrypt data function, which will get the key from somewhere and may put it on the stack, depending on the implementation. Then the function exits, but if it doesn't clean up the stack containing the key, the next function can take that stack space over and actually has access to that cryptographic key. This is why all the compliance and security folks will tell you that you always need to zero memory after key use; you have to clean up. Which is hard to do in many high-level programming languages, especially garbage-collected ones. Finally, you have the debugging tools. Logging can accidentally leak your keys, and so can core dumps, GDB, ptrace: everything that can access the memory of the application can leak a secret. So, let's just make our applications not crash and fix all the problems, right? We obviously can't fix all the bugs, so we have to do something else. We probably can't make a completely secure application, but what can we do specifically for cryptographic keys? Because they are the most valuable data in our process address space. What some applications do is try to leverage the operating system's address space isolation: they create another process, which has a different data section, and move the cryptographic keys over to that process. Then you write some very basic, very simple cryptographic logic, which is unlikely to have bugs, to handle these keys on behalf of the main process.
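The stack-reuse leak just described can be sketched as a tiny C program driven from the shell. Everything here is invented for illustration (the function names, the hard-coded "key"); whether the leak actually reproduces depends on the compiler, the optimization level and hardening flags such as `-ftrivial-auto-var-init`, so treat it as a demonstration of the idea, not a guaranteed exploit:

```shell
# Sketch of the stack-reuse key leak described in the talk.
command -v cc >/dev/null 2>&1 || { echo "no C compiler available"; exit 0; }
cat > leak.c <<'EOF'
#include <stdio.h>
#include <string.h>

void encrypt_data(void) {
    char key[32];
    strcpy(key, "SUPER-SECRET-KEY");  /* pretend we fetched the key */
    /* ... encrypt something with it ... */
    /* BUG: forgot memset(key, 0, sizeof(key)) before returning */
}

void log_message(void) {
    char msg[32];                 /* BUG: never initialized */
    printf("log: %.31s\n", msg);  /* may print the previous frame's data */
}

int main(void) {
    encrypt_data();
    log_message();  /* its stack frame can overlap encrypt_data's frame */
    return 0;
}
EOF
cc -O0 -o leak leak.c && ./leak
```

At `-O0`, many compilers lay out the two frames identically, so the "log" line prints the leftover key; zeroing the key buffer before returning, as the speaker recommends, prevents this.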
And then you create some kind of well-defined, tight interface between the two processes. We call it the key agent model. You have two processes, the main process and the helper agent. The main process does not have the cryptographic material in its address space, and it communicates with the agent through a well-defined interface to perform cryptographic operations on its behalf. The agent usually doesn't process untrusted input, it's not connected to the network, and usually more scrutiny goes into its review. Some examples of this we all use every day. Who here uses SSH? Who here doesn't use the SSH agent? You don't? Yeah. So, ssh-agent, gpg-agent, things like that. But there are drawbacks to this approach. We need to develop and maintain two programs. We need to design this well-defined interface. We need to add communication, that is, think about how these processes talk to each other: should we use Unix sockets, shared memory, HTTP, something else? And it's probably good to somehow authenticate the main process to the agent, because if the agent is this thing that performs cryptographic operations, we don't want just anything on our system talking to it and being able to make signatures with our keys. This is where we get to the Linux kernel key store. The official name is the Linux kernel key retention service. I call it the key store; some people say key ring, but actually the key store has many keyrings, so I think key store is the most applicable term. What it does is basically take this agent model and, instead of a second process, replace it with the kernel, and the well-defined interface is just system calls. Easy. So in a nutshell, the Linux kernel key retention service stores cryptographic keys as kernel objects. And this gives us some flexibility. It was initially designed to share keys with kernel services themselves.
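The agent model mentioned above is exactly how ssh-agent works, and it can be seen in a short terminal session. This is a sketch assuming OpenSSH is installed; the key file name is made up for the demo:

```shell
# Sketch of the key agent model with ssh-agent (assumes OpenSSH).
command -v ssh-agent >/dev/null 2>&1 || { echo "OpenSSH not installed"; exit 0; }

# Start the helper agent; it prints env vars telling clients where
# its unix socket lives.
eval "$(ssh-agent -s)"

# Generate a throwaway key and hand it to the agent. From now on the
# private key lives in the agent's address space, not the client's.
ssh-keygen -q -t ed25519 -N '' -f ./demo_key
ssh-add ./demo_key

# The client never touches the key again; it asks the agent to sign.
ssh-add -l

# Clean up the agent process.
ssh-agent -k
```

The SSH client only ever talks to the agent over the `SSH_AUTH_SOCK` unix socket, which is the "well-defined, tight interface" between the two processes.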
So for disk encryption, for example, you pass a key to the kernel and the kernel uses it. But eventually it was extended to user space. The advantages: keys are now stored outside of the process address space; you already have a well-defined system call interface to access and use the keys; and keys become kernel objects, so you can have associated access control lists and permission checks, like you have on files or other kernel objects. And the nice thing is that the key life cycle can be implicitly bound to the process life cycle, for example securely deleting a key even if the process terminates abruptly. And for a kernel feature, it surprisingly has quite good documentation. So what does the key store look like? It's a collection of keyrings and keys. A keyring can have links to other keyrings or to keys, so you can get a tree-like structure. Keys are just objects that contain the actual cryptographic material, or a pointer to it. They can be read and written, and used to perform cryptographic operations. There are several key types, which I'll get to later: user, logon, asymmetric, encrypted and trusted keys. It's similar to a file system, but unlike a file, which can live in only one directory (if you don't take into account weird bind mounts or hard links), keys can be part of many keyrings at once. Keyrings are collections of links to keys, and they basically enforce the life cycle of a key: if a particular key is not linked to any keyring, it gets automatically destroyed. Keyrings can be explicitly created, or implicit special ones: thread, process, user and session. They enforce the key lifetime, and they are similar to a directory in the file system. So let's see an example. By the way, all the examples I'm showing I copied from a real terminal.
So it's a demo which doesn't fail. In this example, I'm creating a new keyring and linking it to my implicit user keyring. Each key or keyring is designated by a serial number, which you can see; it's a unique number of the object inside the kernel. Once I've created the keyring, I can add a key there, with some secret contents, hunter2, to my keyring. Then keyctl show displays my keyring and key tree: we have the session keyring, the user keyring, my keyring and my key in there. You can see that the serial numbers match what we just created. And because I just created the key, I have access to it, so I can read the cryptographic material back and get the secret. One of the things you can use this for is secret sharing between two users. You have Alice and Bob, two users on the system, and you may notice they don't have anything in common: separate groups, separate IDs, everything is separate, no common groups or permissions. Alice can create a secret, hunter2, and put it in her user keyring. What Bob can do is create a new keyring for keys from others, a recipient keyring. Bob can set permissions on that keyring so it allows everyone to write there; write means adding links to other keys. Then, if Bob communicates the serial number to Alice, Alice can just move that key to Bob's keyring, and we now see that Alice doesn't have the key in her possession anymore, and Bob can now read the cryptographic material, because Bob possesses that key. Simple. There are special keyring types, and these special keyring types determine the life cycle of a keyring. There are session keyrings, which are available to the current process and all its children.
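The terminal demo above can be reproduced with the `keyctl` utility from the keyutils package. The keyring and key names here are invented; `@u` is the implicit user keyring, and the script skips gracefully on systems where the key-management syscalls are filtered (as in some containers):

```shell
# Sketch of the keyring/key demo (requires keyutils).
command -v keyctl >/dev/null 2>&1 || { echo "keyctl not available"; exit 0; }

# Create a named keyring linked to the implicit user keyring (@u).
RING=$(keyctl newring myring @u) || exit 0

# Add a 'user' type key with a secret payload to that keyring.
KEY=$(keyctl add user demo:password hunter2 "$RING")

# Show the tree of keyrings and keys we just built; serial numbers
# shown here match $RING and $KEY.
keyctl show @u

# Since we created the key, we possess it and can read the secret back.
keyctl print "$KEY"

# Clean up.
keyctl unlink "$RING" @u
```

The Alice-and-Bob sharing demo uses the same commands plus `keyctl setperm` (to open a recipient keyring for linking) and `keyctl link`/`keyctl unlink` to move the key across.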
So for example, if you are systemd and you put a key in the session keyring, it will be available to every process on the system spawned by systemd. The process keyring is private to a particular process: every process has its own implicit keyring which it can use to store process-specific credentials. And there is also a thread keyring, which is specific to a particular thread. Say you write a web server which serves several websites, and each website has a different TLS key. If you serve a website per thread, you can securely store the TLS key for that thread, for that website, without other threads even having access to that key, which is really cool. There are also user keyrings, which are bound to the life cycle of a user: a keyring shared between all the processes with the same user ID. And there is a user session keyring, which is similar to the user keyring but not important in this context. There is also a type called persistent keyrings, where the name is a little bit confusing, because they are not actually persisting the keys on disk; it has nothing to do with that. It's just that the life cycle of these keyrings is different: they're not bound to a process or a user, they're time-bound. If you don't access the keyring for a timeout period, it gets automatically destroyed. This is useful, for example, in cron jobs, where you can't really bind a keyring to a user, because that user appears and disappears from the system, but you can put a time bound on it: while your cron job keeps running, your keyring will be available, and if for some reason your cron job stops running, the key will eventually be destroyed. So let's see a session keyring example. Let me add my favorite hunter2 secret to my session keyring, and imagine I'm in an SSH session to this particular machine.
I can see that my key exists, I can see its ID, and it's linked to the session keyring. What I can do now, for example, is in another terminal put a BPF probe on the key destruction function, which is responsible for securely destroying keys in the kernel key store. If I now just exit my SSH session and log out, I can see that the probe fires: my key was automatically destroyed, because my session ended, so my session keyring got destroyed and all the keys linked to it were automatically destroyed as well. And if I log back in, I can see that my session keyring changed: it was destroyed and recreated automatically, and I don't have the key anymore. So what this gives you is that, if you select the appropriate keyring type, you can ensure that keys will be securely destroyed when no longer needed, and you don't have to explicitly clear the memory; it happens for you. For example, if a key is bound to a process keyring and the process dies, the key gets destroyed, regardless of how the process dies: successful exit, crash, core dump, whatever, the keys will be gone. Okay, so now let's consider the different key types. We've covered the keyring types; of the key types, the simplest one is the user key, which we just saw. You have the cryptographic material, you put it inside the kernel, and then eventually either this process or another process with the relevant permissions can read that secret back. There is also a special type called the logon key, which you can put inside the kernel but can never read back. This type is primarily used to share secrets with the kernel for disk encryption or eCryptFS. On a relatively recent Linux distribution, if you dump your dm-crypt setup, you will see that some of your keys are actually coming from the kernel keyring, instead of seeing the key bytes directly.
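The logout demo needs two terminals and a BPF tracer, but the same lifecycle can be observed with `keyctl session`, which runs a command inside a fresh anonymous session keyring that the kernel destroys when the command exits. A sketch, assuming keyutils is installed:

```shell
# Sketch of session keyring lifecycle (requires keyutils).
command -v keyctl >/dev/null 2>&1 || { echo "keyctl not available"; exit 0; }

# Run a subshell inside a brand-new anonymous session keyring; the key
# added to @s (the session keyring) lives only as long as that session.
keyctl session - sh -c '
    keyctl add user demo:session-secret hunter2 @s
    keyctl show @s
'

# Back in our own session: that keyring is gone, and with it the key.
# The kernel destroyed the material when the session ended.
keyctl search @s user demo:session-secret 2>/dev/null || echo "key is gone"
```

This is the same mechanism that destroyed the speaker's key on SSH logout: the session keyring's life cycle, not any explicit cleanup in the application, determines when the material is wiped.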
There is also an asymmetric key type, which currently only supports RSA. You put an RSA key inside the kernel, and you don't read it back, but you can perform operations with this key: you can instruct the kernel to sign data or decrypt something with it. Here is a simple example with OpenSSL. We generate an RSA private key. The kernel understands only unencrypted PKCS#8 format for private keys, so we have to convert it to PKCS#8, and then we can add it to the kernel, ask the kernel to sign something, and then verify with OpenSSL that the signature is valid. Which is very useful. Everything I'm describing today, and more, is described in a Cloudflare blog post, and there we have an example where we completely replaced the SSH agent. It's a proof-of-concept patch, but we patched OpenSSH and replaced the SSH agent with the kernel key store: instead of ssh-add you run our bash script, which puts your private SSH key into the kernel key store, and if you run the patched SSH client, it works the same as if it were communicating with an agent, but you don't need any agent running on the system. Cool, this is all well and good, and surprisingly the key store can be very useful as a big corporate key management building block. But the question remains: in all the previous examples you just saw, we still need to put the keys into the kernel. We don't want the secrets to be in the application address space, but we still need the application to put them inside the kernel. So even if the application cleans up after itself, there is a small window of opportunity where the application has the plaintext secret in its address space. How can we provision application keys without the cryptographic material ever being exposed to user space at all?
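The OpenSSL flow the speaker describes looks roughly like this. It is a sketch: it needs a keyutils build with the `pkey_*` subcommands and a kernel with the PKCS#8 private-key parser enabled, which not every distribution ships, so the `padd` step is guarded and may simply bail out:

```shell
# Sketch: RSA key in the kernel keystore, signing via keyctl pkey_sign.
command -v keyctl  >/dev/null 2>&1 || { echo "keyctl not available"; exit 0; }
command -v openssl >/dev/null 2>&1 || { echo "openssl not available"; exit 0; }

# Generate an RSA key and convert it to unencrypted PKCS#8 DER,
# the only private-key format the kernel parser understands.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out priv.pem 2>/dev/null
openssl pkcs8 -topk8 -nocrypt -in priv.pem -outform DER -out priv.p8
openssl pkey -in priv.pem -pubout -out pub.pem

# Load it into the kernel as an 'asymmetric' key (fails if the kernel
# lacks the PKCS#8 parser).
KEY=$(keyctl padd asymmetric demo:rsa @u < priv.p8) || \
    { echo "kernel cannot parse PKCS#8 private keys"; exit 0; }

# Ask the kernel to sign a SHA-256 digest of our data...
printf 'hello fosdem' > data
openssl dgst -sha256 -binary data > data.sha256
keyctl pkey_sign "$KEY" 0 data.sha256 enc=pkcs1 hash=sha256 > data.sig

# ...and verify the signature in user space with OpenSSL.
openssl pkeyutl -verify -pubin -inkey pub.pem \
    -sigfile data.sig -in data.sha256 -pkeyopt digest:sha256
```

The private key material never leaves the kernel after `padd`; user space only ever sees digests and signatures.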
For this we have two other interesting key types. One is called the encrypted key. In this case the process has not the plaintext key material, but key material encrypted with some other key, and the kernel has the wrapping key. When the process inserts that key into the kernel, the kernel automatically unwraps it, and if we try to read it back, it gets automatically wrapped by the kernel again. But here we have a chicken-and-egg problem: how do you provision the wrapping key? Well, as James showed earlier today in his demo, you can replace this with a TPM, and then you have a thing called a trusted key. Again you have a wrapped key, but wrapped to a particular TPM. You can insert it in the kernel and the TPM will automatically unwrap it, and again, if you read it back, it comes out wrapped. But this scheme alone is not really great, because, as James mentioned, TPMs are slow and there is only so much you can do with these operations. If you have thousands of keys, you don't want to continuously poke the TPM to unwrap them. So you can do a combined approach, where you have some kind of provisioning: you have some kind of HSM, in the cloud or on-prem, whatever, which holds your cryptographic keys, and you provision a root key first. You wrap the root key to a particular machine, to its TPM, you insert it and the TPM unwraps it. All the other thousand keys are encrypted with this root key, so the process receives a wrapped key and puts it inside the kernel, and then you don't go to the TPM: you already have the root key, which, being a software implementation, can easily unwrap all the other thousand keys.
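A minimal sketch of the wrapping idea uses the `encrypted` key type. For demonstration the master key here is a plain `user` key; in the combined approach just described it would be a TPM-backed `trusted` key instead, and the kernel needs the encrypted-keys support compiled in, so the steps are guarded:

```shell
# Sketch of an encrypted (wrapped) key (requires keyutils and
# encrypted-keys support in the kernel).
command -v keyctl >/dev/null 2>&1 || { echo "keyctl not available"; exit 0; }

# A master ("root") key. In production this would be a trusted key
# unwrapped by the TPM, not a user key.
keyctl add user kmk "$(head -c 32 /dev/urandom | base64)" @u >/dev/null || exit 0

# Ask the kernel to *generate* a new 32-byte key wrapped by kmk.
# The plaintext of this key is never seen by user space.
EK=$(keyctl add encrypted demo:wrapped "new user:kmk 32" @u) || \
    { echo "encrypted key type not enabled in this kernel"; exit 0; }

# Reading it back returns only the wrapped blob, never the plaintext.
keyctl print "$EK"
```

Swapping `user:kmk` for `trusted:kmk` in the payload string is what binds the whole hierarchy to the TPM, while only the single root key ever touches the (slow) TPM.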
But there are still problems with this approach. Even though the application never sees the cryptographic material in its process address space, applications are still responsible for receiving this wrapped cryptographic material from the centralized KMS or HSM service that wraps their keys. Who here uses Vault? Yeah, some people. So you need to know what your Vault endpoint address is, you need to speak the Vault protocol or the AWS KMS protocol, you need to integrate all this crap in your code. And there is little administrative control, if you're managing a fleet of machines, over the created kernel key objects: applications, when inserting a key, can set invalid permissions. For example, if you set improper permissions on your RSA private key, any application on your system, even a malicious one, can use it to sign or decrypt data. And ideally you also want authentication here: the KMS or HSM, that remote service, needs to somehow authenticate each requesting application before it provides the wrapped cryptographic material. So how does the kernel try to solve that problem? It has two sets of system calls. So far we've been using the add_key system call, via the keyctl utility, which adds a key with the specified payload to the specified keyring. The application is responsible for the payload itself: either plain text, or, in the case of trusted or encrypted keys, the encrypted payload; it gets it from somewhere and inserts it into the kernel. The payload is interpreted according to the key type: no interpretation happens for user and logon keys, because those are mostly symmetric keys, which are random strings; it's a private/public key for asymmetric crypto, or wrapped material for encrypted and trusted keys.
But there is another interesting API in the kernel, called request_key. Instead of inserting the payload directly, applications can ask the kernel: just give me my key, with an arbitrary string as an identifier. And it's on the kernel to actually satisfy that request. Obviously the kernel has no idea of everyone's setup, where it should take the key from, so this is one of the cases where the kernel makes a user space callback, with a special helper program which you can configure to actually deliver your keys. It's a more centralized and transparent API for the whole system. How it works: instead of adding a key, the process requests the key from the kernel and provides the identifier, like give me my cloud app key one. The kernel creates a placeholder, then it creates a special callout process, a helper process in user space called request-key. This one you can configure, and you can specify different routes for different key types. For example, if I requested the cloud app key one, it will go to the cloud sub-module. You can write these sub-modules in any programming language, by the way; it doesn't have to be C, you can write them in Go, or they can be simple bash scripts. They are responsible, if the path is cloud, for contacting your cloud HSM, getting the wrapped cryptographic material and putting it back inside the kernel; the kernel will then instantiate the key, and the application gets its key back.
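The routing the speaker describes is configured in /etc/request-key.conf. A hypothetical entry and helper for a "cloud:" key namespace might look like the following sketch; the helper path, the `cloud:` prefix and the `fetch-from-hsm` command are all invented for illustration, while `%k` (key ID), `%d` (description) and `%S` (requester's session keyring) are standard request-key.conf substitutions:

```
# /etc/request-key.conf
#OP      TYPE  DESCRIPTION  CALLOUT-INFO  PROGRAM ARG1 ARG2 ...
create   user  cloud:*      *             /usr/local/sbin/cloud-key-helper %k %d %S

# /usr/local/sbin/cloud-key-helper (sketch; could be any language):
#   #!/bin/sh
#   key_id=$1 desc=$2 dest_ring=$3
#   # ... policy checks on the requester, then fetch the wrapped
#   # material from the KMS/HSM ...
#   payload=$(fetch-from-hsm "$desc")    # hypothetical command
#   keyctl instantiate "$key_id" "$payload" "$dest_ring"
```

The helper runs with a kernel-granted authorization token, which is why, as noted below, launching the same binary yourself (even as root) cannot instantiate the pending key.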
So the advantages of request_key: you have a single, centralized operating system API for the application to request keys, so there are no KMS or HSM connection strings arising in your configuration, just a freeform ID string. It fully decouples your application from the key storage backend: the application doesn't care where the keys are stored or how they are distributed. And it's a more secure way to instantiate keys in the kernel. This special callout process created by the kernel is very special in the sense that it has a special credential enforced by the kernel, so even if you launch the same helper process yourself as root, it will not be able to instantiate the requested key, because it doesn't have the specific token from the kernel to do it. This callout process is very useful; in fact, it can be made trustworthy, so you can perform additional security checks and implement arbitrary policies there. You can check the requestor: user ID, group ID, executable path, package name, whatever you choose. Is this application even allowed to request the key in the first place? You can immediately deny the request. And you can support multiple key storage backends: local storage, a TPM backend, a cloud HSM backend, whatever. You can even swap these backends transparently: if, for example, you migrated from an on-prem HSM to a cloud HSM, all you have to do is modify this helper process's config file, and applications will not notice. And then you have the nice property that you only need to authenticate this single helper process on your backend. As I mentioned, the backend connectors can be written in any language, so it's very easy to extend. The nice thing is that, with request_key, key management and distribution becomes a core service of the operating system itself, as it should be, instead of every application having to deal with it on its own.
That's basically it for today. Here are some links to the kernel documentation and to the keyring man pages, as well as the last link: again, everything I told you today, and even more, is described in the Cloudflare blog post linked at the end. Thank you, and I'm happy to talk to you. Thank you for the great talk. I recall there was an API for user space to protect memory from the kernel: a given page was unmapped from the kernel, so if you had an out-of-bounds access in the kernel, you couldn't access that memory, but of course the kernel could remap the page back again. My question is: are the keys protected in such a way in the kernel? And do you think it would make sense to do it? It would potentially minimize the exposure, in theory at least. By default, I'm not sure about the implementation, but I would say no, I don't think the keys are more protected in that way. Actually, the guy who wrote it is right there. And what was the question? If you put a key of a user space process into these areas, they will be more protected than otherwise. It still doesn't guarantee 100%. My point is the kernel could also do it so that it would protect those keys from itself as well, and only remap the page back when you actually do the request_key for it. But what's the point then? If the kernel needs the keys, it has to have access anyway, and remapping and unmapping is costly. The other thing is that the key store API is internally also extendable: you can write other modules, and this is what I asked James about earlier. You can technically write an asymmetric key implementation backed by the TPM, so the keys will not even be inside the kernel; they will be in the TPM, but then each operation has to touch the TPM. Or you can design some kind of crypto chip, or an ARM TrustZone backend, whatever you want. There was some effort.
I don't remember exactly which areas it touched, to do this sort of separation between subsystems, but I only heard about it once; I don't know what the state of it is. Well, in the kernel, you mean between kernel subsystems? It's still a flat address space at this point, unless you're again using ARM TrustZone or enclaves or whatever. My question is: you mentioned that we can do RSA operations, but not everybody is using RSA. Are there any efforts to introduce other kinds of asymmetric keys? In particular, I'd like to see elliptic curve stuff. So, yes. The kernel currently also supports ECDSA, but only for signature verification; it was added for kernel modules. I sent patches to actually support signature generation through the keystore API, twice, and didn't get any traction on them. I'll send them one more time, maybe. Because I also know that the kernel has its own internal crypto API, with support for all of these operations; they're just not exposed through the key store. Well, specifically for RSA they are; for ECDSA, no. The kernel crypto API doesn't have support for generating ECDSA signatures, so my patch set touched both the crypto subsystem and the key store subsystem: the kernel can then do ECDSA signatures, and this code is reachable through the keystore API. Okay, thank you. Very interesting talk. Thank you. I have basically the same question, but also: wouldn't there be an urgency to get some post-quantum crypto in there? Maybe, but we have to fix ECDSA first; we have to learn to walk before we run, right? James, can you pass it to the next person? So if we now add TrustZone to the picture, does the kernel have any kind of API to interact with it?
I mean, would the key store itself interact with TrustZone to get the key, or do we still need to go to the user space helper, which then goes through the normal way of communicating with TrustZone, a secure monitor call, gets back the result and passes the key back to the kernel? For TrustZone, I think there is some code, though I never tested it on an ARM system: similar to the TPM-backed trusted keys, there is an implementation of trusted keys for ARM TrustZone, the open source one. I saw the code, I never tried it, but it's there. So there is some reference implementation, right? Yes, there is in-kernel support for that. OP-TEE? Yes, OP-TEE. Alright, anything else? Oh, yeah. If you shout, I'll just repeat. The question is which kernel version you need to use this. Sorry? Which version is it available from? The kernel key store itself is quite old, I guess. What we did, I think from 6.1: again, I mentioned the crypto subsystem and the key store subsystem. It was already handy to insert an RSA key and do operations with it, but you didn't have any ability to do the same with a symmetric key. So what we extended is the kernel crypto user space socket API, so that it can be initialized from a user or logon key. From 6.1 you can insert a symmetric key and then create a crypto socket based on that key, to perform, say, AES encryption with that key without exposing the key to user space. So, if I recall correctly, you said that the persistent keys can expire after some time of being unused. Does listing the keys also count as using them? That's my first question. And my second question is: what's the timeout for them to expire? I haven't used them widely enough to know those specifics. I think the timeout is definitely configurable, but for listing, I don't know if listing the keys actually resets the timer.
I just want to answer the question from over here: it looks like the API has been available since 2.6.10, which is quite old. Yeah. There is one person over there. Maybe you shout and I repeat. As a certified microkernel enthusiast, is there a reason why this approach was taken, rather than adding APIs so that a daemon in user space could provide the same benefits? The question was why we didn't do it in user space. How do you add extra functionality to the kernel to give you the same benefits? I don't quite understand the question. The whole point is not to expose cryptographic material to user space. You're saying the benefits are, for example, that if a process dies, you can immediately wipe the key from memory, that sort of thing. You could also add functionality, additional system calls, so that normal daemon processes get those sorts of benefits. Why didn't you do that, rather than sticking extra things into the kernel? Because you can ptrace a process in user space, but you cannot ptrace the kernel. Just saying. Anyway, we are out of time. Thank you very much. I'm sure you can catch the speaker afterwards for more conversations.
The new Swiss Open Source Law: "Public Money Public Code" by default
Okay. Let's welcome our next speakers on the new Swiss open source law. Yes, good evening everybody. It's a great honor to be here at this conference, at FOSDEM. It was many years ago that I was last here, but now I'm glad to be back, and I'm very happy to present, together with Rika Koch, the new law that we basically achieved getting passed in Switzerland. It has been a long journey, and it's great that we can now present this. We are also very interested in your feedback at the end, on whether something similar exists in other countries, and on how we should continue on this journey. So, briefly, our background: we are academics from Bern, Switzerland, but for my part I have also been an activist for almost 20 years, since I wrote my master's thesis about open source community building. I'm very glad that we can now present this to you, and Rika will start. Good afternoon from my side as well. My name is Rika Koch. As Matthias mentioned, I'm a law professor at the Berner Fachhochschule in Bern, and I want to speak to you today about the regulation, the legal side, of open source software in Switzerland for the public sector. Here it is again. So, in the beginning, when we talk about regulation of open source in the public sector, there was literally nothing. I wrote here "dark past", but we're not talking about the far past; we're talking about 10, 20, or even two years back. Although there was a strategy of the Swiss federal government that said, well, basically, open source software would be nice, because it's economically efficient and produces good quality, there was nothing in the law. We had a strategy. And when nothing is regulated, you don't really know whether you can do it, whether it is allowed for the public sector to develop their software open source, license it open source, or not. In legal terms, there was a lot of legal uncertainty.
So developers, or the private sector offering their software to the public sector, didn't really know whether it was possible or not. The crucial question here is: can the public sector develop and also distribute open source software, or do they have to do it closed source? If you ask the IT experts, probably all of them tell you yes, of course they can. But in come the spoilsports, the legal people, the lawyers, and they say: wait a second, it's not that easy. So the Swiss government was in this situation where the IT experts say please do open source software and the lawyers say no, no, it's complicated, please don't. And what do they do? They pay a lot of money, of course, to legal experts for a legal opinion. Oh sorry, I hand over to Matthias to explain first why the pressure arose. Thank you. From a historical point of view this is interesting, because open source was being done by several state agencies for many years. On the IT side, there were obviously lots of open source activities on GitHub from different agencies; however, as Rika said, it was not clearly legally allowed. So when we started, from our group of parliamentarians, called the parliamentarian group for digital sustainability, Parldigi, we began lobbying for open source release by the Swiss government in 2011. It was back then that we had our first initiative, asking politicians to support the release of open source software by the government. It seemed like a natural thing, and although the federal government rejected this, saying there would be no additional support, it was basically clear to us that it would take place anyway. Even more, the Swiss federal court has been very open source minded for many years: they are completely based on open source software, they host their entire stack themselves, and they use LibreOffice; back then it was OpenOffice, and all the federal judges use LibreOffice for their work.
And they even wanted to open source their court management system, called OpenJustitia. It still exists; openjustitia.ch is actually still online. But then something happened: a very small Bernese IT law company objected against this release of open source software by the federal court, because their business was jeopardized. Basically, they were afraid of this government competition, because they had a market of local courts to which they sold their proprietary court management systems, and when the federal court, the big court, released open source software, they were obviously afraid that the government would destroy their market. What they did was ask other politicians to hand in a political question asking what the aim of the federal court actually was: did it want to destroy the market for IT companies and compete with small companies? And this was the beginning of this legal dispute over the last 10 years. Now I hand over to the lawyer again, and she will explain to you what the lawyers said about this story. Yeah, maybe you know the saying we have in German: two lawyers, three opinions. And it was exactly that. We had a first legal opinion, issued in 2014, with the crucial question: can the government develop and distribute open source software? This legal opinion basically said no. Why? There is in the Swiss constitution, and I'm sure in other constitutions as well, the principle of competitive neutrality. This means that the government should not mess with the private sector; on the contrary, the government should create so-called favorable conditions for the private sector. And they said that this is not the case when the government publishes open instead of closed source software.
Now it gets a bit legal. They said that distribution of software by the public sector is a so-called economic act that would per se distort the free market. This would actually be allowed, but only if there is a sound legal basis and if it's proportionate, which also means necessary. And these two quite old law professors said: no, it's not written in the law, okay, that was true at that time, and it's not necessary at all, because everything you want to do with software by the public sector you might as well do with closed source software; it's even better suited to fulfil the public tasks. So keep it private, basically. But luckily some persons thought, okay, let's just ask another lawyer, and paid for another legal opinion three years later. And when I say some persons, I look at Matthias. So this second legal opinion said: well, we're not too sure whether publishing open instead of closed source software really is an economic, market-distorting act per se. We might call it an auxiliary service: you make the software because it serves the fulfillment of a public task, and whether you do it with closed or open source software does not really change the fact that you have the legitimation to do so. So they said it's an auxiliary service, it does not distort the market, and we do not even need a legal basis. With this in mind, the government, after a long consultation, issued a law. Although you do not really need a legal basis, they thought, we might as well make a legal basis, better safe than sorry. And they negotiated the so-called federal law on the use of electronic means for the fulfillment of governmental tasks. And there they did not only put in the possibility to make open source software, but a mandatory requirement. This is the text; we will look at it now, and I will speed it up a bit. Here it says: the public bodies subject to this law shall disclose the source code of software that they develop or have developed by
third parties, unless the rights of third parties or security-related reasons preclude this. So the first sentence is the principle: open source is by law mandatory for software that the public sector develops, or for developments that they buy via third parties on the private market. So not for software that already exists; they can still buy pre-existing software on the free market. So there's the rule, and there's the exception. The exception is: you do not have to make software open if rights of third parties would preclude that. That's clear; that's usually intellectual property rights, closed licenses maybe. Or for security-related reasons, and I personally really don't know what this should mean. I've been told by people with more IT knowledge than I have that they do not need it. So if someone of you knows what this could be, please raise your hand afterwards and give us this input. And this was paragraph one, the principle of mandatory open source software. There are a lot of other paragraphs; I won't delve deeper into those, but just to show you how far we've come from "open source software is really distortive and against the market neutrality of the Swiss government": there are also paragraphs four and five, which say that the public sector can develop open source software and can also offer support services and other services to other governmental bodies, if they charge for it, so usually they have to take money, but they can also provide other services, which is then not deemed market-distortive. Thank you, Rika. So basically here we have the issue, and this is the background of these clauses: is the federal court allowed to actually do community building, basically help other courts to use their software, maybe answer some questions and also participate in the community? And this is now also a legal basis for community building on the federal level. Now we
have the law, so what does this mean? A law is not helpful if it's not implemented, and this is basically my activity for the next 10 years, to make this law alive. So one aspect, and this is where Rika comes in again, because she's a specialist in public procurement: we hope to find in the next few years more and more public tenders where the government, when procuring IT solutions, will have to include different criteria which actually support the release of open source software and community building. This is a real excerpt from one of those public tenders. It's actually not under the new law yet, but I think it could serve as a good example from my point of view, because it says, first, that the software being built by the company providing the solution has to be open sourced, under an open source license, on the GitHub account of the city of Bern; then they have to use not just any license but a copyleft license, including the EUPL or EPL, so we heard about that, thank you Bradley; and we also require companies with experience in open source software development and community building, and the community management is also one of the services provided by this IT supplier. So from my point of view it looks quite nice; it's really something which in the future should be used as one of the open source role models, or good practices at least, for IT procurement. Nevertheless, we don't have as much activity in Switzerland compared to other countries, especially Germany, but we do have a few activities. One thing was, during the pandemic, that the federal IT administration released the COVID certificate app as open source software, which was then actually used by the Austrian government for their national COVID certificate. And this is, I think, a very nice example of how governments can interchange source code. Another
example includes the Swiss mapping agency, swisstopo: they supported OpenLayers in the past and collaborated, with some institutionalized crowdfunding with other agencies, on the further development of OpenLayers. Another aspect is a company-initiated project from Switzerland, Adfinis: they started an open source project, Caluma, a workflow component framework, and they have now supported several cantons, several local departments in Switzerland, in using this software, and they founded a community around it. And this is another good example which I think shows that even Swiss people are able to produce open source software in a good way. And the last thing: you have heard today about all the railway activities, the OpenRail Association. This was also partly driven by the Swiss federal railways, and I think it could also be a good example of how the Swiss government, or government-owned companies, can collaborate with others. So there's still hope for Switzerland and its open source activities, and there should be more activity soon. There's one monitoring thing which we do, OSS Benchmark, ossbenchmark.com. This is a very hobby pet project of mine, where we collect basically the open source repositories of organizations from the Swiss government and companies, and look at how many repositories are released by which kind of organization. And here you can see, for about 150 agencies and institutions, how much open source they are already providing on GitHub. Now, what I hope will also help us in the future is the high-level political environment around digital sovereignty and digital sustainability. A few years ago we created a report on data colonialism, where we pointed out the danger of the big tech companies appropriating and privatizing data, and obviously also software. And nowadays I'm working on a new report for the digital sovereignty strategy of the government, where they have to
actually release some new recommendations by the end of this year, and so we hope that this will also help open source software development in Switzerland. Now we are very interested in your feedback for the discussion. First of all, we would be interested: do other countries have similar laws, where it is not just allowed to open source software but it's actually the default? Second question: what is the potential, and what are the challenges, of this new Swiss law? What do you think could also be implemented in other countries? And from the operational and implementation point of view, we know of several activities in other countries, but what in your opinion would be the best thing to do next in Switzerland? Because now we have the law; now we need to do other things as our parliamentary group. Okay, so that's it for the moment, and we're very interested in your feedback. Thank you very much. I was a little bit irritated that you mentioned within the tender that you're demanding copyleft, while for some organizations, within their own code, if they want to contribute to this to reduce their costs, having copyleft may actually be exclusionary, because some people using code under a different, more permissive license may not be willing to put their stuff under copyleft. In other words, what's wrong with actually saying copyleft, or Apache, or BSD, or one of these other more permissive licenses? So, if I understand you correctly, the question is: why is a copyleft license being recommended, which actually excludes a number of organizations who may want to develop software with a much more permissive license than copyleft? Well, it doesn't exclude them; permissive code can be integrated into a copyleft final product. But maybe someone else can add to this. Do you want to respond to that, Bradley? Yeah, please. So what it sounds like you're saying is that if it has a "must" on
copyleft, then they might just have to upgrade the license of code that's under, you know, MIT, to be copyleft when they put the solution forward, right? Yes, as far as I understand: the end product, the final product, can include permissively licensed software, and you can still use the less restrictive licensed software within it. Thank you for the great talk. You just mentioned that there was this argument that providing open source software does market distortion, that somebody gave the argument that the right of a private enterprise to make money on licensing fees is above other things. Shouldn't there be an argument that the government should make best use of taxpayer money, rather than blowing taxpayer money on licensing fees? And isn't this a good enough argument to reject that? Yeah, absolutely. I mean, I did not understand that argument myself: just enabling some companies to pursue a business model in a certain way does not mean being market-neutral. To the contrary, enabling the government to make software that has the best value for money for the taxpayers should be the first public interest. And that's how the interpretation of the term competitive neutrality changed, luckily. Okay, you mentioned that the law was enforced in 2017, am I right? Sorry, the law, like six years ago? No, no, the law started on January 1st, 2024, this year. Okay, because I've got the question: how often can the security reason be used? Because you mentioned there are two reasons for an exception: third-party rights, and security reasons. So when it comes to, say, the ministry of defense, the software could perhaps be proprietary, but since the law is pretty new, I don't know how this works out. Yeah, so this is exactly the point: the law is very new, so we don't know yet how strongly it will be implemented and fulfilled.
We know that the government is kind of behind on releasing some guidelines; they're right now actually preparing guidelines, but it will still take a few months, or maybe years, to really get it going. Okay, so on the security issue: it should be used very sparingly, right? But there are arguments that can be made, like Bradley mentioned: when you're looking for people who are trying to evade taxes, you can argue that people knowing how the government looks for tax evaders makes it easier to actually beat these algorithms, right? So I think it's a good thing that it's in the law, while at the same time security by obscurity is a very bad thing, and we should use that exception very sparingly. But I think it's good that it's in the law, just from a cybersecurity guy's point of view. And then the actual question: because we have the same argument in Austria, where it's said the government may not publish open source because it distorts the market of proprietary companies profiting from proprietary software, can you summarize what the other side of the argument was, and what claims the lawyers made for why this is not an issue? You mean the pro side, how did they debunk that? Well, good thing that you're from Austria, so I can just send it to you. But to summarize, they just said only real economic acts can distort the market, and they compared it like this: whether it's open source or closed source is like whether you write using a pen or your laptop; it's just a means to help you do your work, so an auxiliary service. And even if it were market-distorting, then you just have to have the legal basis. Okay. Very quickly: I've seen laws in other countries that were open source by default failing, so it's good that you have already clarified that, for instance, security can be
a way to circumvent this. I wonder if there are some regional, I'd say regional, carve-outs: this is federal, so can a city or a canton or whatever avoid applying this law, because it doesn't apply to them? Yes, I have to say this is binding only for the federal government. Sub-federal governments can still do whatever they want. But what I would also expect: usually in Switzerland, at least, the cantons and the non-federal players are also looking at what the federal government is doing. So when they see there are benefits, and there are obviously benefits, otherwise we wouldn't be here, then I hope that people will become more used to actually procuring open source software services and building communities. Okay, let's thank Matthias and Rika.
Welcome to the LLVM dev room
Welcome everyone to the LLVM dev room. I hope the microphone is working. This year we have three organizers, and we'd just like to very briefly introduce ourselves. My name is Kristof Beyls. My name is Peter Smith. And my name is Marius Brehler. We thought we'd use the first five minutes to give a little bit of general information. It's an anniversary this year: this is the 10th LLVM dev room. The first one got started in 2014, and we were here every year except for 2021, when we couldn't find volunteers to organize. There have been quite a few different people who helped with the organization over the years. I've put a few names on the slides; not going to call them out, and I'm pretty sure I probably forgot someone, my apologies. This year is the first time there's also a GCC dev room, and I'm very happy that we're running them back to back, so I'm hoping that enables some cross-pollination of ideas across the two communities. That is very nice to see. Maybe a few words if you're interested in participating in the LLVM project but you're not entirely sure where to start, or if you're a newcomer. I've put a few links here on the slides; I'll very briefly go over them. Most of the communication in the LLVM project happens on Discourse, which is a forum, or on Discord. If you want the links, go to the FOSDEM schedule page; you can download the slides there and just click on the links. The LLVM project has office hours and online sync-ups. Office hours are where an individual expert on something in LLVM makes themselves available on a regular schedule; you can dial in, and any question goes as long as it's on topic. You can just follow the link; I think about a dozen different experts volunteer to do that. If you're an expert yourself and you think this is a good idea, please consider volunteering some of your time, too. Online sync-ups are regular meetings, each on a very specific topic. They're also all documented on the website.
We have a community calendar; I have a screenshot on the left there. You can't read what's in there, but it gives an indication that on pretty much any day of the week there's at least something going on where people can come together, sometimes on a specific topic, to have an interactive discussion. Another way to get started is to have a look at the "good first issue" issues in the issue tracker. This morning there were 148 open; we're now three hours later, so I'm not sure that count is still exactly correct. There's a "Getting Involved" page on the website, which gives you lots of starters on the technical details. LLVM takes part in Google Summer of Code, and also in Outreachy. And if you would like to work on LLVM and get paid for it, there are always quite a few different companies with job openings to work on LLVM. That's all.
elfconv: AOT compiler that translates Linux/AArch64 ELF binary to LLVM bitcode targeting WebAssembly
that translates a Linux/AArch64 ELF binary to LLVM bitcode targeting WebAssembly. So first, I will explain what WebAssembly, or Wasm for short, is, and why we use Wasm. Wasm is a virtual machine instruction set, and currently it is used on servers as well as in browsers, in production environments. Compared to existing application formats, there are mainly two features: portability and security. For portability, Wasm enables us to run applications on both browsers and servers without modification. And of course Wasm doesn't depend on the CPU architecture, so we can run Wasm applications on computers with various CPU architectures without modification. For security, in the case outside browsers, Wasm is highly isolated from the host kernel by WASI. WASI is an API that provides access to several OS-like features, for example file systems, sockets and so on. And WASI is implemented by WASI runtimes, for example Wasmtime, WasmEdge and so on. Also, Wasm follows a Harvard architecture design, so the memory of a Wasm instance is clearly separated into writable data memory and code memory, and Wasm code can access only the writable data memory, which increases security. However, there are some limitations in the capability of applications. First, Wasm can jump only to code that is determined at compile time; in other words, it is impossible to indirectly jump to code generated in the data memory. And second, WASI implementations don't cover the entire POSIX API, for example fork, exec and so on. So when you develop Wasm applications, you should consider these limitations. Now, many programming languages support Wasm, for example C, C++, Go and so on. However, it isn't easy to build for Wasm in some cases, as follows. Mainly there are three cases. First, the programming language that you want to use doesn't completely support Wasm.
Currently many major languages have begun to support Wasm, but only a limited number of languages are usable in production environments now. Second, a binary is available, but the source code of the binary is not available. Recently the number of open source programs has increased, but several programs are still not published. And third, the case where building the environment is time-consuming. If the dependent libraries of the target program are not maintained, you might not be able to build the libraries, and in such a case it might take much time to build. So next, I show existing projects that run ELF binaries on Wasm. The first project is TinyEMU; this is an x86 and RISC-V emulator available in the browser, and the Linux kernel can run on the browser with it. The second project is container2wasm; this enables us to run the Linux kernel and a container runtime with emulators compiled to Wasm, for example TinyEMU, so it can run Linux containers without modification, both on browsers and on WASI runtimes. However, these projects rely on emulators compiled to Wasm that run a Linux ELF binary inside, whereas elfconv is an AOT compiler that compiles the ELF binary itself directly to several binary formats. So next, I will show the demo of elfconv. Can you see? Okay, thank you. Well, I have prepared the container image for the elfconv project, and now, in this terminal, the container of elfconv has already started. The target sample ELF binary to be converted is in the examples directory, and this program outputs the first 100 prime numbers in ascending order. Okay, so we try to compile this ELF binary to Wasm with elfconv. In the directory there's a file, elfconv.sh, which is used to drive the elfconv compilation. So, okay.
So we target the browser with elfconv, targeting this sample ELF binary. Now elfconv compiles it. Okay, great. Several files are generated, and we can execute the Wasm application built with Emscripten. So, run the browser. Okay, the Wasm application has started now. And, okay, wait, you can see the output is correct in the browser. Okay, so now let's return to the presentation. So, in compiling an ELF binary to LLVM bitcode, two modules are used. First is the elfconv lifter: this parses the ELF binary, maps every section, and drives the next module, Remill, which is a library for lifting machine code to LLVM bitcode. As this figure shows, elfconv compiles the ELF binary to LLVM bitcode with these two modules. Next, I will explain how elfconv compiles an ELF binary to LLVM bitcode and a Wasm binary. Remill converts every machine-code function to one LLVM IR function. For example, as you can see, function1 of the machine code is converted to the corresponding lifted function1, and the lifted function is LLVM IR. Also, one CPU instruction is converted to one LLVM IR block: as you can see, the instruction mov x2, x0 of the machine code is converted to one block. Okay. So next, I will explain the details of the LLVM IR block converted from a CPU instruction. There are three steps in the converted LLVM IR block. The first step is the program counter calculation: this figure shows that %29 is the program counter of this instruction, and the next PC is updated to the next program counter. The second step is the operand calculation: in this figure, this instruction uses the x7 and x3 registers, and in the operand calculation, x7 and x3 are loaded.
The third step is calling the function for the instruction-specific operation. For each CPU instruction, Remill generates a function that performs the instruction-specific operation, and the corresponding function is called at the end of the LLVM IR block. So next: as I explained in the beginning, Wasm code can indirectly jump only to code that is determinable at compile time. This figure shows how to deal with the indirect jump, the br instruction. In this figure, br x7 indirectly jumps to the instruction mov x8, x9. For the br instruction, the address to jump to is stored, and we jump to the dispatch block for the br instruction. After jumping to the dispatch block, we get the target label by calling a lookup function, and then we branch to the target block with the LLVM indirectbr instruction. The indirectbr instruction requires all candidate labels as an argument, and this array consists of all labels in the function. But in the current design, the array of candidate labels includes only the labels within the same function, so elfconv doesn't support setjmp and longjmp now; that is a future task. Next, in converting the LLVM bitcode to Wasm, we statically link the LLVM bitcode and the elfconv runtime. The elfconv runtime includes the mapped memory of the original ELF binary, that is, the stack and the heap area for the ELF binary. The elfconv runtime also includes the program for system call emulation. An existing compiler, for example Emscripten or wasi-sdk, compiles these two modules to Wasm. Okay. So, for the Linux system call emulation, there are two ways of implementing the emulation, and the way of implementing depends on the WASI implementation.
In case one, if the WASI runtime implements the target system call, elfconv just uses the WASI function, as shown in this figure for the write system call. And in case two, if the WASI runtime doesn't implement the target system call, elfconv has to implement the system call itself, as shown in this figure for the brk function, which is not provided, so elfconv implements that system call in its runtime. Okay. So next, I will show the performance evaluation of the generated binary. The target sample ELF binary is a simple prime number calculator: this program computes all prime numbers less than the input integer. One thing to notice here is that in this evaluation we are using the generated x86_64 binary instead of the Wasm binary, because in the current implementation the system call emulation for the WASI runtime is insufficient, so we use x86_64 as the output binary for the benchmark test. I'm sorry. The comparison method is QEMU emulation of AArch64 on x86_64: we compare QEMU emulation with the binary AOT-compiled by elfconv. I measured the performance in two cases: in the first case, the input integer is 10 million, and in the second case, 15 million. The performance evaluation is as follows: as you can see, in both case one and case two, the binary AOT-compiled by elfconv is faster than QEMU emulation. Therefore, we can say that AOT compiling is faster than QEMU emulation, at least in some cases. So, okay, last, I will show future works. First, we will support the output of other binary formats: currently elfconv supports only Wasm and x86_64 ELF output binaries, so we will support other binary formats as output. Second, we will enable elfconv to compile ELF binaries of other CPU architectures. Now elfconv can compile AArch64 ELF binaries; in the future, we will support other input binaries.
Okay, so third, we will expand the system call emulation. Now elfconv implements only a part of the system calls, and a lot of system calls are not implemented. Specifically, when targeting Wasm as the output binary format, some system calls are difficult to implement, for example fork, exec and so on, so I think that implementing those system calls is very valuable. Fourth, supporting dynamic linking: now elfconv can compile statically linked ELF binaries, but dynamic linking is an important function and we will support it in the future. And fifth is the performance analysis of the Wasm target: for now I measured the performance evaluation on x86_64, so I should also measure the performance of the binary of the Wasm target. And sixth is making the generated LLVM bitcode more efficient. So that's the current implementation. To answer the question: I translate that to wasm32, sorry, the 32-bit x86 platform. I think that AArch64 ELF binaries are the ones mainly used in the world, so I think support for AArch64 ELF binaries has a big impact. I think that's it, yeah. I'll take the question at the top. Have you considered using rev.ng instead of Remill, if you know rev.ng? I'm a core developer of rev.ng, disclaimer. Sorry, could you repeat the question? Remill is a tool to lift executable code to bitcode; there's another tool which we developed, called rev.ng, that does something similar. Maybe have you considered that, or are you interested in it? Sorry, could you say it again? Rev.ng is an alternative library to Remill. Have you heard of the rev.ng library? He was just asking if you'd heard of the rev.ng library; it does something similar to Remill. It sounds like you haven't heard of it.
That was my interpretation of the question. Thank you. Next question: when you measured, you did a performance comparison between QEMU and elfconv. What did you measure there? I didn't understand: was it the compilation or the running? Well, the compilation performance of elfconv is very long; for this sample ELF binary it takes about one minute for the compiling. Oh, so is it the compilation that is faster, or is it the running of the thing that is faster? Are we measuring the running of the produced result, or the compilation? Which one? And I guess that is for running: QEMU runs it with a JIT that turns it into native code, while you have ahead-of-time compilation for the Wasm that you run on a browser, right? So are you looking at the performance of running in the browser here and comparing that to QEMU, or are we looking at some compilation time? I just want to understand what we are comparing. Sorry, could you ask again after the presentation? Sorry. Thank you. Next: so, you compared the performance of emulated AArch64 versus an x86 binary. Have you also tried, after converting this with elfconv, to convert it back to AArch64 and benchmark that against the original binary? Like, what is the overhead of lifting it? So, the question is the overhead of the binary lifting. Yeah, in the program of this performance evaluation, the overhead of the lifting is very small: it takes maybe three or four seconds to lift the binary to LLVM bitcode.
Yeah, but what I meant is: if you compile the bitcode back to the original architecture, how is the performance of that binary compared to the original binary? So, you mean going from the ELF binary to the target architecture binary, the performance overhead of the LLVM bitcode compiled back to the target binary? Oh, sorry. I'll just follow up on that directly afterwards, but from experience.
Map LLVM values to corresponding source-level expressions
Yeah, it's done. Thank you. Hi, everybody. My name is Shivam, and I work for KDAB. I also worked this summer in Google Summer of Code with LLVM, on this project: mapping LLVM values to the corresponding source-level expressions. But why? So, the challenge is understanding compiler optimizations. Compilers perform different sorts of optimizations, and it's not always possible that your code is going to be optimized, or specifically vectorized. Our motivation was vectorization first, because we wanted to improve the optimization remarks for the vectorization part. So it's not always possible for your compiler to vectorize your code: there can be data dependencies, and that's why your compiler cannot vectorize all the time. In those cases it has to emit good remarks, and I'll show you what clang currently generates as a remark. Understanding why and how these optimizations occur is not always straightforward; even the authors of the vectorizer may not know what's going on if the vectorization didn't happen. So, consider this example. You can see there is a data dependency between A[i] and A[i+3], so clang will not be able to vectorize this loop. Okay, so see this remark produced by clang, which is just: loop not vectorized. You can use a pragma for loop distribution, so the compiler tries to distribute the loop, and it might be able to vectorize it in some sense. But just look at the remark.
It's not clear what actually went wrong here or where the data dependency is. The remark doesn't tell you where the data dependency actually was, so that you could improve the code itself. It's just that one remark, and it's not actually clear. If instead you had a remark with nothing much more than two expressions, the dependence source and the dependence destination, you would know there is a data dependency between those two locations, and if you know the code, you could modify it in a way that makes it possible for the compiler to vectorize. So enhancing the remarks to include the exact source and destination of the dependency, pinpointing the lines involved, would surely help. Let's look at the impact of these enhanced remarks: clarity, because developers can quickly see where the dependencies actually occur and improve their code to make it vectorizable; and efficiency, because they save time by reducing the need for deep debugging to find where the data dependency was. You just look at the optimization remark and you get quite a lot of information: there is a data dependency between these two loads and stores.
So, let's look at the approach we took to solve this problem. The approach was very simple: utilize the debug information available in the intermediate representation to recreate the variable and function names lost during optimization. The optimizations are actually a problem in our case, because we currently don't know how to rebuild instructions that get lost in optimization. For example, if you see a mul instruction in the IR, the compiler might optimize it into a shift-left: the mul was the original information from the source code, but now we have a shl, so we lose the context of what the actual source-level operation was. That's still a problem for us. We tried a different approach for it, cloning the module so we could look at what happened after each optimization pass: keep a clone of the original IR, watch each transformation pass, and cache which instructions changed, for example that the mul became a shift-left. But the performance of that was not good; it was very bad, so we didn't pursue it. So let's see how to utilize the information that is available in the IR. LLVM uses a small set of intrinsic functions, if you are aware of them, provided for the debug information. They carry different metadata as arguments. These intrinsic functions are prefixed with llvm.dbg, and they help you track the
variables through the optimizations and code generation. If you dump your IR having compiled with the -g flag, you will see calls to llvm.dbg.value or llvm.dbg.declare; those contain everything related to the source level. They carry metadata, and the metadata can give you a lot more information about what was actually in the source, for example variable names: when you trace the metadata, you can get the variable name from the actual source. For us these two intrinsic functions were very important: llvm.dbg.declare and llvm.dbg.value. Let's try to understand them a bit. You can see %i is allocated, and just below it a call to the intrinsic function llvm.dbg.declare with three arguments. The first always represents the address of the variable. The second is metadata pointing to, for example, a DILocalVariable node, which contains the variable name; the actual name was i in the source, and when you trace back through the metadata you can retrieve that name, so the second argument really helps us a lot: it carries the source information, like the name. The third argument is a DIExpression, and a DIExpression is generally useful for complex expressions: if you have an expression like int a = b + c, a DIExpression can hold that kind of thing. So llvm.dbg.declare is used for declarations, and llvm.dbg.value is very similar: when a value gets updated, the update goes into llvm.dbg.value. We now have enough information to at least try to build the source expressions, but only if the
code is compiled with debug info on, that is, with the -g flag. We use the intrinsics as a bridge. Our focus was on memory accesses and vectorization, as I said: we really wanted this project for vectorization first, and then we also have a plan to push it into debuggers, so debuggers can utilize this information too, but the main goal initially was the vectorization pass. Vectorization is a transformation pass, and a transformation pass can always query an analysis pass; our work is an analysis pass, so the vectorization passes in LLVM can query it. So the project's contribution is that we have built an analysis pass that can generate these mappings and provide better remarks for vectorization or anything else that requires them. Let's look at the implementation details. For us the instructions of interest are loads and stores, because of vectorization: we want to analyze the memory access patterns so they can be emitted in the vectorization remarks. For example, take a look at this C code. If you compile it with clang -O2 -g and emit the LLVM IR, just to show you what's going on (I think it should be visible), you can see the calls to the debug intrinsic functions; we can build these expressions from them. As I said, the first operation was to multiply n1 by 2, but we compiled with optimizations on, so the multiply instruction went away and was replaced by a shift-left. That's why you see shl here and not mul. That's a problem for the accuracy of the expressions, because we still don't have a good plan for how to accurately generate them when these things disappear due to optimization,
and it has always been a hard problem how to debug once optimizations have been applied; it's a classic problem which we still have to look at. You can see that we can build these expressions from the instructions. As the example shows, computing the equivalent source expressions of interest involves walking through the LLVM IR and using the information provided by the debug intrinsics. Even though our current interest is loads and stores, we still have to handle every kind of instruction, because when you trace back from a load or store, its operands may be any instruction: a binary instruction, a GEP instruction, anything, and we have to build expressions for those too. And as I said, optimizations can make it impossible to recover the original source expression: for example, 2 * n1 is optimized to n1 << 1, so recovering the original expression may not be possible every time. Let's look at how we proceed; it's just a basic algorithm I want to go through. We start by traversing the IR and identifying the operations of interest, currently load and store instructions, so we specifically look for those in the IR. Then we trace their operands, which may be other instructions or may sit inside metadata; from the metadata we can retrieve information like names, and utilizing that metadata we build the source expressions, reconstructing them from all the information we gathered. That's about all of it, so let's look at the current state. It's not yet upstreamed to LLVM; the PR is here. What I need from you, anyone who has experience or is active in the area of optimizations, analysis passes, or transformation passes in LLVM: I'd
like you to have a look at the patch. If you have some experience, try to review the code as well and give some feedback, so we can proceed in more detail, because it's still a new analysis and still needs a lot of work, for structs as well. As I mentioned, we need more review on the patch, and some active work from me as well; if any of you are interested, please reach out. As I said, structs pose a unique challenge: when we tried to build expressions for structs it was not easy at all, because of how they are represented in the intermediate representation; it's not as simple as building expressions for an array. So structs are still a problem, accurate source-level expressions in the presence of optimization are still a problem, and there isn't always a one-to-one mapping between the source code and the IR. For example, if you have a pointer ptr and ptr[0], the IR can generate the same code for these two patterns, and we don't know which one to pick; that's still a problem. One solution for this is that the debug information also contains information about the file: there is a DIFile node in the debug info, so we still have the file path. What we could do is actually open the file, go to that line, and retrieve the actual ground truth. The second option was to just fall back to either of them, because we don't know which was originally there. The DIFile approach is actually quite easy, but it's not good performance-wise: opening the file and going to that line to retrieve the text is expensive. So yeah, that's it for the talk, and thank you for listening. If you
are interested in knowing more about this project and the algorithm, please reach out to me by mail or, for example, Discord. Thank you. Any questions? [Audience] Why do you need to rebuild the entire sequence of expressions for each of the values? Why not just report the specific values involved in the dependence and the corresponding line in the file? You know, when you emit remarks, there's a tool called opt-viewer that puts everything inline; between that and what you have here, it seems you would get excellent results in terms of debuggability if you just did what opt-viewer does, plus specified which values are causing the dependence and the reason for the failed optimization. [Shivam] Okay, so the question is basically about using opt-viewer, right? We are still not reconstructing everything; we are not focusing on mapping the whole IR to expressions, we focus on those loads and stores, as I said. We pick up the loads and stores, and we check whether there are any GEP instructions, because a GEP instruction actually contains a chain of instructions. But we still have to build the expressions for loads and stores, and opt-viewer is still not good at emitting those remarks; it's still very abstract in that sense, if I remember correctly. So I'm not sure how to go with opt-viewer, but we are building this for loads and stores and tracing back the information. [Audience, inaudible] Not sure, but one thing I can guess: opening a file is not something that is very good performance-wise, and neither is going to that particular line, because there could be many lines of code in the
code base, so you have to go to that particular line; it would be very bad performance-wise, I think. [Audience] And is there no theory about whether it would be more beneficial to tell the programmer that the error, or the suboptimal choice that was made, was between lines 27 and 28, compared to generating some arbitrarily complex expression that might not be representative of what the programmer originally wrote? [Shivam] I'm not sure. I think it would be fine: if you're choosing to emit such remarks, you know this is not good for performance, so if you want the actual correct remarks, you have to accept going deeper on the performance side; then it would be possible. We have also been talking about preserving the metadata in LLVM as it goes through the passes, but LLVM metadata is designed in a way that it can be dropped at any time, so we still cannot preserve the metadata information; that's still a challenge. Okay, thank you. [Host] Thank you for joining; when you leave, make sure to take everything with you.
The Matrix State of the Union
Okay, so this is the Matrix Dev Room at FOSDEM 24; in case you are in the wrong room, take the chance now and leave. We have an afternoon packed full of information about Matrix. It's only an afternoon, so if you want to look up more information, there is an internet full of it. But if you are lazy and don't want to collect all the information yourself, then there are these wonderful people: they collected the information for you and will now give you a presentation about the state of the union. Matthew and Amandine, give them a warm welcome, and the stage is yours. Thank you, Jan. So we honestly weren't sure what to talk about, because if folks came to the main-stage talk in Janson this morning, basically the first 25 minutes was the state of the union of Matrix. So we have a bit of a question mark on the subject here. Also, Jan just promised that we will transfer the contents of the internet into your brains, which we also hadn't really prepared for. Anyway, if you don't know who we are: I am Matthew, the technical co-founder side of Matrix, day job CEO at Element. Amandine, the non-technical co-founder side of Matrix, day job COO at Element. But we would like to at least try to tell you something new about what's going on here, and we actually realized that we have never done a brief history of Matrix, which begins before many of you were born, in the year 2003. Now seriously, the actual backstory is that a bunch of us were at university together at Cambridge, messing around with instant messaging on a project called Project Foxtrot. The idea of Foxtrot was that it was written in Java 1.3, fresh off the press or something at that point in the late 90s, and what it did was serialize hunks of Java and send them over TCP sockets, except it was end-to-end encrypted using manually written Diffie-Hellman and RSA exchanges.
So that is where I at least got the bug for Matrix and instant messaging, and after we either got kicked out, left, or graduated from Cambridge, we ended up working at a little company doing APIs for the PSTN. So that's 2003. Fast forward rapidly to 2010: my company was doing mobile app development, and Matthew's company and mine both got acquired, about a month apart, by a big telco vendor. You would find us in the depths of AT&T doing all their billing systems. So: small startups having fun getting into a very big company. After a few years of rattling around inside Amdocs (I'm not sure why we weren't mentioning Amdocs by name, but it was Amdocs), we discovered a newfound desire to burn the phone network to the ground, annihilate it, and replace it with something open, decentralized, and federated that anybody can join, rather than the cabal of the phone companies, where it's almost impossible to connect in. And that is where the idea of Matrix came from. We basically took the combined folks in Rennes and London, went to Amdocs and said: hey, a little bit of a crazy idea, but what if we build an entirely new communications protocol? If we pull it off, then you, my friends at Amdocs, can go and sell it to AT&T and many other big telcos, and you can replace the PSTN. And meanwhile, the rest of the world would get the big benefit of the existence of Matrix. Amazingly, they said yes, with no strings attached. They allowed us to switch the business unit from selling clones of WhatsApp and Skype to telcos to instead building out Matrix, and that's what we did, starting depressingly in May 2014. So we are a couple of months off having been doing this for 10 years; not sure whether that's something to celebrate or not in the grand scheme of things. What happened in 2014? We all gathered in Rennes, sat down, had a big brainstorm on how this thing would look, and ended up with mostly what Matrix is today.
Not much has changed in terms of the overall idea and architecture. We started in May, and the goal was: September 2014, we're going to launch this. Four months to figure out a high-level working Matrix proof of concept, and we did it. Yeah, it was a disaster, really, because we rushed at incredible speed. It was like the best gig possible: your day job suddenly tells you and all your mates at work that you can go wild and create something like Matrix, and everybody sprinted in slightly different directions, naming no names. We might have ended up with three different versions of Synapse at first: we had the bit that spoke the client-server API, the bit that spoke the server-to-server API, and the bit in the middle that was meant to funnel stuff around the place. Each one had a different database schema; each one had a different object model. They were all written in Python, which was honestly a win, but it's possible we sprinted a little too over-enthusiastically into this and spent about six years paying off the technical debt that we accumulated in those three months of run-up to launching Synapse. Worth noting the end-to-end encryption wasn't there on day one, but we did start it in 2015, and we always designed it as part of the protocol, because if you are going to replicate data equally over many, many home servers, obviously it needs to be end-to-end encrypted, such that if one gets owned, all the messages don't go out the door. Then in 2016, the slide says we launched Element. I'm not sure where I pulled the slide from, but it definitely wasn't called Element. Basically, when we launched, we were using Matrix Console at the beginning, and then we said, okay, we need a very glossy app to actually drive usage of this. We launched something which became Element at some point but was initially called Vector. What was the second name? Okay, let's quiz the audience: what was the second name of Vector, before Element?
Yay, well done, and here we go. Element is now the flagship client and still growing. Eventually, in 2017, we set up shop as properly independent: we started with the commercial company Element, and also, a bit later, in 2019... yeah, technically the Foundation was incorporated in 2018, but we didn't do anything with it until 2019, to try to make sure that there is a clear split between governance, the open-source project, and the protocol, versus us practically trying to fund the bloody thing: Element running around doing commercial stuff. That was the point where things started to split properly into your classic open-source foundation versus a startup trying to build stuff on top. We eventually turned on end-to-end encryption by default in 2020, alongside Matrix 1.0, which I guess was June 2019, and then fast forward to 2023, when we announced the idea of Matrix 2.0, as showcased at FOSDEM last year. And here we are today in 2024, the year of mainstream Matrix. Who knows; if you saw the DMA bit of the talk earlier, it may or may not be coming. But yeah, we'll talk a bit about it, and Travis afterwards is going to give an amazingly, very deeply technical talk all about everything you wanted to know about the DMA. I haven't asked permission from the other people in this photo to put it up, but this is the original Matrix team on our way to Rennes from the London side, playing Magic: The Gathering or something. No, it was all of us. At that point the front-end side was all in Rennes. Yes, because we hadn't got to France yet.
We're literally at some crappy Travelodge, I think in Luton or Gatwick or somewhere, on the way through to Rennes. So yeah, basically that was the vibe at the beginning of Matrix back in May 2014, and more of the vibe is this, which was the whiteboard in the Jupiter project room in the offices in Rennes, where we drew up the possible architectures that we could use for Matrix. You will notice that there are four, if not five, architectures here. The simplest one is just client to server to client. This was almost just mapping out the various options we had on the table, because at that point we hadn't really decided how decentralized it would be. Then we had the one that, honestly, I came into this with, which assumed it would be a little bit like SMTP and IMAP, or just SMTP: your client would talk to a home server that would cache the rooms, which would talk to another home server, a single point of failure, which would talk to a client. It's a bit like a MUC in XMPP; sounds pretty easy. What I did not expect was for some of the folks on the previous slide to turn up looking really excited, saying: you know, I think we might be able to do it so that we can actually replicate this between the home servers, which I christened at the time "the distributed sync nightmare": an active-active replicated version of the protocol. Then there is another one down here where you've got two inboxes that sort of synchronize-ish together, but you basically have queues rather than DAGs. And you had this one, which has got lots of double arrows, and... oh, it's a mix net, I think, is basically what that was about: you'd have a personal home server and a bunch of relays, trusted or maybe not trusted, I don't know, I can't remember, it was 10 years ago. But either way, that was the level of whiteboard diagram that we were playing with at that point.
So basically, as Matthew said earlier, fast forward almost 10 years: 2023 was very much about focusing on getting the basics to work well, thanks to the limits of funding, which is good sometimes; if you have a bit less money, then you do focus on the things that matter most. So we have paused a lot of things. How do you want to do this; do you want to go through the list, Matthew? No? Okay, I will do it. The focus was very much on Matrix 2.0: Synapse, the SDKs, the Rust and JS SDKs. Otherwise, peer-to-peer Matrix is on the side, pseudo IDs as well, crypto IDs, account portability. However, we still hope that very, very soon we'll be able to get back to all of this, low bandwidth as well, and some of the "done right" work funded by Element. The legacy Element apps and the SDKs they are based on are on bug fixes only, and hopefully we'll be able to switch everything to Rust soon, and replace libolm as well, now that we have vodozemac taking over. And Third Room is on the side, waiting for someone to take it and bring it up to all the power it could have.
Yeah, Third Room is particularly frustrating. We got an email from the W3C after we announced that we'd had to lay off the team at Element who were working on it, and that nobody had picked it up, saying: what, this is meant to be our promised land of WebSG, the web scene graph API that we created; I thought this was how the future of the spatial web (as Apple would call it) was meant to be. And I said, well, I'm really sorry, but we literally could not find anybody to fund it whatsoever; even people like Rolls-Royce, who promised that they really needed this and would fund it, proceeded first of all to lay off the team we were talking to on their side, and then didn't fund it at all. Anyway, it's been a really fun year. That said, I'm going to disgrace myself, as you'd probably expect, by talking a little bit about the projects which are shelved, because it's really frustrating that an awful lot of work went into them last year until, around November, they got forcibly parked. One of them is pseudo IDs, MSC4014. This is the project to replace MXIDs with arbitrary identifiers per room. The reason for doing this is, first of all, GDPR: at the moment, MXIDs get baked into the conversation history of your room, and they are things like @matthew:matrix.org, whereas if you had a different unique identifier on a per-room basis, that problem goes away, and it's up to me whether I want to publish a mapping of my Matrix ID onto the sender key or not. The idea of this MSC is that it works out of the box with existing clients, no code change needed, because the CS API maps the sender keys back to MXIDs when it hands events to the client. However, this does not provide account portability; it's just replacing the MXIDs. It got implemented in Dendrite in June of last year, and if you're feeling particularly creative, go and turn on the feature flags in Dendrite and have a play with it. But as I said, unfortunately it is currently on ice. I'm not going to force Amandine to do the crypto ID
one for the sake of alternating slides. So, crypto IDs: an extension of pseudo IDs, highly experimental. The idea is that your sender keys become your end-to-end encrypted identity, so we finally unite end-to-end encryption in Matrix with the idea of your MXIDs. When you join a room for the first time, you get a crypto ID generated for that room. Interestingly, and perhaps controversially, your client then signs everything it does, all the events, with the crypto ID, so that you can prove that you own those events, and as you move between servers in future you can prove that an event came from me as an individual, Matthew, rather than its being signed by your home server, which you don't really care about if you're migrating between home servers. This has the dubious side effect that we no longer have cryptographic deniability, because by definition you would be able to see that a given client owned by a given user sent a given message. So there's an interesting trade-off to be had there. Right now we do technically have cryptographic deniability, but practically speaking it really depends on the trust model; I'm not sure just how useful it really is other than on paper, whereas this would obviously throw it away. Again, implemented in Dendrite, and it was just being drafted in the Rust SDK when it got shelved in November. The idea is that if you take pseudo IDs, add crypto IDs, and add some magic glue, which probably means storing account data in a room so it can replicate between servers, then you would have client-controlled account portability. It's also a prerequisite for peer-to-peer Matrix, which is likewise currently on hold. How am I doing on time, Jan? More than 15 minutes? Okay, don't worry. Can I do a demo, then?
Okay, so whilst we're talking about dearly departed projects (I know this is probably going to piss off a bunch of people), I really want to very briefly show the final bits that the Third Room guys did before the project got killed. So here's Third Room using OIDC as... oh, that's a great start; this is what happens when something has been busy bit-rotting away. Let me try to sign into this using OIDC, because Third Room was the first thing that we used to test out native OIDC. We might have to wait a minute for the server to wake up, because I haven't logged in very recently, so this is definitely a dangerous demo; talk amongst yourselves. Imagine that the server is actually working here... which it is. Right, so where we left you last year at FOSDEM was that this thing had just launched, and the next big thing was to make the whole thing scriptable and do fun stuff with it. It got to the point where you could enter a world like this, and this is just a Matrix room, with the world data stored in glTF itself. What Robert and AJ implemented is that if you press the tilde key at any point, you get an in-world inspector up; you can select things like buildings, and you can move them around and manipulate them in real time. I think I showed this last year. The next thing, though, was to make the entire thing scriptable via Wasm, so you have a script editor built in here now which gives you a little bit of JavaScript. What you can do is grab something like the buildings and drop it straight in there, and it writes the JavaScript to grab the buildings; then, every time the world updates, you get a delta timestamp and an absolute timestamp. I don't know this API off by heart, but let's assume it has a translation property and say that y is going to be, what, 10 units times the sine of the current timestamp. That will work, right? And if you hit save and run, what it
will do is compile the JavaScript down to Wasm using QuickJS, written by the amazing Fabrice Bellard, and reload the world, and there you go: the buildings dance up and down. I think this is so cool. You can see why the W3C got in touch afterwards saying, hang on, this is how the future of the web is meant to be, and where are the people? And it's like, well, this is what it is. So if you're watching this and you think this deserves to exist: first of all, I'm not sure I'm ever going to persuade the guys to work on it again, because they feel pretty pissed off, obviously, that the project collapsed, but the code is all still there, and it's so tantalizingly close to being absolutely amazing. Right, sorry, back on to what we were talking about. Crypto IDs, or... yeah, what's next? Matrix 2.0. I mean, who was in the talk in Janson this morning? Do I need to go through this again? Oh crap, yeah, only about half of you. Perhaps we should have done that at the beginning of the talk. Anyway, right: Matrix 2.0, very quickly. First of all, this is not a spec release; this is a state of mind, a bit like Web 2.0. It's made up of various MSCs, and here's the status. Sliding sync, for instant launch and instant login and instant sync: it kicks ass, but it's too fiddly. We are currently performing a "slidectomy", which is the technical term for removing the sliding bit from sliding sync, and there is in fact a PR against the Rust SDK which basically shifts all of the ordering onto the client rather than doing it on the server. This is all my fault for being stupid and over-enthusiastic, going and trying to do this over-optimized implementation where the server figures out the best possible ordering and then the client tweaks it at the end; it turns out that having two different things fighting over control of the order of a list doesn't work very well. So we've basically said the client gets to order it entirely; the server does a very approximate, probably timestamp-based ordering, and
the good news is that it is just a subset of the current API, so it's not a wholesale rewrite; it's basically simplifying the API so it's easier to implement. Then you've got end-to-end encrypted VoIP, which again kicks ass. We demoed it in Janson and it worked this morning. We need to update the MSC, because it's on its sixth or seventh iteration now, and I think it's stabilized enough that we should actually spec it properly. Faster joins: Synapse rapidly joining rooms on other home servers (and other home servers doing the same, for that matter, if they implemented it), incrementally lazy-loading the data in. It would kick ass if we actually finished it: we got the hard bit done, the infrastructure, and made rooms non-atomic in Synapse, but then didn't actually get to the point where it goes significantly faster. And then OIDC, which does kick ass, but it's going to be a big migration, as we need basically everything to support it before we start turning it on for matrix.org etc. There is lots of stuff in progress; if I have more time I'll try to show the QR single-hop login demo, which is super cool.

Then the Rust SDK is the brave new world that wraps it all together on the client side, and as of Friday, as I mentioned in Janson, the JS SDK, and therefore Element Web and anything else using the JS SDK, now uses the Rust SDK for crypto. So we are finally at the point where the old libolm C++ library is in maintenance mode, whereas vodozemac, the Rust implementation, is our brave new future. And a spoiler: Damir has already produced a post-quantum draft PR for vodozemac using the Kyber primitives wrapped around, I think, Curve25519, so a kind of hybrid approach, which should be compatible with Signal's PQXDH key exchange stuff. And what else are we doing in vodozemac? There was another big thing, but I can't remember what it was, another PR that landed. Basically, we fixed all the crypto bugs in one place, and a huge, huge focus in the coming months is making the crypto finally suck a lot,
lot less. Should I keep going on MLS? You can do the whole end bit of it. Okay, on MLS: people might be wondering, hey, he's not talking about MLS any more, what's that all about? First of all, we are still doing this; you can track the progress on arewemlsyet.com. MLS is the group encryption that scales much, much, much better than the normal Double Ratchet, and in vodozemac we have it largely working. On Matrix it has huge key bundles; you have to store the keys in the media repository, they're so big at the moment. However, there's been a lot of discussion on the MIMI side, which we'll talk about briefly, and which Travis will talk about a lot more in a few minutes, in terms of: what if you actually used MLS to synchronize everything? So rather than having the Matrix DAG for synchronizing data between servers, what if you just chucked everything into MLS? TBD. So there's a bit of a debate going on: do you put MLS over Matrix, or do you put Matrix (or MIMI) over MLS? Right, your slides.

Yeah, basically, as we said at the beginning, 2024 could really be the year our prediction comes true, and the prediction was this. This is a slide taken off an investor pitch deck from 2019, saying "in five years everyone will communicate over Matrix"; that's why we did this, right? In 2019 it said ten years, and now, because we're five years later, it says five years. Just saying. Also, this is written in R, and it's real traffic from 2019 showing, I think, the top 100 home servers talking to one another. Just saying: if your investor decks aren't written in R, you're doing it wrong.

So, basically, killing email and the phone network. Why the Digital Markets Act? You may have heard of it: it demands that the big communication services, called gatekeepers, actually interoperate with the rest of the world. Two of them have been named so far, WhatsApp and Facebook Messenger; iMessage is pushing back, saying no, no, no, we're not a gatekeeper, but let's see where it goes.
To the business? Yeah, business to user. So, it comes into force on the 7th of March: they will have to actually expose these APIs as production-ready, and anyone in here who actually wants to interoperate with WhatsApp, because they don't want to create an account there, will be able to come to them and say: hello, can I please integrate against your APIs to talk to your users?

It's a little bit ironic, because it starts to look an awful lot like the PSTN, in the sense that you have great big telecoms providers, and you go to someone at AT&T and say, hello, please can I talk SS7 to you so my little telco can talk to the big telco, and they make you sign a massive contract, and there's all sorts of back and forth to happen. Obviously we can't say what that will look like with Meta, but there could be an entire spectrum between open federation and closed federation and everything in between, and we just don't know what will happen. Let's see in a month. Basically, we may get to the point where Matrix becomes the glue between all the communication systems, matrixing them together. Yeah, I mean, I'm not counting on it, honestly, on this architecture particularly, because everybody would need to agree both on the same dialect of Double Ratchet as well as the content payloads within, but you never know: if we get critical mass in some places, perhaps everybody will follow.

So, yes, we mentioned it this morning already: still a lot, a lot, a lot of things to do, especially on the core, and making sure the core is funded. We're trying to put out a big call for fundraising, and honestly the goal is really to get the big guys, who are actually using it for hundreds of thousands or millions of users without contributing a cent to the project itself, funding the core. Trying to raise the alarm.
At the same time, there is a public policy devroom where we're trying to figure out how we get open source projects actually funded, so I'm going to run there to try to solve that problem very shortly after this. Cool.

Thank you, guys. So, thank you to everyone who is supporting it, and to everyone who jumped in live this morning during the talk to become a member of the Foundation, and thank you to all the, how do we call them, organizational supporters as well, in here. Yeah, honestly, if your organization just happens to use Element and Matrix as its comms system, it really doesn't cost that much to put some money behind the bar to keep it going. Like, we met XWiki on Friday and said, oh, how's it going, and they said "stuck notifications are the bane of my life", and we said, oh well, if you actually want us to have more people to go and work on stuck notifications, perhaps you can become a silver member of the Matrix.org Foundation. And that is why there is an XWiki logo, and a CryptPad logo, on the slides there. Seriously, it's meant to be relatively modest, but if we get all the organizations doing it as well as the individuals, then, if nothing else, it's a lot easier to go to the really big people like the EU and say: look, we've already got 800 people supporting this, this is an important thing, it matters, therefore you should match it 20-fold, 50-fold, 100-fold. And as a narrative it may work.

So yeah, meanwhile we have an awesome community, a lot, a lot, a lot of things are happening around it, and this is the menu for this afternoon, where everyone will be able to tell us a bit more about what they're working on. Looking forward to it, thank you everyone.

Any questions? I'm allowed two questions, but they can't come from me. Kim. Excellent question. So the excellent question, which I shall repeat, is: where the hell is multiple account support in Element?
Now, most of the rest of the clients out there have it already; however, we've never got round to it in either Element or Element X. There's no good reason for it, other than everything else taking slightly higher priority. We did have it in Matrix Console, the very first Matrix client that we wrote, before producing Vector, then Riot, or whatever it is now. Yeah, no good answer other than we need to add it, and Element X would be a great time to do that: it's built with it in mind, we just haven't put it in the UI yet. Everyone can do it apart from Element, pretty much.

Is there any indication that other assorted governments are looking at following on with something similar? So, the question is whether other governments are going to take inspiration from the Digital Markets Act. There are some movements in the US around it; I'm trying to remember the name, it's not the "interoperability bill" but something along those lines. So there is definitely movement, a bit like GDPR, which has since been looked at by the US: Europe is leading on these sorts of things, and yeah, there is movement in that direction.

So yeah, the question is whether we are lobbying within the EU to make the APIs the gatekeepers offer open. The DMA is forcing gatekeepers to open their APIs, and the big lobbying we've been doing over the last four years has been: please, please, please don't ask them to only open their APIs, but try to converge towards an open standard, so that small companies who want to integrate with these people don't have to build a polyglot messenger which speaks WhatsApp, Facebook Messenger, Google, blah blah blah, all of them in parallel. Please, please, please. So far we don't really know what this is going to be; in the law, in the text of the DMA, it doesn't say you have to use an open standard, but basically we continue working with everyone, the European Commission and the gatekeepers and all the big corporations, trying to convince everyone that that's the best way to go.
I think the US equivalent of the DMA is the American Innovation and Choice Online Act, maybe one of Senator Wyden's initiatives perhaps, but you don't need to go and look it up. And there was something I wanted to add to that, but I've forgotten what it was... the lobbying, trying to get everybody onto a single standard: it would have been amazing if we'd persuaded the Commission to basically put into law "you have to speak an open standard". The reason they didn't is, first of all, it's not really their job as politicians to dictate the actual technical implementation; they need to say what the outcome should be, like a more competitive environment without massive anti-trust behavior, but it's up to us, literally up to a lot of us in this room, to figure out what that should really look like. And the other big problem is that there isn't a standard that is suitable. Like, Matrix is great, it's looked after by the Matrix.org Foundation, but that's not an internationally recognized standards body; if we'd gone through the IETF already, then perhaps that would have worked. It's not like they can just point at the wall and say "use Matrix", even though it has some traction. I would say this is an amazing segue to Travis's talk right now, unless there are any other questions which I can cram in... but I'm not allowed to. No? Thank you very much.
Let's talk Matrix between Governments and Citizens
Hi, thanks for coming. Who here is working in government? Of all the people in this room... I bet it's only a few. Oh, maybe 10% or something. Yeah, quite a lot. Okay, so, hi, my name is Marco. So let's talk about who I am and why I'm here. I've been active in the FLOSS community for about 10 years now, with contributions to Signal and Dino, and also projects in wireless mesh community tooling. I have a background in IT security, and my current project over the last three years has been building state-of-the-art infrastructure for the public administration in Germany, in a German federal IT agency. And yeah, we think a lot about how we can improve our infrastructure, especially in Germany, but we also try to think out of the box, out of the border. And this is why Matrix is very interesting for us.

But let's start with what public administration does in Germany and other countries. The government provides a lot of services, ranging from healthcare services to social services; there's dog tax registration, for example, and there are housing benefits. In Germany, there are 575 service categories with a total of 13,000 individual services that the government offers on different federal levels. So that's a lot. And what we also need to think about is that the government has a monopoly on these services. If you want some housing benefit, the government is probably the only institution that will provide you with this service and this support. So if you want to receive these services, you need to go to your local government. And this is why it's important to look at how these services are designed, how they work, whether they are privacy-friendly, whether they're usable, et cetera. So in my opinion, it's very important to have a look at the tech stack behind these services, and also at the privacy, usability, and accessibility aspects. So how do we apply for these services?
First, there's the option that you don't have to apply for them at all: the government starts the process by itself, for example, sending you money for your child benefits. The government usually knows everything about your child when it's born, and then it could theoretically send your child benefits by itself. That's what we call proactive government. That's usually a neat thing, but it doesn't work in all cases. For example, for registration at a kindergarten, it's probably a good idea to ask people where they want to bring their kids, or whether they want to bring their kids to that specific kindergarten at all, instead of just distributing the kindergarten places; that wouldn't be great usability. The second option: you can always, and probably already have, done this offline. You can go to your local city hall. That works for many people, but for many others it's kind of inconvenient. So the third option comes naturally: you can apply online for these services, for example via an app or via a website. And I'd like to look at this third option in a bit of detail.

Okay, let's start by requesting some government services via a web form or via a mobile application, for example. That's comparatively easy, because the government's websites are public, you can just find them online, and the contact details of the government agencies are also public, including their private... sorry, their public keys. Hopefully not their private keys; sometimes those are public too, but that's not by intention. So you can just encrypt your application form and send it to the right government agency, and you're basically done. That's comparatively easy. Then, usually, hopefully, the government responds. But the person that applied for the government service may have already left the website, or uninstalled the app where they applied for the service.
So that's a bit harder, because the contact details of these individuals, of us, are not publicly available, and that's by design: we don't want that. But we also don't want to force people to install some random application and keep it installed for a longer period, or even at all. There should be different ways to access these services. We can't just hope that the app is still installed so that we can send people a message via this app, for example.

So let's have a look at how industry solved this problem. Here, for example, banks and some insurances put online mailboxes in place. That's usually very easy, because they just store the plaintext messages on their central servers and provide a web interface, or an interface via an app, to retrieve them. That might be okay for some banks and insurance companies, because they already know everything about us anyway; it's their service, they're directly communicating with us. Still, it's not really end-to-end encrypted here; but since the two ends are the bank or insurance agency and the people, that's okay in some way. If we built this for all people and for the whole country, though, we would definitely need encryption. We have local government agencies that want to communicate with the people, and there's a large amount of information, a lot of different services being provided. We don't want a central server that stores all this information, the applications and the responses to them, online on the server.

So how did government agencies solve this? To summarize: mostly, they did it very badly. We've seen a lot of data leaks in the past years, and I think there must be a way that doesn't include any risk of data leakage. These are just some examples I found online; there are probably a lot more issues. And this is not a European problem, this is a global problem.
You can find governments on basically every continent that lost personally identifiable information of all the people in their country. So there must be a better way.

Let's have a look at how the German government has solved this issue to date. We have a lot of different online mailboxes. There's ELSTER; those of you from Germany will probably know it, it's a big application for paying your taxes. We have so-called De-Mail, a German email variant that is supposed to be super secure; it's basically some regulation on top of standard email protocols. We have BundID, a central identification service that also contains a mailbox. Then, in the justice context, we have a lot of different mailboxes that are somehow interoperable. But none of these really follows security-by-design principles or a zero-trust approach, and this poses a huge privacy and security risk to highly sensitive data. So this might explode somewhere, sometime.

In fact, there have already been incidents in Germany too, of course, as in other countries. For example, since 2021 we have so-called digital health apps, and they got analyzed by zerforschung, a collective of IT security researchers in Germany, who found that these apps leaked personal data of more than 20,000 people. That's especially problematic because in the healthcare sector there is often very sensitive information that might get leaked. We also had a recent leak in the justice domain: the justice mailbox leak last year, where between October 13th and November 9th a directory with personal identity data was publicly accessible due to a config error. This shouldn't have happened at all; there should have been technical measures in place to make sure it couldn't happen. That's especially bad in this domain.
For example, if stalking victims use this mailbox to contact the courts, it's really not a great idea for their personal information, including their address, to be publicly accessible.

So let's talk about solutions. I brought this vision here: what if communication between governments and citizens was easy, reliable, and encrypted? And since we are in the Matrix devroom: yeah, let's take Matrix to the rescue. Matrix already provides end-to-end encrypted messages. It provides multi-device access from apps and web applications. It also provides access via third-party apps and services, for example corporate IT services or e-government apps, et cetera. This is all possible using the Matrix protocol. So why not build a Matrix-based secure communication channel between citizens and governments? That's exactly what we are planning to do: we want to integrate Matrix into Germany's national identity system. The first challenge will be to build a proof of concept this year to demonstrate that this is technically possible. There are some technical questions we want to explore here, and usability issues will be looked at too.

In general, when we do this, we of course want a great user experience. So what do we need for that? We need polls and multiple-choice questions. We need push notifications and status updates. We also need machine-readable data: machine-readable polls, for example, would make bi-directional interaction with the public administration easy. That would be an interesting thing to look into. Image and document uploads might also be a feature. And the neat thing is that Matrix already comes with these features built in, so there's not really much to build on top; we can just use this and go from there. Of course, we also need a great developer experience. That's something most government projects don't really think about.
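As an aside on the machine-readable polls mentioned above: Matrix already defines poll events (MSC3381), which use namespaced field names. The sketch below only illustrates the general shape of a poll-start and poll-response pair; the event types and flattened field names here are hypothetical simplifications, not the spec's actual schema.

```javascript
// Illustrative only: a simplified machine-readable poll for a government
// service interaction. Real Matrix polls (MSC3381) namespace these fields.
const pollEvent = {
  type: 'org.example.egov.poll.start',      // hypothetical event type
  content: {
    question: 'Which kindergarten would you like to register for?',
    kind: 'disclosed',                      // answers visible while open
    max_selections: 1,
    answers: [
      { id: 'kg-north', text: 'Kindergarten North' },
      { id: 'kg-south', text: 'Kindergarten South' },
    ],
  },
};

// A citizen's client answers with a response event that references the
// poll via a standard Matrix relation.
function makeResponse(pollEventId, answerId) {
  return {
    type: 'org.example.egov.poll.response', // hypothetical event type
    content: {
      'm.relates_to': { rel_type: 'm.reference', event_id: pollEventId },
      answers: [answerId],
    },
  };
}

const response = makeResponse('$poll123', 'kg-north');
console.log(response.content.answers[0]); // 'kg-north'
```

Because both sides of the exchange are structured events rather than free text, the administration's backend can process the answer automatically, which is the point being made about machine-readable data.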
But I think especially here it's very important for us to have some SDKs in place for developers working on IT systems inside the government, but also to help build apps and an ecosystem of apps and services, citizen-facing and company-facing apps, for example. That helps us with development speed for government services. So again, what does Matrix offer us? We get great usability, especially compared to email-based systems. We get tried-and-tested security: it exists already, the protocols are known, and we don't reinvent the wheel security-wise. It's interoperable, it's easy to integrate, and it's ready to use in the real world; many features are already there in the Matrix specification.

Some strategic thoughts: end-to-end encrypted communication, in my opinion, is a key enabler for seamless e-government services. We need this anyway. It will also enable us to build a really privacy-preserving realization of the so-called once-only principle, which lets governments reuse already-submitted data and documents, since we'd have all this data in a machine-readable, secure form. It might also support us in some wallet-like use cases, for example attestation and presentation of attributes like driver's licenses. That also needs a secure communication channel, before we even think about all the additional cryptographic challenges we'd need to tackle there. All of these things need a communication layer as a starting point for interaction between people and governments.

A broader vision: where might this journey go? We will start with a mailbox app, and later, if this works out, it might be a good starting point to provide the most common e-government services via this app. We would have an end-to-end process for applying for all these services. This would definitely help with usability and user experience. This might be a neat thing to look into.
In Germany, there are very few government services that are already integrated into an easy-to-use app. Most of them are just huge web forms where you have to enter lots of data, and then you send the form and hope for the best. If we go further, finally: why not build a framework for any e-government service? The service integrated into the app is then basically a config file. This would help us scale, obviously, and give us an opportunity to modularly specify the different services we want to provide, just by providing a config file that defines, for example, how the UI in the app looks. Putting it all together, this would give us a national privacy-first e-government app, which would be a neat thing to have. Maybe it would help us build up speed and get better in this domain.

To conclude, let's talk a bit about infrastructure. The status quo is that we have different tech stacks for requesting services and for replies; these are completely different infrastructures. For example, we can request services via a REST API, and then there's a SOAP API to send messages back. Completely different. We also currently have different tech stacks between different government agencies, which might be encrypted or not. That's obviously not good. What can we do about it? The obvious solution would be to use Matrix as an interoperability layer. In my opinion, that would totally make sense: a basic common ground for communication between different government agencies. Actually, that's what Matrix is designed for: not only the chat application use case, but also a communication layer between different organizations or people. That might be an interesting thing to look into and build some prototypes for. Plus, it would also be very easy to integrate industry's needs here; industry is also, of course, a large customer, so to say, of the government.
They are requesting, for example, building permits for wind parks. It would be nice if they didn't have to do this via paper, but rather via an easy-to-use API, and could integrate their own IT services into this ecosystem. Everything becomes easier for the government and for industry working together here. Okay, that's all I have. Thanks for listening. I would really like to continue the discussion, via Matrix of course, if you like: join the Matrix channel, it's matrixforgov.org. It would be really interesting to discuss with you there. I think we have time for some questions.

You already answered one of the questions online, about where to discuss this. Another question online was: are there any plans to bring this together with the TI-Messenger communication from the German healthcare sector? Yes, of course. We haven't had any in-depth discussions on how to bring this together, but obviously we would then be using the same tech stack. From an architectural point of view, this is what we want to do: we have all these different mailbox infrastructures in Germany right now, and we need an interoperability layer between them, to make it easy to use all of them and have one place for people to receive these messages or send messages to the government. This is one of the long-term design goals: to have all these services using the same communication infrastructure, making it easier for people and governments.

So, the question was whether the GNU Taler project, which also has some origins in Germany, might offer some lessons here. I'm personally not that involved in the GNU Taler project, but I'm looking at it with great interest, because I think it would be a nice candidate for a privacy-preserving payment system here.
That would of course integrate nicely into such an app; just yesterday I was thinking about this aspect of maybe looking a bit deeper into GNU Taler there.

From the perspective of making it more interoperable in the European domain: we are looking into, and of course talking to, other European governments about whether this might also be a great thing for them. We have the Interoperable Europe Act, we have the Single Digital Gateway regulation in the EU, so it might be a good thing to harmonise this not only on the national level but also on the European level. I think that's an important aspect when we build infrastructure, and I don't know any standard other than Matrix that has the potential to solve this quite nicely. We are talking to them. Next question.

So, the question was whether the authentication requirements that government services have would impact what is needed from Matrix, and I think we're going in the right direction here with OpenID Connect, since this is what government services already use. The thing is, this is not completely zero trust, so we are not there yet with security and privacy by design: if there were one central authentication server providing identities for all people in Germany via OpenID Connect, that would of course be a huge attack surface. So we are also thinking about how to integrate the German eID system; I have a card in my backpack. We have eID cards that can be used to authenticate people, and it would be an interesting thing to look into whether we could deploy this privacy-preserving authentication system for these kinds of services. So that's a big thing we are thinking about: how to reduce the security and privacy risks when we build such a massive system that deals with highly critical personal data.
Yeah, so the question was whether we would provide any OZG services via this protocol. The OZG is the German government service accessibility law that requires governments to provide their services online, and of course we have thought about whether this would be possible at all. Right now we have other systems in place that use different tech stacks, but in my personal opinion this would be the natural evolution: if we communicate with people via such an app, or via the Matrix standard, we might also look into using Matrix to fulfil these services. But I think that's a long journey. There are some things you need to consider when building this infrastructure, because it's not just the communication: you also have to think about which services, or who, can request these government services; you have to think about authentication, and about routing, i.e. which government agency is the right one to address. So from a technical perspective I think this would work, but it will take time to think about and maybe, at some point, build. We also don't want to build something separate from the services that are already in place, so I think the only natural solution would be to transform existing services to maybe, eventually, use Matrix, and to have a roadmap for developers and organizations on how to migrate from existing services to Matrix; otherwise this will probably not work and will create a lot of confusion.

Yeah, so the question was how to deal with backups and device signing and all that stuff, so basically how to handle private keys. Yes, we are thinking about this, and we have some ideas about how it could be done. Of course, we don't want people to manually store some private key file on their laptop and take that burden on themselves, but this is definitely something we are thinking about. If you have any input on this, I'd be very happy to hear from you in the Matrix chat. Thanks.
Yeah, so the question was: could we use our German eID cards for this? The German eID cards are able to produce digital signatures; the problem is that currently the signing keys are not deployed on the eID card, so you would have to build some infrastructure to deploy the signing keys, the private keys and certificates, for everyone. That is a huge organizational undertaking. But yeah, maybe this might be an option to go for; I wouldn't expect it in the next one or two years though, putting this into production takes a bit of time. Thank you very much. Thanks.
Embracing Matrix for Enhanced Communication: Migrating the WordPress Community from Slack to Matrix
Hi, so this will be about migrating the WordPress community from Slack to Matrix. First, quickly about me: I'm Alex Kirk, I'm from Vienna, Austria. I've been at Automattic since 2014; we run WordPress.com and others. I'm an engineer, I lead teams around localization and Matrix, and I'm sponsored to contribute to WordPress.org. I've also got some side projects: if you have a WordPress blog, check out the Friends plugin for making your site your own hub for subscribing to others, and the Enable Mastodon Apps plugin if you want to use Mastodon apps with your site.

So, quickly, and I probably don't need to tell you, but just to make sure: what is WordPress? It's a popular PHP CMS, born in 2003; today it powers over 43% of the websites on the web. It has a block editor that lets you edit posts, but also the whole site. It's well known for its plugin ecosystem, with plugins like Yoast, Advanced Custom Fields, WooCommerce and so on. And it's open source under the GPL.

And just a step back, so that you understand what our needs are as a community: this is how we collaborate. We've got 22 Make teams in different areas: one about accessibility, Core, Design, Polyglots, Meta, lots of teams, Performance, Sustainability. They all work towards separate goals. Each team has a P2, a blog, where they post about new things that are happening, proposals, decisions being announced, and lots more. This is the asynchronous part of the communication. Then we've got sometimes weekly, sometimes bi-weekly chat meetings for sharing updates and coordinating. These are quite important, because they give people a definite time when they can reach collaborators on the project: you don't have to enter a room and hope that the right person is there, you know that at this time, people who work in accessibility, for example, are available. And we've got meetups and WordCamps. Meetups are local to a city; they're the smaller ones.
WordCamps are the next stage, where people travel to meet. And then we've got the flagship WordCamps: for example, Asia, coming up in March, EU in Milan, and US in Portland in September. And another aspect: we've got an initiative called Five for the Future. There we encourage individuals and organizations to contribute 5% of their time or resources towards the WordPress project. So this means a 100-person company would have five people dedicated to the project; an individual would have about two hours out of a week. And organizations like that concept because they retain control over the person who contributes, and thus they're confident pledging towards that goal. And if you want to hear more about that, there's actually a talk by my colleague Jesús in this room: Shaping the Future, investing wisely in long-term open source development with Five for the Future. And this is what a release of WordPress looks like; these are the companies who contributed to a release: 640 people from 186 identified companies. This is the make.wordpress.org site; this is where we list the teams. And as you can see at the bottom, we list the next meeting that will happen, not only in Slack but also in Matrix. And this is the meetings during a week: every day a couple of meetings take place. And because of the distribution around the world, some meetings happen twice in a day so that everybody has a chance to attend them. All right, so our plan to migrate. It started in January last year. We announced that we'd create a subproject to evaluate migrating to Matrix. Then we would evaluate and create the environment that we need, migrate history and integrations, and finally launch, finalize what needs to be finalized, and turn off Slack. All right, so what could happen, what things did we anticipate? First, people don't like change. We've been on Slack for a while. So we figured we need to prioritize something superior.
So where are the strengths of the new system, so that people will want to move? There is complexity around decentralized systems. Everybody knows centralized systems: you go to one address and that's the only way to get there. So people might not know what to do. And then we had Slack lock-in. We've got lots of integrations created over the years that make Slack nice to use for everybody, and that's why people like it, I suppose. So when you consider Slack in an open source community, there are actually a few things that are a bit tricky. One thing is that Slack sign-up is email-based. So when you join the WordPress Slack, you have to follow a guide, and typically we actually do this at WordCamps, where we have somebody there who will help someone get onto Slack. It's pretty complicated. Then it's a commercial product. The free tier has a message retention limit. The data is siloed inside Slack, so you need API keys to access it. But many companies use Slack and it's easy to just add one more workspace, so for many people the barrier of entry is quite low in the end. Comparing Matrix to that: of course, federation means everybody could join from anywhere, from any home server, but for the WordPress community we would want to tie them in through an existing authentication system. No retention limits, of course. And our WordPress community has multiple Slack workspaces for different countries, so this would be a way to unite them in one place. And of course, an open source project should have an open source chat. All right, so we tried to make it easier to join Matrix. Number one, I already mentioned it: we created a way to use your WordPress.org account to access Matrix. And we created it in a way that anybody could install this plugin on their own server and use it to authenticate users joining a Matrix server.
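Wiring an external login system into Synapse like this is typically done through Synapse's documented `oidc_providers` homeserver option. A rough sketch of what such a configuration could look like — the issuer, client ID and templates below are placeholders, not the real WordPress.org values:

```yaml
# homeserver.yaml (fragment) -- hypothetical values for illustration only
oidc_providers:
  - idp_id: wordpress
    idp_name: "WordPress.org"
    issuer: "https://login.example.org"      # placeholder OIDC issuer
    client_id: "matrix-synapse"              # placeholder
    client_secret: "REDACTED"
    scopes: ["openid", "profile"]
    user_mapping_provider:
      config:
        # Derive the Matrix localpart and display name from OIDC claims
        localpart_template: "{{ user.preferred_username }}"
        display_name_template: "{{ user.name }}"
```

With something like this in place, the Element login screen shows a "Continue with WordPress.org" button instead of a password form, which matches the flow described in the talk.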
And with the upcoming OpenID Connect being fully in for Matrix, this is a potential authentication provider. Yeah, so on WordPress.org we've installed it. People can use their WordPress.org account; they will go through their WordPress login screen and just authorize the WordPress server to submit the information to Matrix. Number two, we created a Matrix client in a WordPress blog. A WordPress page is made up of blocks, and one of those blocks can now be a Matrix chat. We call it Chatrix. And you can configure each block individually. So one thing you can do is pre-define the home server, which we'll do, but you can also restrict it to a single room. It's based on Hydrogen, and we did some upstream contributions. Before we used it, you could only have one Hydrogen open in the whole browser, even between tabs. So we contributed something so that you can use it in multiple blocks on the same page; if you have multiple posts, typically they would all be put on one page, and that wasn't possible before. And we had a couple of bug fixes to use Hydrogen with SSO; I'm not sure how many people had used it with SSO before. And this allowed us to create team chat pages. What does this mean? We can give a contributor a URL, a WordPress.org URL, where they should go for a meeting. They don't need to know this is Matrix; they just see it's a URL on WordPress.org. So for example, for Make WordPress Core, the team that creates WordPress core, the address of the Make blog is make.wordpress.org/core, so the chat page is just that plus /chat. Core has different chats; there's another chat. The design team has a chat, and so on. So this is what such a page looks like. On the top you have custom content; it's a WordPress post, you can put anything there. We put there when the next meeting is, instructions on how to get there, and also instructions for if you want to use your own Matrix client.
And this is the Chatrix block, which shows the room at the time. For FOSDEM, my colleague Ashish created a small demo, and it uses the WordPress Playground, which is an interesting concept where you can run WordPress in your browser and test any plugin in a sandbox. So I've recorded a demo video, to be sure, but it's real time, so as it loads you can see it's pretty fast. So this now loads WordPress, and we've preconfigured it with a Chatrix block, and here it joined the chat. You can go there and enter a message, and all you have to know is the URL of this page. If you want to add such a block to a page, this is how you do it: you use the Gutenberg block editor, you add the block, you configure it, you set a home server. And if you want to lock it down to a room — you don't have to, but it can be practical — you just enter the room name, and then the block hides the room list and just shows the room that you attached it to. And then it's a block: you can add stuff before, after, as you wish. It's a pretty neat way of giving instructions to people, or putting in, I don't know, meetup agendas, whatever; it's a post. Additionally, we created our own Element instance. You can preconfigure it with the home server so that you don't have to tell people to enter this home server into the login screen, which is something where people might typically get lost already. And we also created a bridge. Since we control both the bridge and the Matrix server, we were able to create all the users on the Matrix server and use the Slack bridge with a slightly forked version so that we can use puppeting. So when you post something on Slack, your Matrix user will say the same thing in your name. And there are some upstream fixes, by the way, that could be merged. Yeah, so that makes things quite streamlined. And another thing we wanted: we didn't want to lose the history of Slack.
But it's been a bit tricky, because if you create a bridge, the bridge needs to start at some point and you cannot really backfill messages. So we figured out this little trick of first creating a room and bridging it, then creating a second room and migrating the history into that room. Then we would add all users to that new room. We would import the old events in sequence, so that we can backdate them using an app service. And if a user is no longer in the room, we have to re-invite them, and so on. When we're finished, we can then copy the events from the first room, the one that had already started to be bridged, and thus close the gap in the history — the period between exporting the data from Slack and the bridging starting. And then we can change the room aliases, reattach the bridge, delete the old room, and we've got a room with all the history. So now we have a Matrix server for the community; it uses Synapse with a Slack bridge, OpenID Connect configured, and the app service. And we migrated 3 million messages in 170 rooms, 45k users, 55 gigabytes of database size. And during this process we updated the community. We held weekly meetings, as is common in WordPress, and published meeting notes afterwards. And we got coverage from the Tavern: first in January, "we're starting this"; then in April, "this is how we're continuing". We had to figure things out about private and public messages, and then we installed the Matrix bridge. So now to the migration. In November we announced: we want to migrate to Matrix, and this is how we'll do it. We'll ask people to use Matrix instead of Slack. Before the final migration, we'll post a message in every Slack room: Slack will be closed, this is where you need to go for instructions. And then finally, disable posting.
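The gap-closing step of that trick boils down to importing the exported Slack history into a fresh room and then appending the events the live bridge captured in the meantime, dropping the overlap. A minimal sketch of that merge logic, with a deliberately simplified event shape (real events carry much more than a Slack `ts` and a text body):

```python
def close_history_gap(imported, bridged):
    """Merge an imported Slack export with events the live bridge
    already captured, deduplicating the overlap and keeping time order.

    Each event is a dict with a unique Slack 'ts' (timestamp string)
    and a 'text' field -- a stand-in for real bridged events.
    """
    seen = {e["ts"] for e in imported}
    # Events the bridge saw after the export was taken close the gap.
    tail = [e for e in bridged if e["ts"] not in seen]
    return sorted(imported + tail, key=lambda e: float(e["ts"]))

# Export taken at ts=3; the bridge was already live from ts=2 onward.
imported = [{"ts": "1", "text": "hi"}, {"ts": "2", "text": "hello"},
            {"ts": "3", "text": "update"}]
bridged = [{"ts": "2", "text": "hello"}, {"ts": "3", "text": "update"},
           {"ts": "4", "text": "new"}]
print([e["ts"] for e in close_history_gap(imported, bridged)])
```

The backdating itself relies on the application service API's ability to set timestamps on sent events, which ordinary clients cannot do; that part is omitted here.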
It's actually quite interesting that it's pretty hard to just disable a Slack workspace, because in a way you want to be sure that it's still around. The only way to completely shut it down is to delete it, which is a destructive operation. So what remains is that people could still DM. Well, okay. So the feedback we anticipated from this: people want the default, so we figured they would use Element. We knew that notifications in Element are not to everybody's liking. There's no dedicated threads-and-mentions view as in Slack — threads is coming, I saw it. There are a couple of things that people are used to from Slack that are a bit different. We anticipated that and felt people could live without them; people on Matrix have been living without them for a while. Search is a bit difficult; there's no search query language in Element. And while there are many other clients, some of them miss important features like threads, or have implementations that are kind of different. I mean, I've tested some of them. Nheko, for example, works, but it's different. And then, when you provide a home server to a community, it comes with all sorts of troubles. You cannot limit people creating rooms on the home server, so people will create some spam rooms, whatever; you need to be aware of that. So we started to collect issues from the community. They said we are unable to enforce some things like we can on Slack: you cannot reduce the time allowed for editing messages, you cannot enforce room membership for federated users. Well, okay. Thread messages: in Slack you can say, I want this message to also be posted to the main room; that doesn't work. Other Slack features are considered essential: you cannot ping a group on Matrix, and you cannot ping @here, only @room. There is stuff you can enforce when you have one central server that you cannot enforce in a distributed environment. And scheduling of messages, reminders — not there yet.
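The missing group ping can be worked around with a bot that expands a group alias into one message mentioning every member individually, which is roughly what the team describes building later. A hedged sketch — the group roster and user IDs here are made up, and a real bot would read them from configuration and post via the Matrix client-server API:

```python
# Hypothetical group rosters; a real bot would load these from config.
GROUPS = {
    "docs-team": ["@alice:example.org", "@bob:example.org"],
}

def expand_group_mention(group: str) -> str:
    """Build a message body that mentions each member individually,
    since Matrix has no native group ping. For very large groups this
    produces very long messages, as noted in the talk."""
    members = GROUPS.get(group)
    if not members:
        return f"Unknown group: {group}"
    return " ".join(members) + f" (group mention: {group})"

print(expand_group_mention("docs-team"))
```

A bot command like `!mention docs-team` would then post the expanded body as a new message in the room.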
Through a bot maybe, but not well integrated in the UI as in Slack. Then accessibility problems. There has been an initiative to improve Element's accessibility, but there are still gaps, like macro keyboard navigation, and the VoiceOver experience wasn't super great. Then we had bridge glitches, like out-of-order messages, duplicates — all sorts of small issues. User experience around threads management, obviously; we anticipated that. Then some things that don't work well in Matrix: failing to load, time zone positioning, lots of user-join events that can make things pretty slow. So what we did to address this: we implemented integrations and many fixes via bots. We used the maubot framework, so we could use the RSS bot and GitHub bot. We tried to make it easier to migrate our own Slack integrations, so we have a post-to-room bot that uses a webhook to post messages to a room; the other direction is how something on our servers can react to something in a room. We implemented group mentions — not super great: if you post a command to the bot, it will post another message mentioning everybody in the group, and there are some very large groups, so there could be very long messages. And a watchdog, so that we can be alerted if spam rooms are created by community members. Also, because we had our own Element instance, we could ship fixes there while they were waiting to be merged. We provided a channel for the community where they could get help. We created documentation and guides. But we had to stop the migration. So Matt called off the migration at the State of the Word, and then we posted about it. It turns out the accessibility problems were too big; we weren't able to merge the fixes in time. We submitted the patches upstream. And there was uncertainty around where our UI needs are on Element's roadmap, and what the effects are of the license changes that were announced for Synapse.
And overall: do those changes mean that the ethos of the WordPress project is no longer aligned with the Element or Matrix projects? It creates a bit of uncertainty that was detrimental to the migration. So the current status: the WordPress community remains a Slack community, but now, with the Matrix bridge and all the Slack history, new contributors no longer need a Slack account to join the conversations. Turning off Slack is currently not planned, and we'll keep observing how the Matrix product develops. So in summary, the WordPress community didn't fully migrate in the end. But maybe the things that didn't work for us are not so important for you. We are a huge community, there are many voices; I could see those things not being as important in a smaller community. And I hope this talk helped you identify what was important for us and decide whether you are suffering or not suffering under the same issues. Along the way we did a lot of open source contributions: the WordPress plugins I mentioned, and to Matrix we open sourced all our bots, the migration app service, and patches upstream. And that is it. Thank you. Yeah, check out the slides; there are lots of links in the slides. Yeah, so you mentioned that for the migration you wanted to fill in your room with the Slack messages. You actually don't really have to do that, because a lot of the bridges allow you to backfill the messages from the history. So the question was whether we were using the functions of a bridge to get back the old messages. In our experience, it wasn't possible to backdate the messages; that was the main issue there. I suppose it depends on the implementation on the Matrix server. Maybe there's not — okay.
Okay, so the question was whether we considered using an app service to change the push rules for a user in order to enable group mentions. No, we haven't considered that. Maybe it's a possibility. Basically you're saying you would add a keyword for the user so that they would be mentioned, and you would configure it for them. Maybe that's a possibility. Yeah, you were saying there were some accessibility problems which kind of killed this — could you give us a little detail on what was actually missing, or what the problems were? Sure, so the most important problems were around macro keyboard navigation, navigating between the bigger sections of the app: there's the sidebar of the spaces, there's the room list, there's the search menu and the messages. And for example, you couldn't move up the message list using the keyboard. And if you were somehow able to get into that area, then the VoiceOver read-out wasn't very useful; it repeated, for every message, for example, "profile picture". Stuff like that had been annoying people. Matthew said the accessibility team has reviewed the patches that were submitted. Other questions? Yeah. Did you have to disable some of those integrations or work around them? For example, I imagine that @here things wouldn't work very well on both sides in the same way. All right, the question was whether we had to disable integrations. So one interesting thing about the bridge is that it works both ways. So migrating an integration could be done in a way that you first create the integration on the Matrix side, and when it's ready, you turn it off on the Slack side and enable it on the Matrix side — and still, both sides would be able to use the integration.
So for the @here one — well, it only worked on Slack in the end. But I don't know, it depends on the team. In the WordPress project there are so many teams that every team has their own way of doing meetings. Some heavily rely on those group mentions, others don't; some need the @here mention, others don't. It's hard to make everybody happy all the time. That's probably part of such a big migration: you get so many opinions, and as with many communities, some are louder than others. So, a question from the internet: where can you find the tools we used for the migration of the room history? I recommend you look at the slides; in the slide where I talk about the app service migration, that's where it's linked. Is there any integration with Element Call? The question was whether we did an integration with Element Call. No, we didn't. There is no culture of using video conferencing regularly; some teams use it, but they tend to use Zoom at the moment, I think. It depends on the team what they use. For example, Slack Huddles, as the alternative on Slack, are not being used as far as I know. Is there a possibility to complete the migration, or is it more of a licensing or accessibility issue? So the question was whether there is a possibility to complete the migration. I think it's certainly possible. I think there has been a bit of tension around implementing the migration fast, so that people are not left behind. If you let the migration linger for a long time, then people will never migrate, and at the end people panic and do the migration, so the whole long period is wasted, kind of. That's why the initial plan was to have it rather short.
But on the other hand, I think this current hybrid state is not as bad as you would imagine, because for new contributors we've got this easy onboarding. And one thing that I liked about the way we implemented it is that you can slowly upgrade your experience: you start with the chat URL, and then if you use it a lot, you could upgrade to Element, the one that we host, and then you could upgrade to another client. So I think that's an interesting way of luring people in. So maybe over time the number of Matrix users will increase so much that it becomes a request from the community. But as of now, we're waiting to see what the license changes do, and this hybrid state is one that I think is acceptable for the moment. Okay, no more questions? One more? No more time — one last question. Matthew. I just want to know what it is about the license change on Synapse that is causing this. I'd invite you to talk to Matt. Yeah, it's basically that WordPress is on the GPL license, where you are able to modify software on servers without having to push the changes back; and contributing code back to the Element project and signing a CLA is something that makes people uncomfortable. All right, thank you. Thank you very much.
NeoDateFix - A solution to organising meetings in Matrix
Now we will have Milton, Nurjin, Ahmed and Mikhail. They will tell us about NeoDateFix, a good solution to organize meetings. Thank you very much, the stage is yours. Thank you. I'm happy to see every one of you here today. As Jan presented, I'm Milton, and we're going to talk about NeoDateFix, previously known as Matrix Meetings — that was the starting name of the project. But anyway, we'll start by talking a bit about who we are. Ahmed, yes? Closer talking. So we are four developers from Nordeck. We have been doing software development, specifically developing web applications on top of Matrix, in the context of the openDesk Sovereign Workplace project for the German public sector. We have built this suite of web applications that are embedded within Element; I'll explain a bit more about that later. But yeah, we have NeoDateFix, which we'll present here today. We have NeoBoard, which is a real-time collaborative whiteboard — actually what you're seeing in this presentation: we built these slides and are presenting them with NeoBoard. We also have voting polls, the NeoChoice application, which is not spec-based, but I won't get into that. And if you were at the Fringe event last Friday, we were using the BarCamp application to manage the schedule, the speakers and the whole tracks there. So yeah, what is NeoDateFix? NeoDateFix is a web application that allows you to create meetings, especially video conferencing meetings, within a Matrix client using the widget API. Currently, the only client that implements the widget API is Element Web, so that's what we have to work with — and it is a good one. And yeah, what can we do with the application? We can create these meetings as meeting rooms, as I've said. The meeting rooms are created with a default widget layout, so we have the video conference widget expanded, front and center, with other widgets that you can choose.
Typically it could be a whiteboard or some other widget that you want to set up beforehand; you can pre-configure this for usability and quick action when you get into the meeting. We can schedule recurring and non-recurring meetings and see them in this calendar view that we'll show. It also supports creating breakout sessions: if you have larger meetings and want to create sub-meetings and split people between those meetings, you can do that. We also support users that don't have an account on the home server; we'll bring them in, creating them as temporary guest users, and they'll join the meeting. And we can also integrate with third-party clients. Specifically, in the openDesk project there's Open-Xchange, which is also a calendaring solution, and when you create a video conference call there, it will create a meeting room in Element with everything set up for that call. And finally, all of this is fully accessible and with multi-language support. Okay. So, going to the widget part: if you're familiar with widgets, you sort of have an idea; if not, it's a way to embed web applications inside Element. It gives you access to the room events and room state events, and not much more, but that's the gist of it. And the way we have built our applications is on a common layer, which we call the widget toolkit. It gives you, for example, a React component which will inject the widget API client into your React app, and you can start using it without having to do that integration yourself. It comes with Material UI components, so you can also have a consistent look and feel, and you can change the theming. It also comes with some mocking components for easier testing. And finally, it comes with a base Docker image that you can use to quickly deploy your widget into your infrastructure. And it's not only a widget but also a bot, because the widget API only gives you access to the room data.
We need to create the meeting rooms and set up all of these accessory workflows, so we use the bot to perform these. It's built with Node.js in TypeScript, using the bot SDK and the NestJS framework for the API that we expose. And yeah, this is a broader overview. Now Mikhail will talk a bit more about the internals, how we are doing this. Hello, hello. Thanks, Milton. I'm Mikhail. I would like to continue with a high-level architecture of how it works. So we have this NeoDateFix widget that is embedded in the Element Web client. It uses the widget API, with the toolkit, to send and observe state events and call some other actions that the API allows. It all passes through the Element client to the home server, and some of it goes to the NeoDateFix bot. The NeoDateFix bot looks for particular message events of particular types, and when they are received, it applies certain actions to the rooms. So besides the Matrix API, the NeoDateFix bot also has an HTTP API that is used by the NeoDateFix widget to provide widget lists and additional configuration that the widget may need. Additionally, it provides the HTTP API to manage the meetings from external clients, as I've already said. In addition to these components, we also developed several Element Modules that simplify this setup a bit and add some optional features, like the lifecycle module and the guest module — but these are optional. And getting started is very simple: if a user wants to start with the NeoDateFix widget, they create a room and invite the bot to this room; the bot will auto-accept the invite. Then the user needs to grant moderation rights to the bot. As soon as that's done, the bot adds the NeoDateFix widget to the room, so they can see the calendar and create the first meeting. They can create a single meeting or recurring meetings. It all ends up with one room per meeting. But the meeting room is a special room.
It has the type net.nordeck.meetings.meeting in the m.room.create event. It also connects to the parent room, the calendar room, with m.space.child and m.space.parent events; there is a one-to-many relationship with the meeting rooms. The meeting room has widgets, and of course some other state events that are related to the meeting. Within the meeting room, a user can create a breakout session room. It is also a separate room, but with its own breakout session type, and it also has a connection to the meeting room where it was created. So we use message events and state events, obviously, from Matrix, and all the message events are prefixed with net.nordeck.meetings. These are the events that are sent by the widget to manage the meetings: to create, change permissions, change participants, tombstone the meeting, or send some messages. The state events are used to store the state of the meeting. Mostly they are standard Matrix ones, but in addition there is net.nordeck.meetings.metadata, which contains calendar information. So this is an example of the calendar information from the meeting metadata event. For a single meeting, it is a list of just one entry that has start and end fields with a date-time stamp together with a time zone; quite simple. And for a recurring meeting, there could be excluded dates and overrides: besides the frequency rule — frequency daily, interval one — it can have exclude dates, to exclude particular dates from the recurrence, and it can have several overrides to change particular occurrences of the recurring meeting. Yeah, that's all regarding the slides, and I would like to hand over to Nurjin to show some of the features. Thank you, Mikhail. Hey everyone, it's Nurjin. So yeah, hopefully after all the talk you're teased enough to see some action, some demo. Here, Ahmed and I will quickly demo the basic features.
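Before the demo, a rough illustration of how that calendar metadata expands: a daily rule with excluded dates and per-occurrence overrides can be flattened into concrete dates. This sketch uses plain Python dates and simplified field names; the real metadata event uses iCalendar-style RRULE strings and full date-times with time zones:

```python
from datetime import date, timedelta

def expand_daily(start: date, count: int, exclude=frozenset(), overrides=None):
    """Expand a simplified FREQ=DAILY;INTERVAL=1 rule into `count`
    occurrence dates, skipping excluded dates and applying overrides
    that move individual occurrences to a different date."""
    overrides = overrides or {}
    out, d = [], start
    while len(out) < count:
        if d not in exclude:
            # An override replaces this occurrence's date.
            out.append(overrides.get(d, d))
        d += timedelta(days=1)
    return out

# Daily meeting from Jan 1: Jan 3 is excluded, Jan 4 is moved to Jan 5.
occ = expand_daily(date(2024, 1, 1), 3,
                   exclude={date(2024, 1, 3)},
                   overrides={date(2024, 1, 4): date(2024, 1, 5)})
print(occ)  # three occurrences: Jan 1, Jan 2 and the moved Jan 5
```

Note this simplification counts only occurrences that survive exclusion; real RRULE COUNT semantics (per RFC 5545) differ, so treat this purely as an intuition aid.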
So yeah, first we need to create a new room; we call it a calendar room. And we need the bot to be added to the room and to be given the right role as a moderator, so the bot is able to configure the room for us and add the widget. Yeah, here we can see the bot added the widget into the room and pinned it in the middle. So we can schedule meetings; here you can see the information that you can set for a meeting. You can add participants; you can also allow or disallow messaging in these meeting rooms. And we have a set of widgets configured by the bot; you can add or remove whatever you want. We will create an example of a single meeting and a recurring meeting — it's basically the same. We can add the recurring meeting and say whether it's an open-ended meeting, or whether it ends after a specific time or after a specific number of recurrences, for example. Yeah, here we first have the list view, where the meetings are shown as cards. Each card contains the information, with extra buttons for the participants and for sharing the meeting: we can share the meeting with a link or by email, or we can download it as an ICS file and import it into other calendars. We are also able to edit or delete the meeting, and of course go directly to the meeting room. Other than the list view, we also have this calendar view: we have the day view, the work week, the week and the month view. And in each of these views, if we click on a calendar entry, we can edit the meeting. So for example, here we can edit the whole series, or just one recurrence. We edit one instance here, for example, and save. And if we go back to the calendar, the changes are reflected: this one deviates from the others regarding timing. So we can also join the meeting.
We can see that the bot already configured the room with the widgets that we chose, with the specific layout configuration that we set. Here we set that NeoBoard and Jitsi are configured. We also see that the bot sends notifications to the room with every change that we make. And besides those, we have the NeoDateFix details widget: it's basically just more detailed information about this meeting room. We can also do other actions with it, edit it or delete it, and we can go back to the parent room of this meeting room. And as Milton already talked about the breakout sessions: here in the meeting rooms we can create as many breakout sessions as we want or need. Here we can select — they are divided into groups, named by default Group 1, Group 2. We can distribute the users randomly, or we can select them manually, whichever we like. So, yeah. Here we can see the breakout sessions are created; they are also shown as cards, and additionally we can send a message to all breakout sessions. As an organizer, you may want to notify all breakout sessions — "let's come back to the meeting room" or whatever — so we can send it. We can switch to another user, and here we can see that he got all the invitations; for example, this one for the daily meeting, if we view the message, or go to it. So basically the message of this invitation also contains the recurrence rules, information about when it occurs and who you were added by. And you can see here, for example, in the breakout session where Alice sent "hello world", the message was sent to the room, and the breakout sessions are also configured with Jitsi. Yeah, I guess that's all. Thank you. I will hand over to Milton. Thank you for the demo; I hope you liked it. Just to finish our presentation, we have a couple of interesting things that we found and want to share with you.
The first thing is that, as you can imagine, creating lots of meetings and temporary users uses resources; they are relatively cheap, but we want to keep things clean. So we have additional features where you can clean up the temporary users using a sign-ups module, and also this somewhat hackish room reaper that goes through the finished meetings in the past. There is a field that tells the bot when they should be deleted, and it cleans up after itself, which is a good thing. And, can you move to the next slide? We also have what we believe is a very good end-to-end test suite: besides unit and integration tests, end-to-end tests let you script the full interaction from the browser through Element Web to the widget, and how it then interacts with the bot. We have a fully automated way to have the environments created, tested against, and then destroyed. This is obviously a precondition for us to make releases when these tests pass, and they cover most of the features. So if you want to see a good example of using Playwright and Testcontainers for end-to-end testing, please check out the repo. There is obviously still room for improvement. We are just finishing, and should soon be releasing, support for encrypted meeting rooms and an encrypted control room, that is, the calendar room. We had a slight issue here because one of our clients requires us to deliver to the special IBM Z platform, and there were no bindings for the Rust crypto crate for that platform; I think we will have that in order for a release soon. We also want to make the bot clean up the rooms instead of that hacky script, support Element Call when it becomes the default, and have space-scoped calendars. In the demo you saw that there is a single calendar room and it creates the meeting rooms at your top level.
If it could create them within spaces, the meetings would live within that space, and you could manage different teams or different groups with different calendars, which would be a good thing. And finally, being able to publish meetings out to another calendar client would be a great thing to wrap up with. Here are some of the resources, the links to our repos. These are open-source, Apache-2.0-licensed applications, so be sure to check them out. I think we're ready for questions. Yes, there was one question on the internet: do you have support for encrypted rooms, and if not, are there plans to do so? Yes, as I said a few minutes ago, we currently don't support it, but we are releasing that soon; it's a matter of days. (The question was whether we support encrypted rooms.) Yes, this is the NeoBoard; maybe I can show a couple of features. This is a widget that allows you to have a real-time collaborative whiteboard. It's an initial feature set, but if you attended the Matrix Community Summit in Berlin last September, we did a full presentation there; it's online, you can check it out for more details. The next question is how we implement the invitation page shown to the invited user, with the information about the meeting. Do you want to take that? Yes. It works like this: first of all, in invitations there is the message itself, which is constructed inside the member event; there is invite text, so unfortunately we didn't show it, but it is there. Besides that, there is the metadata event; we configured Synapse separately to share it in the stripped state. So when you receive an invitation, it is already shared, and you can see it in the calendar already. If you open the calendar as the second user, you would see it there already.
So yes, we added it to the stripped state. Next question: did I understand correctly that it only supports Jitsi meetings, and not in-app meetings? What do you mean by in-app meetings? Most of the clients have their own meeting functionality, and I wanted the chance to use that. The answer is: if there is a widget for it, we can support it. If there are other alternatives for video conferencing, it's a matter of developing a widget that supports them and setting up the bot configuration to include it in the room. So if it's not supported as a widget yet, in theory you can develop the widget with the toolkit and add it as another widget to the meeting. Okay, the question is what the Docker container part is that I mentioned. In order to deploy widgets, which are web applications, they need to run on a web server. So we have this Dockerfile template based on nginx that is already prepared for you to include as the base Docker image for your app. Instead of including a Debian- or Node-based image, you include the widget server image in your Dockerfile and just copy the built release distribution assets over. That's the main accelerator: a ready-to-use base image for widgets. Any further questions? You can download an ICS file. The question is: can we integrate with Google Calendar and other calendar publishing platforms? We only support downloading an ICS file for a recurring or single-instance meeting. The internal format that Mikal showed is stored as iCal, the iCal format, so the storage is using that, but we don't export any data out currently. That would be a good thing; the project is open source, and you can contribute support for that as well. If we go to the resources page... well, not the widget, but the NeoBoard.
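To make the ICS export concrete, here is a minimal sketch of the kind of payload such a download might contain for a recurring meeting. This follows the RFC 5545 iCalendar format the speaker mentions; the field values and the helper name are illustrative, not the bot's actual output.

```typescript
// Minimal sketch of an iCalendar VEVENT with a recurrence rule (RRULE),
// as defined by RFC 5545. Field values are illustrative.
interface MeetingSketch {
  uid: string;
  title: string;
  startUtc: string; // e.g. "20240203T100000Z"
  endUtc: string;
  rrule?: string;   // e.g. "FREQ=WEEKLY;COUNT=10"
}

function toIcs(m: MeetingSketch): string {
  const lines = [
    "BEGIN:VCALENDAR",
    "VERSION:2.0",
    "PRODID:-//example//meetings-sketch//EN",
    "BEGIN:VEVENT",
    `UID:${m.uid}`,
    `DTSTART:${m.startUtc}`,
    `DTEND:${m.endUtc}`,
    `SUMMARY:${m.title}`,
    ...(m.rrule ? [`RRULE:${m.rrule}`] : []),
    "END:VEVENT",
    "END:VCALENDAR",
  ];
  // RFC 5545 mandates CRLF line endings.
  return lines.join("\r\n");
}
```

A meeting that "ends after a specific number of recurrences" maps to `COUNT=N` in the RRULE, one that ends at a specific time maps to `UNTIL=...`, and an open-ended one simply omits both.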
Yes, we have a live widget demo for this. Sure. The question is how you can include and use this right now in your Element Web client. Because it's a widget and a bot, you have to host them somewhere, so you would need to download and deploy them to some server or VM; it's not included in any distribution that I know of. Thank you. Thank you very much.
MatrixRTC: The Future of Matrix Calls
Thanks for the amazing introduction. As Jan said, we're here to talk about the future of calling in Matrix, and we actually bring some pretty cool new things; there's quite a lot going on at the moment. We really hope that from now on there's finally good calling in Matrix, or at least that we're taking the first steps, and all of this will be built upon Matrix RTC, which is basically the underlying protocol that will power all calling in the future. That's what we're going to talk about: how this works, how it is structured, and how calls are built on top of it. Matrix RTC is something a couple of you have probably already encountered in the form of Element Call. Basically this is a standalone web app where you can just have calls; it's very similar to Jitsi, but in the background it's actually running Matrix RTC. What it hasn't reached yet, though, is really being part of the federated system. This single-page application is very enclosed: it runs its own home server, it doesn't federate, and you can't log in with your actual Matrix account; you need a custom account for this specific application. The change we're presenting now is that we have the same technology, but in our federated Matrix system. Before we get into the interesting new things, let's talk about why we even considered redesigning all of this. As probably all of you know, there has been calling in Matrix for quite some time already: it's in Nheko, it's in the legacy Element apps, and it's in Element Web. So why not just work on those? There are issues. For example, if you call each other at the same time, the calls sometimes don't figure out that two people want to talk to each other. Sometimes one of your devices never stops ringing. But why not just fix those? Oh, I see a lot of nods. That's actually super satisfying.
It's really good to see that people know what I'm talking about. So why not just focus on fixing those, instead of rebuilding something entirely new? The thing is, there are some pretty fundamental limitations. The legacy design only covers one-to-one calls; that's just how it's designed, and the specification was never really designed for something bigger. It's very call-specific, so you can't build arbitrary real-time applications on top of it; it's just for calls, and we think it would be cool if that changed. And the signaling is done over room events. That's not necessarily a mistake, but it makes things a little slower than necessary, and it's really hard to get right, as we can see with ringing that never stops, or calls that don't converge into an actual call when we call each other at the same time. So this is basically our vision, what we want to achieve: we want calls to be a great and central part of Matrix, via Matrix RTC. These four columns are the core things we really want to get right. We don't just want to have calls; we want to think beyond calls and build an expandable system that motivates other projects too. We already had this, not with the exact stack we have now, but something very similar: people like Element built Third Room, and Nordeck built things like the NeoBoard, which are built on something similar to Matrix RTC. We want to make Matrix RTC a thing where it's super easy to build those kinds of applications. The other column which is super important is that it uses a pluggable RTC backend. Currently that's LiveKit, and LiveKit is an amazing open project, so it really fits into Matrix from a culture point of view. It's an open system, and it solves all the very complicated issues you hit when you use WebRTC for calling.
It even ships an SFU, and it's just a very decent combination: Matrix for the high-level signaling and LiveKit for actually doing the WebRTC work you need to go through. It gets quite annoying if you look into the details, and they do an amazing job of getting this all nailed down. Then it has to support large group calls; everything we want to have in the future shouldn't be just for one-on-one calls, which I guess is pretty obvious. And last, we want to make it as simple as possible for other clients to support the whole infrastructure. We already have two apps from Famedly, the Famedly app itself and FluffyChat, which support it, and we have the Element apps which support it. We also, and we'll talk about this in more detail later, want to make it as easy as possible for others to add calling. There's a widget path you can take, and LiveKit also helps us here because they provide pretty decent SDKs. So if we want to build calling on Matrix, we really want to leverage all the good things about Matrix. Here's a very short recap, and I can go through it quickly because it's probably not that surprising, of what Matrix is really good at, and which properties we have to carry over into this real-time infrastructure. One is that it's an open standard; that's one of the things I really see as the core of Matrix, and it's super cool. Then we have Matrix encryption, which is really powerful and goes further than just encrypting for large rooms: it also has a very good authentication and verification system. And that is something I think is super essential: you can not only connect encrypted to other people, the connection is also verified.
So you have the guarantee, if everybody does the device verification correctly, that all the participants are actual participants you trust, and that you trust their devices not to be malicious. That is what actually makes security in the end: there is no weird third party in there which shouldn't get the data streams you are streaming. Then it's a federated system, so calling definitely has to go down this path as well. And what Matrix is also really good at is persistent storage: it's not just exchanging data, it's also storing data and replicating the stored data over multiple home servers. But that comes with the cost that it's not real-time real-time, which is what we need for calling; it's more in the below-a-second range, not the millisecond range. So with those four columns in mind, how can we now use Matrix to build a system that uses the best parts of Matrix while still succeeding at actual real time? This is done with three core parts: we have the Matrix part, then we have the client apps, which use the LiveKit SDK, and we have the RTC infrastructure, which is LiveKit in this case. Starting from the top, we have just a Matrix room, which can live in a federated system. The core component of the Matrix system, or rather the problem Matrix solves here, is that it basically stores which user is currently in which session. So if I join a room and read the room state, I can immediately tell who is in which session and how to connect to those people: I know whether there's a running call and I know how to connect to it. Of course, Matrix also does a lot more: sharing keys, providing the accounts and the verification. Then, in the center, we have the clients themselves.
Here we have a couple of clients which have only the green box in them, and then clients with both the green and the blue box, and each of those boxes is basically one RTC application. To make this example more concrete, you could think of the green box as Element Call, or calling in general, and the blue box as some shared-document real-time system, or Third Room, or whatever you have. Some of those members are in just one RTC session and some are in two, and this is also something that should be possible. Then at the bottom we have the RTC infrastructure, where we primarily want to use LiveKit, but it would also be possible to use full mesh, and we have this empty box at the end: it should also be possible to use whatever new technology is emerging. If WebTransport at some point replaces WebRTC, you could implement a new infrastructure which does the same high-level signaling over Matrix but uses the new technology to get even higher data transmission, or whatever the advantage is. Now let's look in a little more detail at those room events. Before, they were at the top; now they're on the right. We have a room, multiple member events, and each member event has an array of memberships. We need this array because, as seen before, we could have a call and a real-time document at the same time. The top part of the membership JSON object here is the actual core Matrix RTC part. This data is just there so you know how to connect to this specific peer in the RTC world. It has this very central field, foci_active, which says the type of focus, or the type of connection you want to use, in this case LiveKit, plus all the necessary information to connect to it. This is the part which can be replaced with WebTransport or full mesh or whatever you like. And then there's another pretty important field, and that's the application.
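As a rough sketch of what such a member state event might carry, the shape below follows the talk's description (an array of memberships, each naming an application and an active focus). The field names come from the MSC3401-era proposals mentioned in the talk and may differ in the final spec; the helper function is mine, for illustration only.

```typescript
// Sketch of RTC member state event content, per the talk's description.
// Field names (foci_active, application, call_id, scope) follow the
// MSC3401-era proposals and may differ in the final specification.
interface RtcMembership {
  application: string;          // e.g. "m.call"
  call_id?: string;
  scope?: string;               // e.g. a whole-room call vs. a breakout call
  foci_active: { type: string; livekit_service_url?: string }[];
}

interface RtcMemberContent {
  memberships: RtcMembership[]; // one user can be in several RTC sessions
}

// Given the member events of a room (user ID -> content), list who currently
// participates in a given RTC application, purely from room state.
function participantsOf(
  members: Record<string, RtcMemberContent>,
  application: string,
): string[] {
  return Object.entries(members)
    .filter(([, c]) => c.memberships.some((m) => m.application === application))
    .map(([userId]) => userId);
}
```

This is the property the speaker highlights: just by reading room state, a client can tell who is in which session before connecting to any RTC backend.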
Each membership has a specific application associated with it. In this case it's m.call, and that basically also determines the typing of all the other fields. For m.call we have a call ID as well, and a scope: whether it's a call for the whole room, a breakout call, or whatever you want to add to the calling specification. But you can also imagine all kinds of other things. If we think about Third Room, one possible field could be, for example, a country or continent. When I look at the room state I can immediately tell who is in which country, and based on that I know whom to connect to. So we can do very high-level optimizations in this Matrix RTC world before we even connect to an SFU. What time do we have? Oh, it doesn't say here. Oh, okay, this is fine, so we can actually talk about this as well. This is kind of an interesting thing, and it's one of twenty problems I could have chosen that we encountered, which I find really interesting for getting into the mindset of what those call member events are and what kinds of problems we hit in such a federated world. It's about call history. Whenever we have a call, it's of course super valuable to see afterwards in the room history that there was a call, how long the call was, and how many participants there were. One very trivial approach would be that at the end of a call we just send a summary into the room, and the summary contains all the data: how many people there were, the duration, and everything. But then we run into issues which are very common in a federated world: who creates this event? There has to be some kind of glare resolution. Maybe nobody feels responsible for it. Maybe the one responsible has a client which crashed at the moment it needed to send the summary. Maybe two people think they're responsible because there was some state which hadn't resolved yet.
It would also be redundant data, because every state event is of course part of the DAG, so it is already in the history of the room. By having a separate summary event we introduce a possible conflict: looking through the state history you might see that the call was ten minutes long, while the summary says twelve minutes, because a client-side bug failed to calculate the proper call duration. This slide actually got broken; either way, it's still visible enough, so it works. The cool thing is, if we look at the call member events shown before, it's very easy to parse those events as join or leave events. On the left-hand side, with the green border, we can see that in the unsigned field we always have the previous state of that event. If the previous state was an empty array and the current state is an array with a membership, this can easily be parsed as a join event, while on the right-hand side, with the black border, we have a previous content with a membership, so somebody was in some kind of RTC session, and now the current content is an empty array, which implies a leave event. So it's really easy to tag those events. Looking at the next slide, we have a visualization of a timeline: the left-hand side is the past and the right-hand side is the present. The red boxes are state event changes which we tagged as leave events with the system I just described, and the green boxes are state event changes which we tagged as join events. To go through a very simple example: member three had no changes at all, so during the whole period shown on screen they were not a member. If we look at member two, in the past they had no membership, then they had a join event, so from that point on they were in a membership, and then a leave event.
If we now run an algorithm locally where we start from the present and go backwards, collecting all the leave and join events, we can basically recreate the call state. At each point we know who was joined and who wasn't, and we loop through this algorithm until we find a point where nobody was joined; that is then, of course, the start of the call, indicated on this slide with a green border. We then have all the information we need: we have the start, we have the end, we even have the number of participants who joined; we basically even have a heat map of how many participants there were at each time. There's lots of data in there, and each client can decide on its own what exactly it wants to do with it and how to render it in the timeline. So, this is your part now. Are we on time? Thank you, Timo. Now we are going to look at implementing this, because client implementers also need help. If you are one of those people whose client already has the WebRTC parts implemented, you might be thinking: ah, I need to throw away all of the stuff I've already done. Not really. Timo showed this already, but there's this small RTC infrastructure bit which we are going to look into. This is MSC3401, well, kind of MSC3401: the m.call event has already been removed because it caused way too many glares. If you want to know more about that, you should watch Timo's Matrix Community Summit talk about why the m.call event had no ownership and caused way too many glitches. The first half is just the Matrix RTC stuff which Timo already talked about.
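The tagging and the backward walk described above can be sketched like this. The event shape (a memberships array plus the previous state under `unsigned.prev_content`) is assumed from the slides, and the helper names are mine; real clients would also have to handle partial updates to the array, which this sketch ignores.

```typescript
// Sketch of the join/leave tagging and backward call-history reconstruction
// described in the talk. Event shape is assumed from the slides.
interface MemberStateEvent {
  sender: string;
  content: { memberships: unknown[] };
  unsigned?: { prev_content?: { memberships: unknown[] } };
}

type Tag = "join" | "leave" | "none";

// Empty -> non-empty memberships is a join; non-empty -> empty is a leave.
function tagEvent(ev: MemberStateEvent): Tag {
  const prev = ev.unsigned?.prev_content?.memberships?.length ?? 0;
  const cur = ev.content.memberships.length;
  if (prev === 0 && cur > 0) return "join";
  if (prev > 0 && cur === 0) return "leave";
  return "none";
}

// Walk backwards from the present; a leave event means its sender was joined
// before that point. Stop when nobody was joined: that is the call start.
function reconstructParticipants(eventsOldestFirst: MemberStateEvent[]): string[] {
  const joinedBeforeHere = new Set<string>();
  const everyone = new Set<string>();
  for (let i = eventsOldestFirst.length - 1; i >= 0; i--) {
    const ev = eventsOldestFirst[i];
    const tag = tagEvent(ev);
    if (tag === "leave") joinedBeforeHere.add(ev.sender);
    if (tag === "join") joinedBeforeHere.delete(ev.sender);
    if (tag !== "none") everyone.add(ev.sender);
    if (tag === "join" && joinedBeforeHere.size === 0) break; // call start found
  }
  return [...everyone].sort();
}
```

The same walk yields the start timestamp (the event where the loop stops), the end (the last leave), and a per-instant participant count for the heat map the speaker mentions.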
The participants send their member events, so the room has a history of who joined when. Now, if you don't have an SFU, you can just declare the infrastructure, the focus, the backend in Matrix RTC, as mesh, and then you can potentially use the P2P MSCs which you already implemented, or hopefully will implement, for a mesh call. A mesh call is basically a P2P call between multiple participants; it's just not as scalable as you would think. But this way you can use your existing MSCs, your existing implementation, for mesh calls, and you don't even need an SFU. But if you are rich, and you do want to set up an SFU, then it gets much simpler. The SFU in our case will be LiveKit, and all of the signaling bits are now handled by LiveKit itself over WebSockets; the previous design did this over to-device Matrix events. The first half is the same, but basically all of the signaling part is now handled by LiveKit over WebSockets. More about LiveKit: I keep saying that SFUs are cool, but SFUs are also very expensive, and if you don't want anyone else to use your SFU you probably want some authentication in front of it. So if you are a home server owner or admin and you also host an SFU, then you will probably also host a JWT service, which basically receives an OpenID token obtained from your Synapse server. You send it to the service, the service validates that you are the one who generated that token, and then it generates a JWT for you, which you can use to authenticate with the LiveKit SFU. Right now I believe the service only checks that you are the one who generated the OpenID token, but there's already work going on for checking whether you are actually in the room, so that only people who are in the room, and actually want to join the call, can get access to the SFU. Some fancy stats: the LiveKit docs say that with around a 16-core Google virtual machine you can have calls with around 150 members.
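From the client side, the token exchange described above might look roughly like the sketch below. The endpoint path and request body shape here are assumptions for illustration only, not the actual API of the JWT service; only the overall flow (Matrix OpenID token in, LiveKit JWT out) is taken from the talk.

```typescript
// Sketch of the client side of the SFU auth flow: exchange a Matrix OpenID
// token (from /openid/request_token on the home server) for a LiveKit JWT.
// The endpoint path and body shape are HYPOTHETICAL; check the real service.
interface OpenIdToken {
  access_token: string;
  matrix_server_name: string; // the JWT service verifies the token with this server
}

function buildSfuAuthRequest(serviceUrl: string, openId: OpenIdToken, room: string) {
  return {
    url: `${serviceUrl}/sfu/get`, // hypothetical path
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ openid_token: openId, room }),
  };
}
```

The key design point is that the service never sees your Matrix password: it calls back to the named home server to confirm the OpenID token is genuine (and, per the work in progress, could also check room membership) before minting a short-lived LiveKit JWT.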
This is, I believe, 720p with no simulcast, just raw 720p feeds for 150 members. For my own testing I used a Hetzner CAX21, well, not personal, Famedly gave that to me, but it's a machine with four shared ARM vCPUs, and I could get around 70 participants with simulcast and 720p, everything optimized, I think. Ringing. You might not think ringing is important, but ringing is actually very difficult to get right, mainly because mobile operating systems are not really friendly and will try to kill your app at every possible second. This started as a GSoC project, by the way: a GSoC 2022 project at Matrix. It's basically a three-month window in which you have to complete a particular task. My task was actually implementing the whole WebRTC thing, but I implemented it in two weeks, and for the next two and a half months I had to fight ringing. You need to handle three cases: your app can be in the foreground, in the background, or terminated. By ringing I basically mean that if your app is in any of these three states, you need to be able to somehow make the application ring when you get a call. I pivoted three times, and we'll look at the three approaches. This is a story, yes, but hopefully client implementers can learn from it. This is the coolest part, which I wanted to show at FOSDEM: I did not know you could do this. This is Android-specific. As far as I know, iOS only has one way to do this: using CallKit, which drives the phone dialer app on iOS; I think WhatsApp also uses that. But it turns out Android also has a way to do this, called the TelecomManager, or the ConnectionService API. What you're seeing on screen right now is the Samsung OEM dialer application, and what the TelecomManager allows you to do is put any VoIP call from your application into the dialer, so you don't really have to handle the OS killing your app and so on, because the dialer already handles that. Then you get this fancy UI.
You see all of these buttons: hold call, Bluetooth, even the merge button works, and I didn't have to implement that. You also don't have to implement a new UI for holding calls, or for receiving another call while you're already in one. This was very cool, so why couldn't it be shipped? For this you need to add your app as a calling account in your dialer app, and that is a very hidden setting; I could not find a way to do it programmatically, and in some regions it's simply blocked. It's apparently a regional thing, so this could not go in. Frustrated by that, I went to try two: we just hack it. Apparently Android has two very nice things: "show on lock screen" and the "appear on top" permission. What we basically do, and we're running out of time, so this is going to be super fast, is invoke the "appear on top" permission, which brings your app to the top, and then you can use "show on lock screen". Even if your app is in the foreground or background and your screen is locked, you can potentially hack the app into coming alive. It does not work for terminated apps, and there is no way my coworkers would have let me merge this. Try three: fine, we'll do it the right way. By the way, if you're thinking this is the obvious solution, it was not obvious for me, because Famedly and FluffyChat are written in Flutter, and when I get a notification I would have to start the right Android bits, then start the right Flutter bits, then decrypt the event and then show the ringing: too much work. But it turns out, after two tries, that push notifications already do all of that for you, so we just use that now. You use the Firebase push or the UnifiedPush implementation; they start a worker for you and bring up the Flutter engine. A Flutter engine is basically something which is attached to your Android activity. Once the Flutter engine has started, you can just hook onto that.
You can hook a VoIP listener onto it, and then use it to see whether an invite event is coming in, and then you show your own UI. That works; I hope that's the right way to do it, please tell me if it is not. By the way, as I said, I used the m.call.invite events for this so far, but that's not a thing with LiveKit, because all of the LiveKit signaling happens over WebSockets. So there's a new MSC for that. It uses intentional mentions, so you don't spam your whole room with notifications; you can specify which user IDs you want to ring, and what your notification type is: it can be either a ring or a notification. S-Frame key sharing: no time, but SFUs need another lock, because the SFU setup uses S-Frames to stay secure. Trust me. Cascading: yes. Right now your calls are technically federated, so you could have a call inside one room with SFU one and a call inside another room with SFU two. The only main limitation right now is that all of the participants who want to be in one call need to be connected to the same SFU. With this you can also have secure deployments where you basically just have the left half of the diagram, and all of your communication stays within your organization, on the local network, and so on. But in the ideal future what we want is cross-SFU communication, where every home server has its own SFU and its own JWT service, all of the users from that home server connect to their own SFU, and the SFUs federate with each other; everything is federated. This already exists, by the way, but it's proprietary in LiveKit. So if someone from LiveKit is watching: please open source it so Matrix can use it. Probably not going to happen. And how do you implement this? There are two ways. The easy one: you can embed Element Call in widget mode.
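A sketch of the kind of ring-notification event the new MSC describes, using intentional mentions so only the listed users are alerted. The event type and field names below are my assumptions based on the talk's description; check the actual MSC before relying on them.

```typescript
// Sketch of a call-notify event using intentional mentions, per the talk's
// description of the new MSC. Event type and field names are ASSUMPTIONS
// and may differ from the actual proposal.
type NotifyType = "ring" | "notification";

function buildCallNotify(userIds: string[], notifyType: NotifyType) {
  return {
    type: "m.call.notify",
    content: {
      "m.mentions": { user_ids: userIds }, // only these users are alerted
      notify_type: notifyType,             // ring loudly, or notify quietly
      call_id: "",                         // whole-room call in this sketch
    },
  };
}
```

Because the mentions are explicit, clients that are not listed can safely ignore the event instead of ringing everyone in the room.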
I believe there are two SDKs right now, the Rust SDK and the React SDK, which already support widgets. So you can just use the iframe in your app; looking at you, Fractal people, do it already. And if you unfortunately don't support the widget API, then you have to go the hard way: you need to implement it using the native LiveKit SDKs, and LiveKit has a lot of SDKs: Flutter, Android, Swift, Rust, obviously Rust is there. That's it, thank you. Demos. By the way, you can join this demo. Timo, I think they can use develop.element.io. Yes. Basically, maybe you go ahead and show the... Ah yes. So they can sign in. You can either use, I should have written this down, develop.element.io or td-family.github.io/fluffychat. I promise you this is not a phishing attempt; I can show you the CI run from which I deployed it. Once you go there, just type in this alias and you should end up in a room where you can join a call with us. Could you repeat the URL? This is the URL. Timo, do you want to start it now? Yeah. Can people hear me if I talk without the microphone? Okay, then I'll just talk with it. I'm talking. Okay, perfect. So basically what I just did is start a call. And the cool thing now is that we really have the full new Matrix RTC stack implemented in Element Web, in Element X, and in Famedly, or FluffyChat. Can you hear me? Yeah, I can hear some weird sounds. So all of them can talk using this new stack. You have to go to develop.element.io, and there is a feature flag there. Oh, to be in the camera, makes sense. But in general, this is the big new thing now: everybody can, without doing anything highly crazy, just go to develop, activate the new group call experience, and then be able to use the new calls.
So basically what I just did is start a call, but I think I made it a private call, which is why it rang as well. So I am joining here. And TD now... Someone's already in the call. Yeah, that was me just joining, and I think maybe Kim is in there already. Oh, there are multiple people. Interesting. Well, that's Element for you. You have been seeing this for months now, but now we go to the fancy thing, FluffyChat. This started a month ago, so it's probably riddled with bugs, but, well, it works, yes. Kaboom. Nice. So it's really super cool that TD managed, in record time, to get FluffyChat into a state where we again have a federated multi-client system with group calls. This is one of the first few multi-client, I think it's the third time we've done it now, multi-client federated Matrix RTC calls. With screen sharing, apparently. Questions? Do you guys want to break it? How many people can still join? Oh, we are doing a test, might as well. Are there any questions? Does LiveKit send any events back to say who's talking? Oh yeah, there's actually lots going on in LiveKit. So the question was whether LiveKit sends any signaling back to let us know who's talking, and probably also who's showing video. There are lots of things LiveKit does; it's actually pretty sophisticated in that regard. There are even things like: if I upstream video but nobody is consuming my video, say we have a conference of 100 people and everybody has me in a tiny tile at the bottom, LiveKit communicates to my client that I don't even have to upload video at all. And that doesn't only work for uploading or not uploading video; it even works per resolution. So if lots of people consume me in just a tiny thumbnail, my client automatically notices that it only has to stream the thumbnail resolution.
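In the LiveKit JS client, the optimizations just described are exposed as room options. The sketch below shows the relevant flags (`adaptiveStream` and `dynacast` do exist in livekit-client, but check the SDK docs for your version); it is a plain options object, not a connected room.

```typescript
// Sketch of LiveKit room options enabling the optimizations described above.
// `adaptiveStream`: subscribers only receive the resolution they actually
// render (the thumbnail case). `dynacast`: publishers pause simulcast layers,
// or whole tracks, that nobody is subscribed to (the "stop uploading" case).
const roomOptions = {
  adaptiveStream: true,
  dynacast: true,
  videoCaptureDefaults: { resolution: { width: 1280, height: 720 } },
};
// With livekit-client this would be passed as `new Room(roomOptions)`
// before calling `room.connect(url, jwt)`.
```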
So there's lots of optimization happening, so that in the end, from a receiver's point of view, you basically just download what you actually see, and from a streamer's point of view, you also only upload what people actually need to see. Yes? Who hosts this LiveKit? You said that this is fully federated, but maybe I somehow missed the point where we talked about whose LiveKit service is used. Because in the previous iteration with full mesh, I thought the cool thing was that multiple Matrix servers are involved, and also multiple SFUs or whatever are involved. Now it seems like it's maybe the LiveKit server of the first one who initiated, or something. Yeah, so basically this is kind of two questions. The first part was who's hosting the LiveKit server, where is it coming from? If it's federated, there should be, similar to Matrix servers, multiple servers, and that's exactly what's happening. The idea is that in the future it becomes very, very common that next to your Matrix home server you also host a LiveKit SFU. It's similar to how lots of people also host a TURN server right next to their Matrix server. And the second part of the question was how do we decide which SFU to use. Of course, with what TD presented at the end, where you have the option that SFUs talk to each other, you would just always connect to the SFU of your home server, and if there are federated participants, the SFUs would figure it out between each other. For now there's actually a system where, exactly as you presented it, the first one who joins defines in their member event which LiveKit SFU to use, and then everybody jumps on that SFU. And since that means that if the first one leaves, and maybe others join but make the mistake of putting a wrong or different LiveKit SFU into their member event, we even have real-time switching between SFUs.
It's, I think, a one-second interruption you get, but it works really well: if the first one joins with SFU A and the second person has SFU B in their member event, then when the first one leaves the call, everybody immediately switches to the SFU of the oldest remaining participant. But I guess it's quite obvious this is mostly a workaround until we get to the point where the SFUs can exchange the streams directly between each other; that would of course be much more elegant, and then we wouldn't need this anymore. But for now this is exactly how it works, and because it's a very simple selection algorithm, just take the oldest call member state event, we can always guarantee everyone is on the same SFU, which is quite important for a call, of course. Does that answer the question? Yes, always. Do you see any technical difficulties with having recording or transcripts? So the question is about recording and transcripts, and whether there are technical difficulties around them. Basically, since this is Matrix, the ideal and easiest approach, UX-wise, would be that those kinds of things just happen as bots. Recording would happen as a bot: you can easily just have a recording bot, it is just another participant, it is part of the room, it gets into the key sharing, so it's very transparent for everybody that it's not just the participants but also the bot receiving the streams, and this bot would take care of recording. And since it's all based on LiveKit, and LiveKit is very, very good infrastructure already with amazing tools for this, recording should be fairly straightforward. The transcript question, which was also asked, is basically an implementation discussion.
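The "oldest call member wins" rule described above can be sketched in a few lines of Rust. The field names here are invented stand-ins, not the real call member event schema:

```rust
// Sketch of the "oldest call member wins" rule; field names are invented.
#[derive(Debug, Clone, PartialEq)]
struct CallMember {
    user_id: String,
    origin_server_ts: u64, // when this member's state event was sent
    sfu_url: String,       // the LiveKit SFU this member announced
}

/// Everyone connects to the SFU announced by the oldest member event,
/// so all participants deterministically agree on a single SFU. When
/// the oldest member leaves, recomputing over the remaining members
/// makes everybody switch, matching the behaviour described above.
fn sfu_to_use(members: &[CallMember]) -> Option<&str> {
    members
        .iter()
        .min_by_key(|member| member.origin_server_ts)
        .map(|member| member.sfu_url.as_str())
}
```

Because every client evaluates the same deterministic rule over the same state events, no extra coordination round is needed to agree on the SFU.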
You could also have a bot, and the bot could stream the data into a data channel, or the bot could stream the data directly into the room, because it's part of the room. Or you could say you don't want any bot to get the data, and you want to run local systems which do the transcription locally. There are multiple solutions for this; I guess we'll see what the future brings. This is amazing. I think somebody just joined the room with a... oh, somebody just unmuted. I thought that was somebody who had already implemented recording and was now playing it back live. That would have been so cool. I just got super excited, but I guess this is just my echo. Any other questions? So basically the current state is that it's just on develop, but it is ready to try out. Can you show the path to activate the new group call feature in Element? So if you go to develop, there is one feature flag. For now you will only get the option to do Jitsi calls and legacy calls, but if you want the new Matrix RTC calls, or Element Calls, you need to go into the settings, then feature flags, and there's a flag called, yeah, new group call experience. If you turn this on, on the sending and the receiving client, it should all work. And on Element X, the mobile client, Android and iOS, it also should just work; there you don't even have to activate a feature flag. You just go into the room, press join, and you should end up in the same call there as well. Actually, that's a part of the demo we could just do, right? Do you just want to join with that user? Yeah, this is actually also a thing we can show. It's easily possible to have multiple devices per user, which basically implies we have call continuity. So I was connected with this computer, and now I just connected here. Oh, I need to read this. It's dangerous.
So I'm connected here as well, and now you can't see any streams, right? It does show streams on my computer. Maybe they will recover. Yeah, it seems to not work here, but it works on this computer. I can turn it around, so at least the first row can be convinced that it's actually showing the stream right here. So if I hang up here, I have basically done a continuity handover, moving the call from here to here. Oh, and this is also pretty interesting. I'm not sure anyone can see it, because it's just on this screen, but Paul has joined with an older version of Element X. Currently, if you're in an unencrypted room, you will stream unencrypted media, and if you're in an encrypted room, you will have per-sender encryption; that's the part TD kind of rushed over. Basically, an older version of Element X doesn't consider this yet, so even though this is an unencrypted room, if you join with an older Element X, you still stream encrypted data. But my client doesn't expect encrypted data, and that's why it's giving me all kinds of noise. So basically, this is proof that it's actually encrypted. So what TD said is... Trust me, bro. It's always super hard to demo an encrypted call, but here we are. We managed to break it, and there you can actually see that it's encrypted. And the only reason this isn't an encrypted demo today is because we have different encryption implementations for that: I believe Element uses room events, and I decided to use to-device events, because why not? But this will get figured out once we start drafting MSCs and stuff. Exactly. Last question? All questions answered. Cool. Thank you so much.
The state of the Matrix Rust SDK in 2023
Hi, everyone. So today I'm going to talk about the state of the Rust SDK in 2023: all the things we've accomplished over the last year and some of our future plans as well. First of all, who am I and how did I get into the Rust SDK? I'm Benjamin Bouvier, a software engineer in the Rust team at Element. Prior to that, I worked at a game dev company on a game engine that was written in Rust and WebAssembly. And prior to that, I was a compiler engineer in the SpiderMonkey team, the JavaScript engine powering Firefox, where I did Rust and WebAssembly. So you can sense that there is a common theme here. Back in the days at Mozilla, we were using IRC, and I wrote a few bots that were just pulling jokes from the internet and posting them on the channels. At some point we decided to use this new cool thing called Matrix, so I rewrote my bots so that they could also run on Matrix, using JavaScript at the time, because when you work at Mozilla, you have to bet on JavaScript all the time. A few years later, I decided to rewrite them in Rust, because I like Rust. I made this framework called Trinity that uses Rust for interacting with Matrix, and then you can actually write the bot commands themselves using WebAssembly, which is pretty sweet. I even experimented with using it in production. It's mostly a fun project. And that's how I started to use the Rust SDK. So what is the Rust SDK? Very good question. It's a Rust library implementing the client-server API, to let you implement clients easily if you want to use Rust in your project. The code is available on GitHub under the Apache 2.0 license. It does all the things you would expect from a Matrix client: logging in, logging out, sending messages, receiving messages. But I guess the most interesting thing is that you get end-to-end encryption for free.
And you don't have to worry about the, excuse my French, gory details, in the sense that you don't have to learn about Olm, Megolm, sending and uploading your keys, claiming keys, querying keys and all of that stuff; we handle that for you. Some history for this Rust SDK. In the past there was a project called Ruma, for Rust Matrix, which modeled all the events that can happen in a Matrix room timeline, and also all the requests and responses to the endpoints. The goal at the time, I think, was to try to create a home server in Rust. Eventually that didn't happen for the Ruma project itself, but people realized it was a good idea to model all those events, requests and responses, and reuse them across other projects. Another Rust home server started to be written on top of it, and that is Conduit. And in another timeline of the world, there was Damir, who is now the team leader of the Rust SDK team at Element. He was doing Rust in his free time, and he maintained a small plugin so that you can use WeeChat with Matrix. That was written in Python, and as he was trying to learn Rust, he decided to rewrite it in Rust. And the thing is, well, he did so. He searched for a library written in Rust to do that, and there was none, so he decided to start one. That's how the Matrix Rust SDK started. From the outset it used Ruma, because it made sense, and that allowed reusing massive amounts of code, which was very nice. And Damir, being a crypto engineer, also implemented the whole crypto stack, which was very sweet. That crypto code was first in the Matrix Rust SDK, and then it was all pulled out and extracted as an independent library called vodozemac, which apparently means amphibian in Croatian. It's a big pun across languages: Olm, Megolm and all of these also refer to amphibians, it seems. And yeah, so that's how it goes. All right.
So why Rust, you might ask? Well, this is my minute for the Rust evangelism task force. I mean, you're probably convinced already if you're here, but it's at the same time high-level and super fast. It allows you to write code in a very fast fashion without having to worry about lots of low-level details and issues. It is secure and memory-safe, which is very nice for a library, because you want something very robust. It has amazing tooling and an amazing ecosystem: the packages, the crates, that are published on crates.io give you all the things you could want, and cargo, the tool that does it all, is just wonderful. You can run tests, build the documentation, and all of that. Also, very important for the rest of this talk: it is compatible with foreign function interfaces, so it can be called from other native languages that speak the C ABI. That's quite important, as we'll see. And one thing that is maybe a bit undervalued in the Rust community is that it also tries to empower you to write multithreaded code without you having to know too much about it, trying to make it very accessible. It's a value that was in the community first, and you can find it in all the places; it translates to everything in Rust, from the error messages that hold your hand and try to explain what you did wrong and how to fix the problem you ran into, et cetera. So it's very sweet to use. Being a former C++ programmer: there was this notice in one of the offices where I worked before that read, you must be this tall to write multithreaded code, and it was apparently at three meters high on the wall. This is something of the past. With Rust, you can just be fearless when you're writing multithreaded code, because there is this thing called the ownership model.
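As a tiny, self-contained taste of that fearless concurrency (unrelated to the SDK itself), here is a minimal example where the ownership model forces shared state behind `Arc<Mutex<_>>`:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// A minimal taste of fearless concurrency: the compiler only lets the
// threads share `total` because it is behind Arc<Mutex<_>>, so a data
// race is a compile-time error rather than a late-night debugging session.
fn parallel_sum(chunks: Vec<Vec<u64>>) -> u64 {
    let total = Arc::new(Mutex::new(0u64));
    let mut handles = Vec::new();
    for chunk in chunks {
        let total = Arc::clone(&total);
        handles.push(thread::spawn(move || {
            let partial: u64 = chunk.iter().sum();
            *total.lock().unwrap() += partial;
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    let result = *total.lock().unwrap();
    result
}
```

Removing the `Mutex` and mutating the shared counter directly simply would not compile, which is the point of the three-meter-sign anecdote.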
The ownership model makes it really easy to model concurrent implementations of anything, really. So that's really, really nice. So why the Rust SDK? Well, there was this story where we had three apps: the Android app, the iOS app, and the web version, which also powers the desktop version. They all were using a different SDK and a different crypto stack. That means that if you are serious about your security and you want to, for instance, audit your cryptography, you have to do it in three places and make sure that every single implementation actually does what it's supposed to do, which is a bit of a nightmare. And you also have per-platform issues: you can have a bug in one stack, and then you need to check whether the other stacks also have it, et cetera. Well, now we are saying no: we have only a single stack for the Element apps, and it's written in Rust. In particular, it's a single crypto stack. It has very high test coverage; as I'm speaking, it's more than 83% test coverage in the Rust SDK. The vodozemac library, the crypto stack, is being fuzzed as well, which is very important in terms of finding security issues. So it's a single place where you can add features: you code once and use it everywhere, the old Java dream that everybody knows and loves. All right, who's using it? There is Fractal, the GTK-based Matrix client. There is iamb, a terminal UI client, if you like Vim bindings and all of that. There's the new generation of Element apps: the Element X apps are only using that, which is pretty sweet. And since the crypto stack could be extracted, there are specific bindings just for the crypto stack, so it could be used in the current generation of Element apps, under the codename Element R. I guess you can imagine what the R stands for at this point: Rust. All right. So what happened since the last FOSDEM?
Well, the previous release of the Rust SDK was in October 2022, and we made a new release this year. Yay! At the beginning of this month. Thank you. It's still not 1.0, still quite experimental; we're breaking APIs all the time, but we're trying to do a better job at writing changelogs and all of that, and we'll see how it goes. Now, new features. You probably heard about sliding sync last year. It's the new kind of synchronization that makes it so that logging into a new device and retrieving events is always instant, even if you haven't opened the app for months or years. We entirely support that. There is the basic feature where you can subscribe to specific rooms, and lists of rooms of which you get a sliding window computed by the server, but we're getting rid of that part, as Matthew said. It also implements a modular design, in the sense that you have opt-in extensions for read receipts, typing notices and many other things. All of that is supported in the SDK. As you can see on the right, it's quite verbose, because it's a very versatile and general API meant to give you the most control, so that you can build higher-level primitives on top of it; we'll get back to that. It's gated behind the experimental sliding sync cargo feature, and we basically use it in production in Element X, so it's quite stable, actually. There's also support for OIDC, OpenID Connect. It's a cross-stack effort, moving from the custom Matrix authentication to OpenID Connect. The Matrix Authentication Service is another service running on your server alongside Synapse or your home server; it can act as an actual OIDC provider or as a specialized proxy to an upstream provider. So if you have a GitLab instance, for instance, you can connect it to the Matrix Authentication Service and then have your GitLab users log into Matrix for free, like that. And so that's the server-side part.
It's also written in Rust, which is pretty sweet, because that means the request and response types can actually be reused in the client, in the Matrix Rust SDK. The SDK implements all of that already, and we are also using it in production in Element X. It gives you all the things you would want to do with OIDC: create and reload metadata, register your OIDC client, do the login flow in all its steps, and all of that. It's also behind a cargo feature at this point. Among the big news, we have a new default storage backend. The storage backends are implemented using traits, which are Rust's take on interfaces. The previous default, when you wanted to persist things on disk, was sled, and now it's been replaced with SQLite, because, well, pretty much everybody knows about SQL, and it's also much faster for our use case. We still have an in-memory backend if you don't care about losing state, and an IndexedDB backend that is used when you're compiling for the web, to WebAssembly. Some new cryptography features. There is this new thing called secret storage. It's mostly an implementation detail, but it gives you an encrypted key-value store backed by the user's account data, where you can put any information that you would like to share across all your devices in a secure way. The server doesn't know about this information; it cannot peek into it, because it's encrypted. On top of that, we implemented key backup and restoration. That means that when you have a new device, when you're using Element X, for instance, it will store all the room keys that are used for decrypting messages in encrypted rooms in the secret storage, and another device can then restore them, so you can actually see the history of events from before you joined with that new device. In addition to that, we made it so that cross-signing automatically happens, and you don't have to worry about it at all.
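The "storage backends are traits" idea can be sketched with a deliberately toy trait. The real trait in the SDK is much larger and async; everything below is a simplified assumption made up for illustration:

```rust
use std::collections::HashMap;

// Toy sketch: storage backends behind a trait. The SDK's real trait is
// much larger (and async); this only illustrates the swappability idea.
trait StateStore {
    fn save(&mut self, key: &str, value: &str);
    fn load(&self, key: &str) -> Option<String>;
}

/// In-memory backend, analogous to the SDK's "don't care about losing
/// state" store. A SQLite or IndexedDB backend would implement the
/// same trait and be dropped in without touching the calling code.
#[derive(Default)]
struct MemoryStore {
    map: HashMap<String, String>,
}

impl StateStore for MemoryStore {
    fn save(&mut self, key: &str, value: &str) {
        self.map.insert(key.to_owned(), value.to_owned());
    }

    fn load(&self, key: &str) -> Option<String> {
        self.map.get(key).cloned()
    }
}
```

Swapping sled for SQLite was possible precisely because callers only depend on the trait, not on any concrete backend.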
Cross-signing is what's used to verify your own devices and other people's devices, and some of its private keys are stored in that secret storage as well. Speaking of high-level primitives, we made a new crate, a new package, called matrix-sdk-ui. It is highly experimental and also highly opinionated, in the sense that we enable a few cargo features by default and we try to implement the best practices in terms of user experience and performance. It's as robust and tested as the rest of the SDK, which is very sweet. We use sliding sync as the foundation for all these new high-level features. One of these features is the room list service, which, as its name suggests, gives you a list of the rooms. It does so in a way that tries to show something to the user as soon as possible. That's why the app feels kind of instant when you open it: it will try to load just one event for all the rooms you were in, or at first for a few of the rooms you were in, so you have something to display, and then, once that's done, it will fetch more events in the background. You can also configure it to say: this is the set of visible rooms in my app. Because in an app you cannot show, like, a thousand rooms; you will only show a subset, right? So you can configure it to say, these are the ones that are actually rendered on the screen, and those are prioritized so that you get more events for those rooms. Another thing we added is the encryption service. It's basically a sliding sync instance that just runs the encryption work on the side, and it gives you more concurrency with the other service. Think of it this way: the room list service, the one I just talked about, changes the list of rooms shown on the screen when you're scrolling in a mobile app, right? And that means it's sending new requests to ask for things.
If we did the encryption in the same request, and it's getting a bit technical here, that would mean we would need to abort those requests and thereby delay encryption. So now we have basically more concurrency and more performance: we can do the encryption work in the background while you're still scrolling through the room list, using this encryption service. We also have a notification service. That's a very specialized client that just handles push notifications. Given an event identifier and a room identifier, we want to retrieve the event and maybe a bit of context, like the room name, the name of the person who sent the message to you, and all of that. It's also using sliding sync for that, and it makes use of the encryption service, because in an encrypted room, of course, you get a push notification for an encrypted event, and the server cannot know whether it's a meaningful event, right? Maybe it's just a reaction putting a thumbs up on one of your messages. So we decrypt the event in the client itself and then decide whether it's worth showing as a notification. The one fun thing, if you can call it fun, is that on iOS, if you want to modify the notification in case it's encrypted, that runs in a separate process. And that makes our life very hard, because even if you're just decrypting data, the state of the cryptography keys is mutably changed, right? So now we have state that is global across two processes sharing the same database. We had to be a bit creative to solve that issue: we basically enable the write-ahead log in SQLite and use some data in the database to indicate which process is currently trying to read and write to the database, so basically implementing a lock like that. All right. Since we added those two services, the encryption service and the room list service, we wanted to make it very simple to just fire up synchronization and forget about it.
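The cross-process coordination trick described above boils down to an advisory lock: exactly one process may hold the writer role at a time. Here is a hedged, file-based analogue in plain Rust; it is purely illustrative, since the SDK actually coordinates through the SQLite database itself:

```rust
use std::fs::OpenOptions;
use std::io::ErrorKind;
use std::path::Path;

// Illustrative analogue only: the SDK coordinates through SQLite, but
// the idea is the same as an advisory lock where exactly one process
// may hold the writer role at a time.
fn try_acquire_writer_lock(path: &Path) -> std::io::Result<bool> {
    // `create_new` fails if the file already exists, which makes the
    // acquisition atomic: only one process can win the race.
    match OpenOptions::new().write(true).create_new(true).open(path) {
        Ok(_) => Ok(true), // we are the writer now
        Err(err) if err.kind() == ErrorKind::AlreadyExists => Ok(false),
        Err(err) => Err(err),
    }
}

fn release_writer_lock(path: &Path) -> std::io::Result<()> {
    std::fs::remove_file(path)
}
```

The same shape (atomic claim, explicit release) applies whether the shared marker lives in a file or in a row of the shared database.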
So we made a nice high-level service that just wraps the other two: you can build it and start it, and it will do all those things for you and implement all the best practices, and you don't have to worry about any of this. You can then just attach listeners to that service and get information that is meaningful for rendering in a client. Now that we have a list of rooms and decrypted events, what do we do? Well, we want to display them, and we have an API for that called the Timeline API. It's basically a view of a room, model-view-controller on steroids. The thing is that in the Matrix protocol, events are atomic; a room is an append-only database. So let's say you have a thumbs-up reaction to a message that is a response to something else: that would be two events, the reaction itself and the message itself. The timeline will aggregate all those different events into a single timeline item, which is much closer to what you want to render on the screen as a client. So it makes it much simpler to render a timeline. And it does a lot of other things for you too. It can handle local echoes: when you're sending a message to a room, you want to show it even before the server has acknowledged that it received it. It will do that, and then reconcile the response from the server with the local state and all of that. So it's pretty sweet. And it's all observable, very reactive, which is nice: as a user of that API, you get a notification that one item has been added, removed or updated, and you can just react accordingly. So how is this all used in Element X? We're using a Mozilla project called UniFFI. It automatically creates bindings for calling into Rust from other languages. At this point, we generate bindings for Swift on iOS and Kotlin on Android. It can also generate bindings for other languages; we use that for Go, for testing purposes, I think.
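The aggregation idea behind the Timeline API can be sketched in a few lines of Rust. The event shapes below are simplified stand-ins invented for illustration, not the real Matrix or SDK types:

```rust
// Simplified stand-ins for real Matrix events, to show the folding idea.
#[derive(Debug, Clone)]
enum Event {
    Message { event_id: String, body: String },
    Reaction { target_id: String, key: String },
}

#[derive(Debug, Clone, PartialEq)]
struct TimelineItem {
    event_id: String,
    body: String,
    reactions: Vec<String>,
}

/// Folds an append-only list of events into renderable timeline items:
/// messages become items, reactions get aggregated onto their target.
fn build_timeline(events: &[Event]) -> Vec<TimelineItem> {
    let mut items: Vec<TimelineItem> = Vec::new();
    for event in events {
        match event {
            Event::Message { event_id, body } => items.push(TimelineItem {
                event_id: event_id.clone(),
                body: body.clone(),
                reactions: Vec::new(),
            }),
            Event::Reaction { target_id, key } => {
                if let Some(item) =
                    items.iter_mut().find(|item| &item.event_id == target_id)
                {
                    item.reactions.push(key.clone());
                }
            }
        }
    }
    items
}
```

The real timeline also handles edits, redactions and local echoes, but they all follow this same pattern of folding raw events into display items.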
It requires a bit of integration with the foreign language's runtime, and over the years we've contributed a few PRs to this project. We made it so that you can just use procedural macros for exporting your types and your impl blocks to other languages. We also added support for async code this year, so you don't have to block when calling into an async function on the Rust side; it will just look like an async function on the Kotlin or Swift side, and you get actual concurrency and background processing, which is pretty sweet for performance. And reactive programming in Rust, how do we do it? Well, the principle of reactive programming is that you have some data and you want to make it observable, so people can subscribe to it and get notifications. I mentioned the Timeline API, which notifies you when a timeline item has been added, removed, et cetera. We're using crates that we created ourselves, eyeball, and there's also an extension that is diff-based for collections. Because when you have a vector with a thousand entries in it, you don't want to say: oh, there's a new thing that has been pushed into the vector, here are all 1,001 entries of that vector. No, you just want to hear that there's a new entry, and what its position is, right? It also has some extra querying facilities. You can batch all these diff updates, so you don't have to cross the FFI language boundary too often; that has an inherent overhead that we want to avoid. For your batches to be quite precise, you also need transactions to say: this is the beginning of the batch, this is the end of the batch. And you can also do some filtering on these streams of updates, limiting, sorting; it kind of maps to things you would do in SQL in general.
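A toy version of the diff-based observable collection idea might look like this. `VectorDiff` here is a simplified stand-in invented for the sketch; the real crates also batch, filter and sort these updates and deliver them over streams:

```rust
// Toy diff-based observable vector; `VectorDiff` is a simplified
// stand-in for what the real crates deliver over a stream.
#[derive(Debug, Clone, PartialEq)]
enum VectorDiff<T> {
    PushBack(T),
    Remove(usize),
}

struct ObservableVec<T: Clone> {
    items: Vec<T>,
    // In a real implementation subscribers would receive these diffs
    // over a stream; here we just record them in order.
    diffs: Vec<VectorDiff<T>>,
}

impl<T: Clone> ObservableVec<T> {
    fn new() -> Self {
        Self { items: Vec::new(), diffs: Vec::new() }
    }

    fn push_back(&mut self, value: T) {
        self.items.push(value.clone());
        self.diffs.push(VectorDiff::PushBack(value));
    }

    fn remove(&mut self, index: usize) -> T {
        self.diffs.push(VectorDiff::Remove(index));
        self.items.remove(index)
    }
}
```

Sending only the diff, rather than the whole collection, is what keeps FFI boundary crossings cheap for a room list with thousands of entries.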
It's pretty sweet, and that's what we're using, for instance, to filter the rooms in the room list immediately on the client side. All right. Some of the future work that we're going to do; I intentionally remain a bit vague here, but we're going to eventually support all the major features a Matrix client would expect. We are already working on post-quantum cryptography; as of today, I think there has been a PR against vodozemac to have something that is compatible with libsignal and what they do, so that's pretty exciting. And there is a general theme of doing more things client-side. When you have end-to-end encryption, your server kind of becomes dumb sometimes, because it cannot peek into the encrypted events, and so you have to resolve a lot of things on the client side. If you get a new event in an encrypted room, does that trigger a notification? Well, the server has to push a notification, and it's the client that will decide whether or not it resolves into an actual notification. Even sorting the room list has to be done client-side: if you want to sort by room activity, just show me the rooms that have some activity, it's the same thing. If the event was encrypted, you don't know whether it was just a thumbs-up reaction, which maybe doesn't justify putting the room at the top, or something meaningful like an actual message. So that means this task has to be done on the client now. We're also computing the unread badges client-side in the Rust SDK, and we are trying to be very careful not to get into stale-notification situations, because that's a pain for everyone, us included. And yeah, that's pretty much it. All right. Just a few thanks. First, to all the contributors of the Rust SDK, with a special shout-out to Kévin Commaille from the Fractal community.
He's done a bunch of work in the Rust SDK, including most of the support for OIDC on the client side, which was a massive PR. If you want to be on this slide next year, you can contribute: we have a few issues that are tagged as good first issues, or help wanted, if you want. And I would like to take this opportunity to thank Element for donating all of my work time to the Matrix organization. You can also be a supporter of Matrix if you want, by following one of these two links. Thank you for listening, and I would be happy to answer any questions if you have any. The internet is asking: why have you moved away from sled? Why have we moved away from sled? That's a good question. So, in terms of performance: sled, if I recall correctly, and I wasn't there when that happened, so it's kind of hard to answer precisely, is an embedded key-value store, and the performance was not great, especially on mobile devices. We just figured that using SQLite, which has been performance-tested and improved and tuned over the years, was the right thing to do. Also, the way you structure your data in a SQL database is quite different from the way you would structure it with a key-value store, and it's slightly easier to perform requests when you have a SQL database, because you know all of that. Yeah. Any other question? The internet also asks: how is your developer experience when using UniFFI in general? Are there any rough edges? That's a good question. So yes, when using UniFFI for calling Rust from other languages, have there been rough edges? Yes. There have been a few cases where we had an identified memory leak. Kotlin uses the JVM, and the JVM has a garbage collector, and we accidentally, and when I say we, I think it's the UniFFI group in general, introduced some leaks by having the equivalent of promises or futures leak sometimes.
So that was a problem, but usually, I would say 90% of the time it's stable, and the 10% of the time where there is an issue, it's high priority for us, because obviously it breaks our apps. So we try to fix it as quickly as possible and we contribute back. But most of the time it works fine for Kotlin and Swift. The support and stability are also per language, I suppose, since you have to create bindings for each language. So yeah, I cannot speak for the Python or Go binding generators on the UniFFI side. But since Mozilla also uses UniFFI, they have to provide high stability guarantees as well, so they are pretty reactive at fixing bugs. It's working well. Yes? I was wondering about the startup times. Yeah, so the question was: what about startup times for the Rust SDK? I think there were two questions: the first one about just starting the SDK itself, and then, when you're syncing a list of rooms, do you get an instant response and all of that? Well, it's native code, so you don't have to boot up an entire VM for the SDK itself, so it's pretty fast. It will restore the state from disk, and that can be a slow step, but even for users who have thousands and thousands of rooms open, and I'm looking at Matthew on the side of the room, our general benchmark runner, it's pretty fast. For receiving a room list, we are also tracking this performance over time. Pretty much instant. And every time there is an improvement that needs to be made, we'll do it. I mean, we've gone from synchronization times of about five to 20 minutes, if you are a very heavyweight user of Matrix, down to about three seconds now. So consider that an improvement. Any other questions? Yes? What's the state of support for extensible events in the Rust SDK? So I think that's a question for Ruma.
And since we're using Ruma for parsing the events, and the Rust type system is quite expressive in the sense that you can have union types, for each event that can be extended I suppose there is a variant in that union type that says it's a custom event. If you're referring to a specific MSC, I don't know what it is, and I'm sorry about that. Was that a custom MSC? No. No. Okay, just events in general. So yes, you will end up in this case where you match on this union type for the event and it will say, well, it's something I don't know about, so I'm just handing it over to you and you do something with it. Yes? Can you also compile the Rust SDK for the web? I'll rephrase this question as: are there plans to use the Rust SDK for the web? Because it's not used there yet. So right now, as we are speaking, as of last week, people have enabled it by default for new logins on Element Web, I think, or that may be the nightly version, using the Rust cryptography for Element Web. We have a separate repository for bindings that are for WebAssembly, because there's no point in using UniFFI for that: we can directly compile the Rust to WebAssembly, so no need to have an intermediary in the middle. And I think the long-term goal is to use the Rust SDK everywhere, for the Element apps at least. So don't take my word as granted, but I think that this is going to happen. Yeah. Any other question? Yes? That's a very good question. So the question is: is search in scope for the Rust SDK, and what kind of features would be out of scope for the Rust SDK? For search, that depends whether you mean room search or message search, full-text search. Well, actually it doesn't depend, because the answer for both is yes, we're going to try to take care of that. For full text, there was a previous client made by Element called Hydrogen.
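The fall-through for unknown event types can be pictured with a tiny invented enum. The real Ruma types are far richer; the names here are illustrative only.

```rust
// Hypothetical, simplified event union; Ruma's real types are much richer.
#[derive(Debug, PartialEq)]
enum TimelineEvent {
    Message { body: String },
    // Unknown/custom event types keep their raw type name and JSON payload.
    Custom { event_type: String, json: String },
}

fn describe(ev: &TimelineEvent) -> String {
    match ev {
        TimelineEvent::Message { body } => format!("message: {body}"),
        // The SDK does not drop events it cannot interpret; it hands
        // the raw data back to the caller to deal with.
        TimelineEvent::Custom { event_type, .. } => {
            format!("unknown event of type {event_type}, handing it over to you")
        }
    }
}

fn main() {
    let ev = TimelineEvent::Custom {
        event_type: "com.example.poll".to_string(),
        json: "{}".to_string(),
    };
    assert_eq!(
        describe(&ev),
        "unknown event of type com.example.poll, handing it over to you"
    );
}
```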
That was a web client that could do that, and it had a fancy system to actually index the messages on your client and then share parts of the index with your other client devices. So we're probably going to reuse and reimplement some of that in the Rust SDK at some point. Yeah. In terms of what features are out of scope for the Rust SDK, it's kind of hard to tell, but I think that everything that is high-level UI related, like rendering widgets, not in the sense of the widget API but actual UI widgets and things like that, is not something that we want to implement or provide. And then I think that the MSCs that have proven to be not very useful will probably not be implemented. It's not clear what's not in the roadmap at this point. Sorry, it's not a very satisfying answer, but yes, question here. So the question was: the Rust SDK can store a lot of data if you're listening to lots of events, and is there any way to limit the amount of data that is stored on disk? Well, as I was saying, the storage is implemented as a trait, so one could always implement a different version of the SQLite backend and decide to drop items at some point. One thing that we want to add is the ability to store events locally, and that's connected to the previous question: if you want to be able to do full-text search, you have no other choice but to decrypt all the events and store them locally, at least in memory for some time, to do the indexing. And then the indexes have to go to the disk, and that means that the size of the index can grow a lot. So we would probably have to implement some kind of garbage collection and say, well, we forget about old data, older than a month or a year or something like that, and we only care about the most recent data. All right, thank you very much.
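The "storage as a trait" idea, with a custom backend that caps its size by dropping the oldest items, could look roughly like this. The trait and all names are invented for illustration; this is not the SDK's real storage API.

```rust
use std::collections::VecDeque;

// Hypothetical trait standing in for the SDK's storage abstraction.
trait StateStore {
    fn save_event(&mut self, event: String);
    fn events(&self) -> Vec<String>;
}

// A backend that bounds disk/memory use by evicting the oldest events.
struct BoundedStore {
    cap: usize,
    buf: VecDeque<String>,
}

impl BoundedStore {
    fn new(cap: usize) -> Self {
        BoundedStore { cap, buf: VecDeque::new() }
    }
}

impl StateStore for BoundedStore {
    fn save_event(&mut self, event: String) {
        if self.buf.len() == self.cap {
            self.buf.pop_front(); // drop the oldest item, as a GC policy might
        }
        self.buf.push_back(event);
    }

    fn events(&self) -> Vec<String> {
        self.buf.iter().cloned().collect()
    }
}

fn main() {
    let mut store = BoundedStore::new(2);
    for e in ["a", "b", "c"] {
        store.save_event(e.to_string());
    }
    assert_eq!(store.events(), ["b", "c"]); // "a" was evicted
}
```

A time-based policy ("forget data older than a month") would swap the length check for a timestamp comparison, but the trait boundary is the same.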
A microkernel-based orchestrator for distributed Internet services?
And with this I hand over to our next speaker. Can you hear me? Can you hear me? Okay. Hi everyone and thanks for being here. It's great to be able to speak here. So I'm Alex, and this is my presentation on some very high-level and pretty speculative ideas I've had for how we could use microkernels to build distributed systems and to host websites. I'm part of an association which is called Deuxfleurs, and at Deuxfleurs we have some infrastructure which looks like this. We have some very low-powered computers like this one which are hosted at home. At home, of course, we have possible issues like the power going down or the internet being cut, so we have machines at different locations in Belgium and France. And the idea is: okay, we have this infrastructure which is pretty fragile, but maybe we can just put all these nodes together and build a system out of it. This is actually what the Deuxfleurs infrastructure is doing: we have email, we have websites, we have instant messaging and a few other things running on these very basic machines. So currently our infrastructure looks something like this. The idea is not to spend too much time on the details, but basically on the right end here we have the actual applications that we're interested in running. For instance we have Element for chat, we have Jitsi for video conferencing, CryptPad, other things. And to run all these applications we currently need this whole huge stack. It's based on a Linux OS and NixOS for declarative configuration. Then we have this platform stack here, which is based on an orchestrator called Nomad, which is a bit like Kubernetes but a bit simpler and, I'd say, probably easier to use. But still we have all these different components, which are basically... storage systems. Garage is one that I'm building myself. And we basically pull these... software.
And if we look more closely at what's happening on a single node, actually it's kind of a huge mess. So this is the operating system running on one of these... Here we have all these management tools. Things that are... So yeah, from a conceptual point of view, not to enter into too much detail: let's say for instance we have internet traffic coming to our server to request some information. It's going to traverse a reverse proxy, which is going to do TLS encapsulation. Then it's going to go through an HTTP link to the actual backend, which is going to talk with specialized protocols to the storage layer. And basically we can describe all of these things with boxes, and arrows connecting these boxes. So the idea is that this model of boxes and arrows is actually the model of microkernels. Boxes are... [inaudible] ...isolating memory between different processes, sharing the CPU time, and also controlling hardware access. This is the fundamental thing that only the kernel can do: separate the resources of the computer at the CPU level between the different things that are going on. And then the microkernel will also provide some IPC mechanisms, like message passing or shared memory. [long inaudible section] ...I've made things, connecting things very explicitly only when they need to be connected. So this diagram is what's running on one node, but maybe we can include some form of network transparency to make this more into a distributed system.
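In ordinary Rust, that boxes-and-arrows picture can be sketched with threads and channels. This is only an analogy with invented names; in a microkernel the boxes are isolated processes and the arrows are kernel-mediated IPC, not in-process channels.

```rust
use std::sync::mpsc;
use std::thread;

// Two "boxes" connected only by explicit message-passing "arrows":
// a reverse proxy forwarding a request to a backend and relaying the answer.
fn roundtrip(request: &str) -> String {
    let (to_backend, backend_in) = mpsc::channel::<String>();
    let (to_proxy, proxy_in) = mpsc::channel::<String>();

    // The backend box: receives a request, sends back a response.
    let backend = thread::spawn(move || {
        let req = backend_in.recv().unwrap();
        to_proxy.send(format!("response to {req}")).unwrap();
    });

    // The proxy box: forwards the request and waits for the answer.
    to_backend.send(request.to_string()).unwrap();
    let resp = proxy_in.recv().unwrap();
    backend.join().unwrap();
    resp
}

fn main() {
    assert_eq!(roundtrip("GET /"), "response to GET /");
}
```

The point of the talk's analogy is that each arrow is explicit: a box can only talk to boxes it has a channel to, which is exactly the discipline a microkernel enforces with IPC capabilities.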
There would be some impact on performance, and we also need to be quite careful about that. Okay, so there's still time for some questions, comments, whatever. Okay, I might have one question. So the use case should always be the god, the thing that dictates what the architecture should really look like. So what do you have in mind in this area: something safety-critical or security-critical, or really just an average information system? Yeah, we're doing this in the association. I mean, security is important because we're handling people's personal data, but I wouldn't say it's a security-critical infrastructure per se. But of course, one of the advantages of such an architecture is that security is easier to build in a robust way, because we have much more control. Okay, thanks. So, a quite natural follow-up question, we've probably seen it in this discussion here: how do you persuade the average guy to stop using their Linux distribution and start using your architecture? I think this is going to be very long work before we can get to that point. But the hope is that this system is both more robust and easier to use, because we can probably get rid of some complexity. And if we get to a point where there's good tooling around this, and where there are a lot of examples already running and it's easy to get your own started, then I think we can really have something that attracts people. But yeah, of course, it's a long road before we get there. Thank you. Any more questions or comments? I don't see anything, so thanks for the talk.
Is Toro unikernel faster for MPI?
Okay, if I may have your attention again, it's time for the next talk by Matias Vara Larsen, this time about his unikernel and how it can run MPI code faster. The floor is yours. So hello everyone, can you hear me well? Okay, thank you. I'm Matias Vara Larsen, and in this presentation I'm going to talk about deploying MPI applications using the Toro unikernel. This is exploratory work, an area I am still investigating, so at the end of the presentation feel free to ask me any questions, because I'm still benchmarking things and I'm not quite sure where I'm going. First I would like to present myself: I'm fascinated by operating system development and virtualization, and I have been working in these companies. This is my email, and I have a profile if you want to get in touch or see some of my projects. This current project is not related to my current work; it's something that I'm just doing when I have some free time. I would like to start with my intuition about what an MPI application is. I am not an expert on MPI, so this is what I have understood in the two years I have been working on this. It is an application that compiles with an implementation of the MPI standard; there exist several implementations of the MPI standard. The standard defines a set of APIs to synchronize and communicate parallel instances of the MPI application. For example, we have APIs like MPI_Barrier, MPI_Bcast and MPI_Allreduce, to name some of them. My impression is that only performance matters when we deploy MPI applications, and I have a feeling that virtualization is not very popular in HPC, at least my impression is that this is because of the overhead it adds.
So my thought was that MPI applications might benefit from unikernels, because, for example, syscalls are expensive, and in unikernels we remove them, they become function calls. Threads are cheaper than processes; you may know that in a unikernel we are not switching the page directory every time we do a context switch. Depending on your application, you can completely remove the scheduler, because you are going to run only one thread per core or something like this. And you can rely on communication over shared memory, for example, in the case of unikernels. And sometimes, this is something that I just added, they perform better than a general-purpose operating system, and I say "sometimes" because you can often tweak your operating system to reach good performance, let's say. So this is the diagram of the components that are involved when you deploy an MPI application using a general-purpose operating system. In this case I am assuming the MPI application is running in a virtual machine, but the diagram is more or less the same in case it is bare metal. What we have is your MPI application, which compiles with an implementation of the MPI standard, for example OpenMPI, and OpenMPI is going to use some syscalls to communicate with the operating system to get services like scheduling, file system, networking and so on. So what unikernels propose is: let's remove those layers. [...] Now, about the scheduler. The scheduler in Toro is quite simple, and also, well, here is how Toro creates threads: you have a dedicated API called BeginThread, but there is a parameter that tells where the instance is going to run, so you have to set which core you want that function to run on. Otherwise it is always going to choose the booting core.
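The BeginThread-with-a-core-parameter idea could be sketched like this. Toro itself is not written in Rust, and all names here are invented; this only mirrors the shape of the API described in the talk.

```rust
// Hypothetical sketch of a BeginThread-style API: the caller chooses
// the core a new thread will run on, and with no choice the booting
// core (core 0) is used.
struct Core {
    id: usize,
    ready: Vec<fn()>, // per-core ready queue, touched only by this core's scheduler
}

struct Kernel {
    cores: Vec<Core>,
}

impl Kernel {
    fn new(ncores: usize) -> Self {
        Kernel {
            cores: (0..ncores).map(|id| Core { id, ready: Vec::new() }).collect(),
        }
    }

    fn begin_thread(&mut self, entry: fn(), core_id: Option<usize>) {
        // No explicit core: fall back to the booting core.
        let id = core_id.unwrap_or(0);
        self.cores[id].ready.push(entry);
    }
}

fn noop() {}

fn main() {
    let mut k = Kernel::new(2);
    k.begin_thread(noop, Some(1)); // pinned to core 1
    k.begin_thread(noop, None);    // lands on the booting core
    assert_eq!(k.cores[1].ready.len(), 1);
    assert_eq!(k.cores[0].ready.len(), 1);
}
```

Because every thread is bound to one core at creation time, each core's ready queue is only ever touched by that core, which is what removes the need for cross-core locking in the scheduler.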
The scheduler is quite simple, it is a cooperative scheduler: a thread is going to do something and then call ThreadSwitch, which invokes the scheduler. Each core's scheduler is completely independent of the others; there is no communication between the instances. And the algorithm is quite simple: it chooses the next thread that is ready, no more than that. The idea behind that was to have instances of the kernel that don't require any mechanism to synchronize between them, so there is no spinlock or anything like that, and all access to kernel data is lock-free. So I just talked about the scheduler, now I am going to talk about the memory. Toro's memory is also dedicated: when the kernel initializes, it splits the memory into regions, and then all allocations happen from the region of the corresponding core. The splitting is quite simple, it just splits by the number of cores: if you have two cores, you have two regions, three cores, three regions, and so on. For the moment this algorithm is quite simple and could surely be improved. For example, since each region is assigned to a different core, the way we implement the memory allocator doesn't require any synchronization between the cores, keeping the same idea that each instance runs independently of the others. So when a thread allocates memory, it always comes from the same region, and this doesn't require any synchronization between the cores. The idea behind this is also to try to leverage technologies like NUMA, where you have non-uniform memory and faster access to some regions. In a general way, all the kernel data in Toro is in per-CPU variables, which means that no synchronization between the cores is required to access kernel data.
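The even split of memory into per-core regions described above is simple arithmetic. A sketch, with an invented function and byte offsets standing in for physical addresses:

```rust
// Minimal sketch of splitting memory evenly into per-core regions at
// boot; the real allocator is surely more involved.
fn region_for_core(total_bytes: usize, ncores: usize, core: usize) -> (usize, usize) {
    let size = total_bytes / ncores;
    let start = core * size;
    // Allocations on `core` are served only from [start, start + size).
    (start, start + size)
}

fn main() {
    // 1 KiB split across 4 cores: core 1 owns bytes 256..512.
    assert_eq!(region_for_core(1024, 4, 1), (256, 512));
}
```

Because a thread never allocates outside its core's region, two cores can never race on the same free list, which is the lock-free property the talk describes.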
And to access these per-CPU variables faster, we use the GS register, for example; this is an improvement that I did a couple of months ago. We have a table, and access is faster through the GS register, which points to that table. I don't remember exactly the mechanics, but I think I wrote a blog post about that, and all the access is lock-free. The only moment we require synchronization between the cores is when we want, for example, to create a thread from one core on another core: we need to synchronize the cores somehow to migrate a thread, something like this, but that's the only moment we need it. Otherwise all the instances are completely independent. And to end the principles of Toro, I am going to talk a bit about core-to-core communication. Even if, as a user, you can implement anything you want over shared memory, I decided to implement VirtIO over shared memory, so each core has a set of virtqueues that allow it to get data from a remote core and send data to another core. It was a bit for fun to do it like this; I wanted to see if I could implement VirtIO this way. And the idea is that the communication is core-to-core, so we don't have only one queue per core: you have as many virtqueues as you need to communicate one-to-one with each core. I don't know how to say it exactly, but this means you don't require any protection to send, or to keep exclusive access to these virtqueues, because each has only one consumer and one producer. Relying on this mechanism, I could then implement APIs from MPI like MPI_Gather, MPI_Bcast and MPI_Scatter, which are functions that require communication between the cores, from the root core to the other cores and so on. So now I will talk a bit about the benchmarks I have been doing. Feel free to comment on this, because I'm not really sure about the numbers I'm getting.
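The single-producer/single-consumer property is why these queues need no locking: each directed core pair has its own queue, so only one core ever enqueues and only one ever dequeues. A toy single-threaded model, with invented names and `VecDeque` standing in for the shared-memory ring:

```rust
use std::collections::VecDeque;

// Toy stand-in for a per-core-pair queue. With exactly one producer
// (the sending core) and one consumer (the receiving core), exclusive
// access never has to be arbitrated.
struct CorePairQueue {
    ring: VecDeque<u32>,
}

impl CorePairQueue {
    fn send(&mut self, v: u32) {
        self.ring.push_back(v); // only the sending core calls this
    }

    fn recv(&mut self) -> Option<u32> {
        self.ring.pop_front() // only the receiving core calls this
    }
}

fn main() {
    // One queue per directed pair, e.g. core 0 -> core 1.
    let mut q01 = CorePairQueue { ring: VecDeque::new() };
    q01.send(7);
    assert_eq!(q01.recv(), Some(7));
    assert_eq!(q01.recv(), None);
}
```

For n cores this costs n*(n-1) queues instead of n shared ones, trading memory for the absence of synchronization, which matches the design trade-off described in the talk.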
What I did was to choose a set of well-known benchmarks, the OSU micro-benchmarks, which are used for benchmarking different implementations of the MPI standard. I picked two of them, osu_barrier and osu_allreduce, which just stress one function each. For example, osu_barrier stresses the MPI_Barrier function, which synchronizes the instances of an MPI application, just a software barrier, let's say. The other one, osu_allreduce, stresses the MPI_Allreduce function, which sends a vector to the root core, processes something, and gets the result back to the other cores or instances. I compared with Linux bare metal and Linux in KVM. I picked a machine from Equinix, an AMD EPYC with 24 cores, and the host I used for the VM was an Ubuntu with isolated cores. KVM was the hypervisor; the host was Ubuntu and the guest was Fedora 38. In this particular case I used one huge VM with 16 cores. Maybe it's not the most common case; MPI people would have several nodes instead of putting everything on the same one. In my case I was trying to play with this, so I decided to use one huge VM, let's say, and then compare with Toro. This is how I launch the benchmark, using 16 threads for example. I'm not an expert in MPI, and I'm not really sure whether mpirun here is really using one core per thread; it would not be optimal otherwise, I think. And I was launching 1000 iterations. So these are the results for Linux in KVM, the numbers for osu_barrier. You can see that there is quite a huge difference between the Linux VM and the unikernel.
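What osu_barrier stresses can be mimicked in miniature with a thread barrier. This is a sketch of the MPI_Barrier semantics using invented names and `std::sync::Barrier`, not MPI itself:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

// n workers all block on a barrier; none proceeds until every
// instance has arrived, which is the semantics MPI_Barrier provides
// across MPI ranks.
fn run_barrier(n: usize) -> usize {
    let barrier = Arc::new(Barrier::new(n));
    let passed = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let b = Arc::clone(&barrier);
            let p = Arc::clone(&passed);
            thread::spawn(move || {
                b.wait(); // the synchronization point being benchmarked
                p.fetch_add(1, Ordering::SeqCst);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    passed.load(Ordering::SeqCst)
}

fn main() {
    assert_eq!(run_barrier(4), 4);
}
```

The benchmark simply loops this synchronization point many times and reports the mean latency, which is why the cost of syscalls and context switches under each kernel dominates the result.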
But I still have to redo these numbers, I'm not really sure about them, because there is at least one order of magnitude of difference. At the beginning I was interested in comparing with Linux bare metal, because I think we can achieve something like that in KVM, but when I started to play with the Linux VM I saw there was already a huge difference with the VM. I was also comparing with osu_allreduce, as I said before, with that size of vector, and there's also quite a huge difference with the unikernel. In the two cases there are 16 cores, in the VM and in the unikernel too. And I think that's all about the benchmarks. Getting these numbers I also figured out some issues, I don't know if you have been measuring things in VMs: in particular, in KVM the TSC is not fully emulated, so you have to be careful when you use it. When you take your numbers, you have to check that the TSC is still in sync; if not, taking differences is not always going to work, so you have to be careful about that. That's all, I think. Questions? The question was why I'm not doing communication between VMs using this implementation. Basically, this implementation can only run on a single node, but people are using MPI on clusters with tens or hundreds or thousands of nodes. Do you have any plans to extend that? Well, I'm thinking about that, because it's not the first time someone mentions this. Maybe create an interface, I mean use virtio-net or virtio-vsock, to communicate with other instances; you would have multiple VMs running that. But for the moment, maybe I will do it soon; I'm not really worried about that. Other questions? Which MPI implementation are you basing this on? Because there are different versions of MPI, OpenMPI or MPICH and so on. Which one are you based on? I'm not really sure, because what I'm doing is just trying to follow the semantics of MPI.
I'm trying to implement it as code. The functions I'm implementing are based on what the benchmark needs. That's all, no more than that. Do you have numbers when you increase the number of nodes? You mean, do I have numbers when I increase the number of nodes, how that behaves? Yeah. I'm still doing those numbers. The difference is still there between the VM and the Linux implementation, still a difference in the sense that it's faster, let's say. I'm still working on those numbers too. The question is whether I understand why we have that difference. I don't know. There are a lot of ways to tweak Linux to make it more performant; maybe I'm lacking that, and if you tweak the configuration, maybe you're going to dramatically drop that difference. I'm not really sure where it's coming from. But as I said before, these are still numbers that I'm working on. Okay, I think we are running out of time. So thanks again for the talk. Thank you. We have a short break for five minutes, and after that we will have the next talk. Thank you.
News from the Hermit Crab — From Soundness Foundations to GPU Virtualization
Go Martin, go! Okay, I guess. So let's get this started. Wow. Okay, thanks everyone for coming. I'm Martin from RWTH Aachen University, and I'll talk about the Hermit operating system. I'm here together with my colleague Jonathan, and a few students are also scattered around the room. Yeah, let's get started. These are the things that I'll talk about today. First, a general introduction to Hermit and unikernels, although if you've been in this room for the past few hours, you already know some of that. Then I'll cover some arguably interesting internals, structurally, and then talk about two applications, namely GPU virtualization using Cricket, and application and kernel profiling. Okay, we've been through this a few times now, but let's go through it again. Compared to a standard VM, where we have the hardware and a host operating system, which might also be missing if we have a type 1 hypervisor, and a hypervisor, we have this virtual machine. And this virtual machine runs a virtual machine image, which... What's happening? Okay, this virtual machine image is just a full-blown operating system with its own guest kernel, user space, and everything else. Then we've also talked about containers before, which throw away the guest kernel and really try to minimize the image for the application, and we have unikernels, which run in virtual machines again, but inside the unikernel everything is packed together as tightly as possible. We have the application, we have some user-provided libraries, and we have the library operating system, all statically linked together. What this gives us is an image that we can really specialize to the use case at hand, that means for the environment, namely the hypervisor, and for the application itself and what it should do. This leads to tiny images, only a few megabytes in size for Hello World, for example.
And since we only have one process in this whole unikernel image, we don't need any isolation between this process, other processes, or the kernel. That means we can do this as a single address space operating system without any costly address space context switches. We can run everything at kernel level, have no privileged context switches, and can turn system calls into plain function calls. And that's pretty cool. Enter the Hermit operating system. As you can probably guess by the logo, it is written in Rust. Well, not 100%, but there's no C in there, at least; only Rust and a bit of assembly, of course. We mainly target Rust applications, too, so we have an official tier 3 Rust target that Rust applications can use. But we also have a GCC and Newlib fork if you really want to run C applications, though that's not our primary focus. We have multi-core support, we are easily configurable, and we can now also compile on Windows. We can also support stable Rust nowadays, through our own distribution of the Rust standard library, which you can check out here. Okay, let's talk about platform support. Once we have this image seen on the left, where we have the application, standard library, Newlib, and the kernel, we can then run it on our own hypervisor, for example. Uhyve is a hypervisor specialized to running Hermit unikernel images, which is the focus of Jonathan. The main target for that is Linux KVM on x86, though there's also some degree of support for macOS on both x86 and ARM. Also upcoming, though not yet merged, is Linux KVM support for RISC-V, which is something that Simon worked on. Philip, sorry. We can also target generic VMs through our Hermit loader, which then chain-loads the Hermit ELF image. We support Multiboot on x86, we support Firecracker, and there's also UEFI work going on, which will be there soon, hopefully.
For ARM and RISC-V, we use the Linux boot protocol to be able to run on things like QEMU. Okay, so that's all you need to know if you want to use Hermit. Let's take a look inside. This is the same unikernel image again, but from a different point of view now. The left stack is the application stack: the application, some user-defined libraries, Rust crates in this case, and the core crates of the Rust toolchain itself, so std, alloc and core. On the right side, we have the Hermit kernel, which depends on some crates as well, and on alloc and core. These two things are compiled for different targets, though, because we don't want to use any floating-point operations in the kernel target, since that is costly to switch between. The user code is compiled for a special Hermit target, which does have floating-point support and also tells the Rust standard library how to communicate with the Hermit kernel. Together with the Hermit kernel, but compiled for the user target, we also provide some intrinsics, such as libm for math functions, or mem intrinsics for things like memcpy, which really benefit from having floating-point support available. One thing that I personally worked on a lot is soundness foundations; you can see unsafe and safe Rust on the right. We published a paper on that, called "On the Challenge of Sound Code for Operating Systems", and what this basically aims for is to make the Hermit target sound. That means any safety reasoning must not require context. That's extremely important, and the history behind it is that Hermit was once written in C, without much strictness around the locality of this kind of reasoning, and we put a lot of work into migrating to a more Rust-like approach here. One thing that came out of this is hermit-sync, which is a collection of synchronization primitives used inside the Hermit kernel.
Most of these are also independently published as single crates and republished through this crate, so you can also pick whatever you like for your own project. Another thing is count-unsafe, which you can use to count the amount of unsafe code inside your Rust project, and which we use to analyze our progress there. The next thing I want to talk about is our evolving network stack. Originally, it was a user-side thing: Rust applications would compile in a network stack with smoltcp, a Rust network stack, and C applications would use lwIP, as Unikraft does. In 2022, we moved that from user space into kernel space, which is not that meaningful since everything is kernel space anyway, but we moved it into the distribution of the kernel. Then we implemented support for BSD-style sockets, because before we had a custom-made API for networking, and now we want to standardize and adopt these things, because that allows us to throw away all the user-space network stacks, so that both C applications and Rust applications can use the kernel-provided smoltcp network stack. In 2024, we are going for mio support for async I/O, which would enable us to run a whole bunch of Rust networking applications, which usually run on Tokio or something like that; work on this is already well underway. Okay, then let's talk about the two application-focused things. First, GPU virtualization with Cricket. A short introduction to Cricket, which is another project developed at our institute, ACS. It's basically just plugging networking in between some API. Classical CUDA GPU applications work as seen on top, where we have this CUDA app that calls the CUDA APIs, a library from NVIDIA, which then performs the actual computations on the GPU. With Cricket, we plug a Cricket client next to the app and a server next to the CUDA APIs, and then just tunnel through all requests and answers.
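The payoff of adopting BSD-style sockets is that ordinary `std::net` code runs unchanged against the kernel-provided stack. A plain Rust echo round-trip of the kind that should "just work" (this example runs on any host with a loopback interface; it is not Hermit-specific):

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

// A self-contained TCP echo: the server thread accepts one connection
// and echoes back what it reads; the client writes and reads it back.
fn echo_roundtrip(msg: &[u8]) -> Vec<u8> {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let n = msg.len();

    let server = thread::spawn(move || {
        let (mut stream, _) = listener.accept().unwrap();
        let mut buf = vec![0u8; n];
        stream.read_exact(&mut buf).unwrap();
        stream.write_all(&buf).unwrap();
    });

    let mut client = TcpStream::connect(addr).unwrap();
    client.write_all(msg).unwrap();
    let mut back = vec![0u8; n];
    client.read_exact(&mut back).unwrap();
    server.join().unwrap();
    back
}

fn main() {
    assert_eq!(echo_roundtrip(b"ping"), b"ping".to_vec());
}
```

On Hermit, the same `std::net` calls are routed through the Hermit target of the standard library to the kernel's smoltcp stack instead of a user-space one, which is exactly the standardization the talk describes.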
That separates these two things, and we can move them wherever we want and control what's happening there. And we found it's not that high of an overhead. We can then use this for remote execution, scheduling, or monitoring of GPU applications, as seen here: we can have several nodes with virtual GPUs, which then run on another node for computation. We then adapted Cricket for unikernels, and published a paper on that. How we did this: Cricket is based on ONC RPCs, which came out of Sun way back when. The reference implementation is old and complex and uses Linux-specific networking features, so it wasn't easy to port to our Rust toolchain, for example. And as you can already guess, we ported it to Rust. Our user code then runs inside the unikernel, and only the server part serving the GPU does not run inside the unikernel. We did this for Hermit and Unikraft. For Unikraft we had to develop Rust application support first, but we did that and now it's working fine. The last topic that I want to talk about is application and kernel profiling. It's a project that has been dormant for a while, but we are reawakening it, getting it up to date and working again. It's called rftrace, for Rust Function Tracer. Essentially we want to find out how much time is spent in which functions when we run software. Instrumentation does this by changing the code that is output by the compiler. We are essentially changing the program that we measure, which falsifies the results a little bit, but in exchange we get extremely reliable data, because we measure each and every time frame inside a function. It works like this: we have our Rust source, which squares some number. That corresponds to this assembly for Intel architectures. If we just append the corresponding flags for the compiler, the compiler nicely inserts a call to a special mcount function.
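What the inserted mcount calls automate can be imitated by hand. This is a hand-rolled sketch with an invented helper, not rftrace's actual mechanism: record a timestamp on entry and on exit, and report the delta.

```rust
use std::time::Instant;

// Manual stand-in for -pg / mcount instrumentation: wrap a function
// call, timestamp entry and exit, and print the elapsed time.
fn traced<F: FnOnce() -> R, R>(name: &str, f: F) -> R {
    let start = Instant::now(); // what mcount captures on function entry
    let out = f();
    // what the return trampoline captures on function exit
    println!("{name}: {:?}", start.elapsed());
    out
}

fn square(x: u64) -> u64 {
    x * x
}

fn main() {
    let y = traced("square", || square(12));
    assert_eq!(y, 144);
}
```

The compiler-based approach does this for every function without touching the source, and the return trampoline replaces the need to wrap call sites by hand, but the entry/exit timestamp pair is the same idea.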
What this mcount function does is inspect the stack to find out which function we are currently in. It can then take a timestamp, and it can also insert a return trampoline into the stack, so that it also knows when we leave the function again. All of this together then lets us measure the time spent in functions, which is cool. In the image it looks like this: rftrace is just another static library inside the whole image. It works for Rust programs, C programs, and also for mixed images, obviously. It is very encapsulated, so it exposes only a few symbols like mcount, and does everything else internally. When we capture such a trace, we can then look at it, have a trace replay, and really see which functions we enter and how long we spend inside them. We can also look at these graphically, of course; there are tools available for trace visualization. You could also create flame graphs out of this and then optimize the kernel. We are looking forward to using that for further optimizing the network stack, for example. All in all, I think that is all I have to say for today. That is a broad overview of the different topics that we covered last year. You can check us out on GitHub, you can say hi on Zulip, and with that, I thank you for your kind attention. Thanks, Martin, for the talk. We have a working mic, so we can have some questions. Five minutes. Hi. My question is: how do you instrument the Rust code, how do you actually get the function calls in there? The instrumentation inserts some calls into the Rust code, usually; how do you get those calls in there? There is a compiler flag for that. For C code, it is much simpler: you would just compile with GCC and then say -pg, I think. For Rust code, it is more complicated. Well, it is not more complicated, it is just more lengthy.
I did not put it on the slide because it was two lines or something. But those are features available to us through LLVM. Rust work is on the way to make this easier, because it is not a stable thing exposed by the Rust toolchain, but through manually enabling the corresponding LLVM passes for the code, this works. Thank you. More questions? I had a similar question. We also have a track on profiling, benchmarking and Unikraft. You are using instrumentation for profiling. Are you also considering sampling profiling? For example, what we are doing with Unikraft is trying to tie in VMI, virtual machine introspection. That will be able to do some sort of snapshotting and the others. Is this enough? Also, in Unikraft you have gcov support now, because GCC 13 has embedded gcov support, so that makes things easier. Is this enough for what you have tested so far, the instrumented approach? Because you have to build the application, you then have to run the instrumented one, maybe it is not ideal in practice. Is this enough at this point? We will have to see. In general, we are not that automated yet compared to Unikraft. Our Rust application story is quite seamless, I think, and you just enable profiling through a simple feature flag, and then you run it and it gets dumped on the disk and you can look into it. This is also what Gaby is working on. Did you consider... I am not sure how ftrace on Linux does it, but for example, there is something called kprobes or kretprobes or something like that, which is a dynamic way of instrumenting the calls. What that gives you is you don't have to have these items done at build time, so that means when you want to instrument the application, you can tie in some flags and then, while you execute it, it replaces some sort of function prologue with some sort of jumps. Interesting. That may be something interesting to look at. We are looking at that on Unikraft's side.
Is this like inserting a general hook into every function and then dynamically changing it? Gaby knows a bit more about that. It is a bit of a rewrite of the function prologue at load time. Basically, you have a function that you want to jump into, and then you can replace the whole function that you want to jump into. Similar to that, just by hand, and for some functions only, and switchable. Okay, makes sense. Still very cool with the flame graph. I mean, this is the most important item, because everyone does profiling, but having some sort of visual way of determining where time is actually being spent, that's really useful. Yeah. We have to switch to another talk, so Martin will be around for more questions. Thanks again.
Support Dynamically Linked Executables via Linux ld.so and Implement ENA Driver to Expand Application of OSv
Hello, everybody. Can you guys hear me? Hello. Cool. My name is Waldek Kozaczuk. I'm one of the few OSv committers and I'm here to tell you about the latest enhancements made to OSv since my last presentation at FOSDEM a year ago. So, first off, I want to apologize for this very long title. Actually, most of my talk is really going to be focused on the first part, but I'll also try to mention a little bit about the other things. So, in today's presentation, I will talk about the enhancements to support statically linked executables and dynamically linked executables launched by a Linux dynamic linker. I will also briefly describe the implementation of the ENA driver to support AWS Nitro. In addition, I will preview the new Xconfig-based mechanism to allow further customization of OSv. Finally, I will talk about the upcoming 1.0 release and beyond. Most applications do not make system calls into Linux directly, as we know. Instead, they do it indirectly by way of calling libc functions that delegate to the syscall instruction, or SVC on ARM. On Linux, for example, dynamically linked executables are launched by the program interpreter, ld.so, which memory-maps the executable ELF along with the other ELF files it depends on, like libc.so, libpthread.so, and so on. It then resolves undefined symbols like puts or pthread_create and finally invokes the main function. On OSv, the dynamic linker built into the kernel plays the role of the program interpreter and performs similar steps as on Linux. But instead of loading the aforementioned libraries, it resolves the undefined symbols by pointing them to the OSv implementations of those. The OSv linker supports both shared libraries and dynamically linked executables that are either position-dependent or position-independent. The benefit is that programs interact with the OSv kernel using fast local function calls, without the overhead of the syscall instruction.
On the negative side, the Linux compatibility is a moving target, because libc keeps adding new functions, and on the OSv side we have to keep implementing them. This slide here illustrates how dynamically linked programs would traditionally interact with the OSv kernel. The drawing shows an executable's procedure linkage table, PLT, on the left side, and the dynamic linker and libc implementation that are part of the OSv kernel on the right side. In this example, after the dynamic linker memory-maps the program into memory, or more specifically the ELF segments, it then sets up the PLT to later resolve and replace the puts function call placeholder with the address of its implementation in the OSv kernel, which typically happens upon the very first call. Now, statically linked executables interact with the Linux kernel by directly making system calls and reading from pseudo file systems like procfs and sysfs. Initially, OSv implemented a fairly small number of system calls, around 70, to support running Go programs. Those were interesting because they would call libc functions to create threads, for example, and execute system calls to do other things like, for example, the socket API. But this was not enough to support statically linked executables. To make this possible, we had to implement some key new system calls like brk and clone, and add a substantial number of other ones to bring the total to 137 at this point. However, the most tricky part was adding support for the application thread-local storage, so-called TLS. The dynamically linked programs that run on OSv in the traditional way would share the thread-local storage with the kernel and allow OSv to fully control the setup of TLS. The statically linked executables, on the other hand, want to allocate their own TLS and set the FS register on x86_64, or TPIDR_EL0 on ARM, to the thread control block address for each thread.
On x86_64, the solution was basically to utilize the GS register to point to a per-CPU structure with a copy of the application TCB, and basically update it on every context switch. On AArch64, we did a similar thing. Now, the point of this enhancement is that we basically improved the Linux compatibility, because now we don't have to worry about these cases where, for example, an application tries to call functions in libc that OSv doesn't implement. But the drawback, obviously, of the system call interface is that we pay the overhead of the syscall instruction every time, which on average I measured at around 110 nanoseconds on x86_64. This picture actually illustrates what happens behind the scenes. So on the right side, actually, the OSv dynamic linker still plays some small role. It still memory-maps the segments of the ELF. It reads the headers, obviously. But then, really, it just jumps to the start of the ELF. And from this point on, the interactions between the program and OSv happen simply through the syscall instruction. The exciting side effect of enhancing OSv to support statically linked executables is basically the capability to run dynamically linked executables via the Linux dynamic linker instead of the OSv built-in one. The Linux dynamic linker, ld.so, is a statically linked, position-independent shared object that is loaded and processed by the OSv kernel in exactly the same way as a static executable is. On Linux, the dynamic linker would be launched implicitly, right? Simply by introspecting the INTERP program header. On OSv, we have to launch the Linux ld.so executable explicitly and pass its path along with the arguments, as you can actually see in this run.py example script. So we're passing the absolute path to the Linux dynamic linker, and then we're adding the path of the executable and any arguments.
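The syscall-versus-local-call trade-off mentioned above is easy to observe from any language. The sketch below is a hypothetical demonstration, not part of OSv: it times an ordinary in-process function call against `os.getppid()`, which issues a real system call on Linux, standing in for the syscall path the talk describes.

```python
import os
import timeit

def local_call():
    # An ordinary in-process call, analogous to OSv's fast local
    # function calls into the kernel's libc implementation.
    return 42

n = 100_000
t_local = timeit.timeit(local_call, number=n) / n
t_syscall = timeit.timeit(os.getppid, number=n) / n

print(f"local call:  {t_local * 1e9:8.1f} ns")
print(f"system call: {t_syscall * 1e9:8.1f} ns")
```

The absolute numbers include Python interpreter overhead, so they will be far above the ~110 ns quoted for the raw syscall instruction, but the relative gap between the two paths is the point of the comparison.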
So obviously, just like with statically linked executables, there is the same benefit that we are now much more compatible with Linux, because one can take any application that works on Linux with glibc and it should work on OSv, just because when we build the image, OSv is going to load glibc and use it like any other library that the given application needs. The drawback is the same, because we are again paying 110 nanoseconds for every syscall instruction. And this slide again tries to illustrate the interactions between OSv and the application. As you can see, on the right you have the OSv kernel. On the left, the application and the Linux dynamic linker, which is executed just like with static executables. And then it loads the application ELF into memory by using the mmap system call, and then also executes the application itself and loads any libraries. And from this point on, all the interactions happen with syscall instructions. Now, to help analyze and troubleshoot statically linked executables, or dynamically linked ones launched in this new way, we have added a new diagnostic tool called strace, which is obviously similar to what one can do on Linux. In essence, one can specify all interesting trace points using regular expressions. In this example, to monitor system calls, you just add syscall*, and you enable the strace system thread that basically prints all the trace point calls to the standard output as they get hit while the program runs. How many minutes do I have left? Seven minutes. So to recap what I have talked about in the previous six slides: in the first two, I described the traditional way of running dynamically linked programs on OSv, which benefit from fast local function calls but may suffer from compatibility issues. In the next two slides, I explained the new enhancements to allow running statically linked executables.
And finally, in the last two slides, I covered a new alternative way of running dynamically linked programs launched by the Linux dynamic linker on OSv, which again may suffer from a tiny overhead of handling system calls, but benefits from much better compatibility with Linux. In essence, these new enhancements greatly improve OSv application compatibility and should make it possible to run more programs on it. In addition to what I have talked about so far, we have also implemented a better version of the AWS Elastic Network Adapter driver. In essence, we basically took the FreeBSD implementation by AWS and made it work on OSv, and we tried to minimize the changes so that we can backport any possible future fixes, and disabled a lot of stuff that simply does not apply to OSv. The resulting driver cost us around 7,000 lines of mostly C code, and 56 kilobytes of added kernel size. The challenge obviously was testing, because that can only be done on a running Nitro instance in AWS. So far, the driver seems to be pretty stable and seems to yield decent performance. I've tested it using iperf3, netperf, and a simple HTTP server application. As you may have guessed, the ENA driver implementation is enough to run OSv with RAMFS on a Nitro EC2 instance. And so there's actually a script that I wrote to simplify uploading the OSv image, creating a snapshot, and basically creating an AMI. And one thing, obviously: to run OSv on a Nitro instance with a non-volatile file system like ZFS, or hopefully EXT in the future, we need an NVMe driver implementation, for which there are actually two pull requests from the community at this point, but they haven't been merged yet. They need some love. In my previous presentation at FOSDEM, I talked about kernel modularization and driver profiles.
This year I will briefly describe a new feature that takes modularization to the next level, and which has been greatly inspired by Unikraft. In essence, the goal is to use the Linux kernel build configuration tool, Xconfig, to let the user select OSv components to be included or excluded, and various parameters to configure it. The makefile would then simply act on a generated config file, exclude relevant object files, and pass any configuration parameters to the source files. And this is obviously very much work in progress. And obviously, unlike Unikraft, where all the elements are effectively Lego blocks, with OSv we pretty much have to do the opposite. We have to sprinkle the source code with all these ifdefs. And this is just an example of what kind of modules or parameters can be modified. And as an example of what can be accomplished with this new feature: by hiding all the symbols but those used by the application, excluding all unnecessary components, and changing the values of various configurable parameters as listed on the slide, one can build a kernel image of 788 kilobytes in size, and run a hello-world app using 1.2 megabytes of memory. When I started optimizing the OSv kernel like five years ago, the kernel itself was at least 10 megabytes, and it required a minimum of 30 megabytes of memory. So it is almost a 10-fold improvement. Well, I'm sure not as small as Unikraft, but maybe we can squeeze it to half a megabyte. As I am moving toward the end of my presentation, I just wanted to mention that we are also planning to cut a new release, OSv 1.0, which should include all the features that I've talked about. And I hope that we're going to be able to implement the EXT file system, merge the IPv6 implementation branch, and potentially implement the NVMe driver.
I'm especially excited about the EXT file system support because I think it will make it easier to build images on Linux, and then introspect them, for example, if something happens afterwards. So beyond the upcoming release, we're planning to revamp Capstan. Capstan is effectively like kraftkit, but it hasn't really been enhanced in any way, or even taken advantage of any recent features of OSv. So we're planning to basically revamp it and make it really easy to use, basically to help application developers use OSv. And then in addition, we're planning to work on some of the security aspects, like ASLR, which requires making the kernel relocatable, and some optimizations. And finally, we are planning to make OSv run on AWS Graviton, but that requires UEFI and some other things. And with that, I would like to thank the organizers for inviting me to this conference to tell you about OSv. I would also like to thank ScyllaDB for sponsoring my OSv work, and Dor Laor for words of encouragement, and Nadav Har'El for being my mentor, reviewing hundreds of patches, and implementing other enhancements. And finally, I would like to also thank all the community contributors to the project. On this slide, you can find some links about OSv, and thank you for your attention. I'm not sure if we have time for any questions. Time for questions. We have time for one burning question, if there is one. You wanted? Yeah, go ahead. This is about your work on Linux compatibility. How are you handling new APIs, such as io_uring, and similar applications? Your question was how do you add new applications? No, no, with the Linux APIs that you implement: how are you handling io_uring and similar APIs? So how am I consuming new Linux APIs? I don't know. How are you handling applications which do make use of those?
So basically, this happens the way I described: typically, if the application is launched in the traditional way, OSv simply resolves all the application symbols, like the libc symbols, and simply redirects them to the OSv implementations of those libc functions. If I haven't answered your question, then we can meet afterwards and I can address it better. Thanks again for the talk. Thank you.
[Protocols] Things we wish we knew before starting an IMAP library
Hi, thank you for being here so early to hear about such an old protocol. So we're going to talk about IMAP. We've both started writing some IMAP libraries and we want to share our experience with that. We've hit a few issues along the way, a few surprising things. Hopefully this can help you if you want to deal with IMAP as well. So, I'm Simon. I'm working on the Go libraries, and he is Damien. Hi, I'm the maintainer of imap-codec. Yeah. So the first thing you might wonder is what is IMAP useful for? So maybe some of you know that IMAP is used to fetch messages from a mail server. So if you have a mail client and a list of messages shows up, this is fetched via the IMAP protocol. IMAP lets you organize messages into mailboxes. So mailboxes are what regular people call folders. So inbox, archive, spam, drafts, all of these are mailboxes for IMAP. The main advantage, the upside, of using IMAP compared to older protocols is that it's possible to synchronize from multiple clients and devices. So for instance, I can start writing a draft on my laptop and then continue later on my mobile phone and send it from my mobile phone. That's possible with IMAP. What's the basic way you interact with IMAP? So it sounds pretty simple at first. You open a TCP connection, ideally with TLS, and implicit TLS rather than STARTTLS. And then you write a command and then you get back some responses from the server. So it sounds simple. Here's a very simple example. Here's an example of a login command where you specify your username and your password. And then after that you get an OK response from the server if the password is correct and the login is correct. So something interesting to note before going to the next slide is that... I'm sorry. I'm going to do this, no problem. So something interesting to note is that there's a CMD1 right before the login command here. So this is what we call a tag and it's used...
It's an arbitrary string sent by the client, and it's used to match up the server responses with the client's requests. So it's just a string echoed back by the server. So the client knows that the OK response is for the command, this particular login command it sent before. OK. Here's a more complicated example with a fetch command, which is used to fetch messages from the server. So here the client sends a fetch command and asks for the message flags and message envelope. The envelope typically contains the subject and the recipients and stuff like this. And then the server sends back some replies, some responses, with the first message having the flag \Seen, so it's not unread, and it has been marked as important. And then the envelope is very big, so I omitted it here. And the second message has no flags. And when the server is done sending all the data, it ends with an OK response. Something worth noting is that here in the middle, you might notice that the command tag is not included. There's a wildcard, an asterisk, instead. So this will have consequences later. If you ask for data, it's complicated to know which command a reply was for, and if it was for a command at all. We'll see more on this later. In the fetch command here at the start, you might notice the 1:*, the one-colon-wildcard. This is the way you specify which messages you want to fetch. And we'll see how we do this in the next slide. So how do we refer to a particular message? There are two ways. Both ways use a 32-bit unsigned integer. So the first way is with something called UIDs. UIDs are a unique ID which doesn't ever change, except when it does. It increases when a new message is added to a mailbox. So if the last message in the inbox has UID 42 and you receive a new one, then it will get UID 43. So the second way is with message sequence numbers. It's an ordinal number.
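The tag-matching flow just described can be sketched in a few lines. This is a toy illustration, not any real IMAP library: tagged lines complete a pending command, while `*` (untagged) lines are server data that may arrive at any time. The names `pending`, `send` and `receive` are invented for the example.

```python
# Toy IMAP response router.
pending = {}          # tag -> command name
unsolicited = []      # untagged ('*') responses

def send(tag, command):
    pending[tag] = command

def receive(line):
    token, _, rest = line.partition(" ")
    if token == "*":
        # Untagged data: not tied to any particular command.
        unsolicited.append(rest)
        return None
    # A tagged OK/NO/BAD completes the command with that tag.
    command = pending.pop(token)
    return command, rest

send("CMD1", "FETCH")
receive("* 1 FETCH (FLAGS (\\Seen))")      # server data, no tag
result = receive("CMD1 OK Fetch completed")
print(result)   # ('FETCH', 'OK Fetch completed')
```

As the talk notes, the hard part in practice is deciding what the untagged responses mean, since they are not attributed to any command.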
So if you use sequence number one, it means the first message in the mailbox; sequence number two, the second message in the mailbox, and so on. And it goes the same way as UIDs: the oldest message added to the mailbox is the first one. So something interesting is that sequence numbers get reassigned by some operations. For instance, if a message is deleted from a mailbox, then the sequence numbers shift a bit. So here's an example of a mailbox with three messages, one with UID 4, one with UID 6, one with UID 12. And if the UID 6 one is removed from the mailbox, then the first message stays the one with UID 4, but the second message is no longer the UID 6 one; it's now the UID 12 one. So the meaning changes depending on the state. Another detail is that message data is immutable. So if you fetch message contents, they will never change. If you want to edit a message, you need to re-upload it and then delete the old one. So this was to refer to a single message, and we can also refer to multiple messages with something called a sequence set. The simplest set is just one message, so here just sequence number one. Here's another example with a colon: you can say messages 2 to 4, inclusive. You can specify multiple ranges like this, like 2 to 4 and then 6 to 10. And the last one is 1 to wildcard. It means 1 until the end, until the last message. That's it for the IMAP introduction. Now we can go into the meat of the presentation. Do you want the microphone? Is it on? Okay, so let's go through all these layers. The first layer is types. So what's there to tell about types? A few things. Probably your journey as an IMAP developer will start as either a client or a server developer. So it's kind of tempting to try to implement only half of the standard, and to a certain extent this is possible, because as a client developer you can implement command serialization and response parsing only, and as a server developer you can implement command parsing and response serialization only.
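Going back to sequence sets for a moment: the grammar described above (`1`, `2:4`, `2:4,6:10`, `1:*`) can be expanded with a short sketch. This is a simplified illustration of the RFC 3501 seq-set rule, not a full parser; `parse_sequence_set` is a made-up name.

```python
def parse_sequence_set(s, max_seq):
    """Expand an IMAP sequence-set like '2:4,6:10' or '1:*' into
    sequence numbers. '*' means the highest number in the mailbox."""
    result = []
    for part in s.split(","):
        lo, sep, hi = part.partition(":")
        lo = max_seq if lo == "*" else int(lo)
        hi = lo if not sep else (max_seq if hi == "*" else int(hi))
        # RFC 3501 allows range endpoints in either order.
        result.extend(range(min(lo, hi), max(lo, hi) + 1))
    return result

print(parse_sequence_set("2:4,6:10", 12))  # [2, 3, 4, 6, 7, 8, 9, 10]
print(parse_sequence_set("1:*", 3))        # [1, 2, 3]
```

Note that the meaning of the result depends on mailbox state: the same set names different messages once sequence numbers have been reassigned by an expunge.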
You can kind of pick only half of the routines that you would need. But the IMAP standard has quite a bit of overlap between commands and responses. So there are many types, many parsers, and many serializers that you need to define. So you won't end up implementing 50% of the standard anyway, but more like 70, so to say. So my suggestion would be to structure your code so that you can easily extend it to the other side afterwards, for example using a shared module. And if you are lucky and someone provides the missing side to you, and you have parsing and serialization handy, you can do kind of cool stuff, because you can first generate a random message and then ensure that parsing and serialization are inverse to each other by doing randomized tests. So there's a pretty powerful kind of unit test for it. At least for me it helped a lot, as you can see at the bottom. Complicated stuff. Complicated bugs. Yeah, perfect. Okay, regarding syntax, oh my. I will quote Mark Crispin from the imap-protocol mailing list, because I think it's not that bad, but you need to be in a certain state of mind when doing it. Alright, let me see, I'm a bit tired today. But first and foremost: the formal syntax should be your holy book. If any part of the text distracts you from the formal syntax, you should ignore it in favor of the formal syntax. Your eyes will glaze over and your jaw will drop. You will start saying no, no, no. Just work through that stage. It's a steep hill to climb, but once you make it to the top you will see everything with crystal clarity. And remember, no matter what you do, do not try to implement any command or response by looking at the examples. That's what Mark said, so he's right. I would add that before reading the formal syntax you need to learn ABNF, and I mean you need to learn it by heart, because there are some subtle things you need to be aware of.
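The randomized round-trip testing idea mentioned above can be sketched on a deliberately tiny, hypothetical grammar: a parenthesized flag list. The point is the shape of the test (generate a random value, then assert that parse and serialize are inverses), not the grammar itself; all names here are invented for the illustration.

```python
import random

# A miniature made-up grammar: "(\Seen \Flagged)" style flag lists.
FLAGS = ["\\Seen", "\\Answered", "\\Flagged", "\\Deleted", "\\Draft"]

def serialize(flags):
    return "(" + " ".join(flags) + ")"

def parse(text):
    assert text.startswith("(") and text.endswith(")")
    inner = text[1:-1]
    return inner.split(" ") if inner else []

# Randomized round-trip test: parse(serialize(x)) must equal x.
random.seed(7)
for _ in range(1000):
    flags = random.sample(FLAGS, k=random.randint(0, len(FLAGS)))
    assert parse(serialize(flags)) == flags
print("1000 randomized round-trips passed")
```

Real IMAP types are far richer, so a property-based testing library with shrinking (such as Hypothesis, in Python's case) makes failures much easier to diagnose, but the invariant being checked is the same.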
And regarding lexers and parsers, I think we agreed when talking about these things. IMAP in some places gives the impression that there are things like tokens, say, arguments, implying that there could be some generic argument. I had a very hard time figuring out what a token should be. So there are no words on what constitutes a token, and I think Simon in version one tried it and got away from this approach, or used a different approach in version two. So I don't know, maybe someone has a better idea, but for me, you cannot lex the IMAP syntax. And another recommendation: even the syntax has layers. So first of all you have the ABNF core rules that are described in the ABNF standard and referred to in almost any rule. And then you have these IMAP strings, which make everything kind of messy. As an example, you see this is the LOGIN command, looks kind of simple. And then you have this innocent-looking astring thingy in there, which is, for example here, the username and the password. And an astring is in fact one of three types and one of two protocol flows. So an astring means either an atom or a string, more or less, with some IMAP quirks. And if it is a string, it can be a quoted string or a literal. And literals do require special care when implemented. So as a simple example, we will start with "password". It uses only a very simple character set, so you can just write exactly these eight bytes as an atom. If you have a whitespace in it, you need to put quotes around it, and if you have a quote inside the quotes, you need to escape the quote. So it is similar to most programming languages. And if you have a literal, obviously if you have a newline in there, this would be the obvious case, you need to use this prefix here in curly braces, and then you just send exactly the bytes that make up your string after a newline. With a twist, as we will see. What we will gloss over today are ambiguities and defects, and I had a few discussions already about this one.
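The three astring encodings just described (atom, quoted string, literal) can be sketched as a chooser function. This is a simplified illustration of the rules from the talk, not the full RFC 3501 character-class logic; `encode_astring` is a made-up name, and the atom check here is stricter than the real grammar.

```python
def encode_astring(value: str) -> str:
    """Pick an encoding for an IMAP astring, simplified:
    plain characters go out as an atom, spaces/quotes force a
    quoted string, and CR/LF forces a literal."""
    if "\r" in value or "\n" in value:
        # Literal: byte count in curly braces, CRLF, then raw bytes.
        return "{%d}\r\n%s" % (len(value.encode()), value)
    if value and all(c.isalnum() for c in value):
        return value                      # atom
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped               # quoted string

print(encode_astring("password"))         # password
print(encode_astring("pass word"))        # "pass word"
print(repr(encode_astring("pa\nss")))     # '{5}\r\npa\nss'
```

The literal case is where the "twist" comes in: as the next section shows, sending those raw bytes requires a continuation round-trip with the server.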
So I would very much ask everyone: if you find some defect in IMAP, please report it to us. We really want to start a collection of all of these things. And one thing I finally wanted to say: I quoted Mark Crispin from this thread, but if you now go to the internet, you won't find it. The imap-protocol mailing list, at some point it was not available anymore, due to reasons. And for me, the only lucky thing that happened was that someone I know, the maintainer of the Meli email client, had this super cool online interactive WebAssembly demo, and he used the dump as test data. So that was the only reason I could read it. I guess the thing I want to say here is: let's try to be aware that knowledge is disappearing, and maybe try to resurrect the imap-protocol mailing list, because it's awesome, it's like a treasure trove of information. Okay, then let's go back to framing. So... Oh, everything tangled up. Yeah, I'm back again. So we're going to continue to talk about some higher-level layer: flow and framing. And by flow and framing we mean: how does one split the IMAP stream into separate commands and responses? This seems pretty simple at first. Here's a simple example, similar to what we've seen. Login command at first, and then the server replies OK, and then the client sends a select command, and then the server replies with some data, and then replies OK. So one may think, yeah, it's pretty simple. You just need to split on newlines, and each line is a message, basically. And then literals happened. So here's a slightly more complicated example where the client sends a login command, the username, and then the password is passed as a literal. So first there's the number of bytes, and then on the next line there's the contents. So here what's interesting is that these two lines are a single logical message. The second line here sent by the client is still part of the login command.
Another interesting thing is that in between here there's a plus sent by the server. This is because the server needs to acknowledge literals. So when the client sends the first line here, it says, hey, I want to send a literal with six bytes, and then the server has to reply with this plus, yeah, you can go on, and an optional comment after that. The client needs to wait for the acknowledgement before sending the literal data. Okay, so that's interesting. Let's try to look at only one side of the connection. So here let's try to look at only the client side and see what happens. So we can still make sense of everything here, like login with the literal and the next line, and NOOP. Is this valid, by the way? This looks a bit weird, right? The client sends the username and then announces the literal, and then on the next line here, it sends a completely different command. It's not the password or anything. Is this valid IMAP, even? It turns out that yes, it's completely valid IMAP, because if the server replies no to the first line the client sends, then the client doesn't send the literal. The server says: I don't want your literal. So basically what I'm trying to say here is that it's not possible to parse IMAP just looking at one side, because you can't tell the difference between this case and this case here, where the server rejects the literal. So you need, in your IMAP parser, some kind of feedback from the other side of the connection to know what happened. And so one may think that we don't really need to wait for the server to acknowledge the literal. We can just send the command and the literal in one go and forget about it. The server will probably acknowledge the literal in any case. So here's an example of what could go wrong if you don't wait for the server acknowledgement. Maybe you have a web form on a page which lets the user save a draft in their mailbox.
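The acknowledge-then-continue framing described above can be sketched from the server's point of view: read a line, and if it announces a literal, send the `+` continuation, read exactly that many raw bytes, and keep assembling the same logical command. This is a toy illustration of the mechanism, not a real server; it only handles the synchronizing-literal case, and the function names are made up.

```python
import io
import re

LITERAL = re.compile(rb"\{(\d+)\}\r\n$")

def read_command(stream, send_continuation):
    """Assemble one logical IMAP command, pausing at each literal
    announcement to acknowledge it with a continuation request."""
    parts = []
    while True:
        line = stream.readline()
        parts.append(line)
        m = LITERAL.search(line)
        if not m:
            return b"".join(parts)
        send_continuation()                  # '+' : tell the client to go on
        parts.append(stream.read(int(m.group(1))))  # raw literal bytes

# Simulated client bytes: LOGIN with the password as a {6}-byte literal.
wire = io.BytesIO(b"CMD1 LOGIN username {6}\r\nsecret\r\n")
acks = []
cmd = read_command(wire, lambda: acks.append(b"+ OK\r\n"))
print(cmd)
print(acks)
```

Note how the parser has to pause mid-command and interact with the peer, which is exactly why, as the talk says, a pure stand-alone parser module does not work for IMAP.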
And maybe the literal contains some text like this, which happens to be valid IMAP commands. So if the server rejects the literal, then these lines are interpreted just as regular IMAP commands by the server. And these lines delete everything from your mailbox. So that's not great. And this can potentially be inserted into an HTML email, hidden in the HTML on a single line. And yeah, if you reply to the email, you just lose everything. So yeah, it's pretty scary. So to recap: something I haven't mentioned is that literals can appear basically anywhere. We've seen it in the login command, but it can happen in the search command. There can be many literals for a single command; it's not limited to one. So literals completely interrupt the regular syntax. You have to pause the parser, on the server side or the client side, if you receive a literal, and then wait for the other side to reply, yeah, go on, and then you have to resume the parser. And the literal can be nested into a list or nested into something else. So it's kind of complicated to do, especially if you're using, for instance, a parser generator or something. So we can't parse IMAP just by looking at a single side of the connection, as we've seen. And it's important to wait for the server to accept literals before going on, otherwise security issues ensue. So another aspect of the flows we want to talk about is commands such as authenticate. So authenticate is a command that lets the client use SASL authentication. SASL is a binary protocol, and to authenticate in a modular way, you have several mechanisms. So here's an example of the PLAIN mechanism, which is a simple one with username and password, but there are others as well. So basically the idea is that you take a binary message, encode it to base64, and then send it over. And the interesting thing here is that, so the client sends the authenticate command, the server says go on, you can continue the authenticate command.
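For the PLAIN mechanism mentioned above, the binary message is simply the NUL-separated authorization identity, username, and password, base64-encoded (per RFC 4616). A small sketch, with a hypothetical function name and throwaway credentials:

```python
import base64

def sasl_plain(username: str, password: str) -> str:
    # SASL PLAIN (RFC 4616): authzid NUL authcid NUL password,
    # base64-encoded. This line is sent bare, with no tag and no
    # command name, which is what breaks the normal IMAP syntax.
    raw = b"\x00" + username.encode() + b"\x00" + password.encode()
    return base64.b64encode(raw).decode()

line = sasl_plain("alice", "hunter2")
print(line)                      # bare base64, no tag
print(base64.b64decode(line))    # b'\x00alice\x00hunter2'
```

The empty first field is the authorization identity, which PLAIN allows to be omitted when it equals the authentication identity.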
And then the client sends just base64. Like, what? This is not a regular IMAP command. This is just base64. There's no tag. There's no command name. It's just the base64 data as is. It just interrupts the regular IMAP syntax with something completely different. And IDLE does something similar to this, where the client sends IDLE, the server says go on, and then the client can just send the ASCII string DONE, like the four bytes D-O-N-E. And it's not an IMAP command or anything. It's just an ASCII string. STARTTLS and COMPRESS are kind of similar in the way that when you start these commands, they interrupt the regular IMAP stream and wrap it up with TLS or a compression mechanism. So these are fun to implement as well. So in summary, for the flow section: IMAP demands you to conflate your parsing and business logic with higher-level details. So you cannot have a pure parser in its own little module isolated from everything else. You need to wire it up with the rest of the IMAP library. It's kind of special in this regard compared to other protocols. Okay, now on to operations and semantics. So let's talk about fetching messages again. There are multiple things you can request from the server when fetching messages. A basic example is the envelope, which we've already seen. BODYSTRUCTURE is when you request the MIME structure of a message, a tree of nested parts, if you have attachments, for example. And then to fetch the message body, you can use BODY with square brackets. If you just request BODY[] like in this example, you get the full message body. So here's an example, a very simple message with two header lines and then a simple text. So yeah, if you fetch BODY[], you get everything. If you want to fetch only the header, you can use BODY[HEADER], and then you get only the first two lines. And you can request only the text of the message, so the body text here, with the TEXT modifier. But you can do more complicated stuff as well. Oh my.
Yeah, maybe I'll go very fast on this one. You can fetch particular header fields. You can fetch sections, bytes, substrings of the results. If you have a multipart message, and we have an example with two parts, so the main part, the first sub-part, and the second sub-part with an attachment, then you can fetch only the first part here, the Content-Disposition: inline one. Or, and this one is interesting because it returns nothing, HEADER actually doesn't work in nested parts; you have to use a special keyword called MIME, for some reason. And then if you have a message attached to a message, there's a section of the RFC dedicated to this particular use case. Something everybody does every day, I think, messages inside messages, like Russian dolls. The last thing I want to talk about is unilateral server data. Here's another simple example of a FETCH command where you want to fetch the body of message one, and the server replies, yeah, here's the body of message one. So everything's fine. Now let's say another client happens to mark the first message as important. The way this works in IMAP is that the next time you execute a command, the server replies, here in the middle, hey, by the way, the flags of message one have changed. Even if you didn't ask for it, just before completing the command, it sends this data. So what happens if another client changes the flags of message one and you happen to send a FETCH command right after this happened? Then you get something like this, where the server first replies with the body of the first message, like "hello world", like before. And then you get something interesting: you get another FETCH item for the same message, but something you didn't ask for at all. So... yep. So it's not possible to think of IMAP as "you request some data and you get back some data". It doesn't really work like this.
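One practical consequence for a client is that it should merge every FETCH reply for a message into its cache, rather than keeping only the last one. A minimal Python sketch (the item names mirror IMAP's, but the cache shape is invented for illustration):

```python
def apply_fetch_item(cache: dict, msgno: int, items: dict) -> None:
    # Merge each FETCH response for a message into the cache entry,
    # instead of overwriting with the last reply -- the server may
    # send the BODY[] we asked for and, separately, an unsolicited
    # FLAGS update for the same message number.
    cache.setdefault(msgno, {}).update(items)

cache = {}
apply_fetch_item(cache, 1, {"BODY[]": "hello world"})   # what we asked for
apply_fetch_item(cache, 1, {"FLAGS": ["\\Flagged"]})    # unsolicited update
```

A client that kept only the second reply would lose the body it asked for, which is exactly the pitfall mentioned next.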
You can think of it as: you request some data, and then the server pushes some data to you, whether you want it or not, and you have to deal with it. And as a client, if you ignore all but the last reply from the server for the fetch you asked for, then you won't get the body here. So it's something to look out for. Okay, last topic, extensions. These are a bit interesting. In go-imap v1, I tried to implement extensions as a very modular thing which you can plug in. But extensions turn out to be more like amendments: they fundamentally alter IMAP syntax, flows, operations, everything we've talked about. IDLE and COMPRESS are examples that add completely new flows. So IDLE switches to a completely different mode, and you need to send the DONE ASCII string to switch back. And COMPRESS, yeah, just wraps the connection with something else. And then you have another kind of extension, like extended LIST, which modifies the existing LIST command and adds some arguments in the middle to add more options for clients. The extended search extension changes what the reply looks like: you send a regular SEARCH command and then you get a completely different kind of reply. And then the LITERAL+ extension completely changes how literals work; you get a new syntax that you need to parse. So yeah, this doesn't work at all if you try to implement it as a modular thing. IMAP is completely monolithic; if you want to implement extensions, implementing everything in the same repository will help a lot. All right, that's about it. Unfortunately, we don't have time to talk about everything we wanted, but it should be a good start, I hope at least. Any questions? Thank you very much first. I see a first arm. Yeah, wait for the mic, it really helps the people at home. Hello. Thanks for the talk and thanks for the library too. I think we're using it quite a lot. Oh, okay. Yeah, yeah.
My question is: you said sometimes you get responses from the server that you didn't even ask for. Does the server also send data without you asking anything? So... I mean, it will only send data right after... sorry, let's go from the start again. It will not send data on its own if you don't send any command. You have to send a command, and then it replies to the command and adds its own unilateral responses to it, which can be a bit arbitrary. It can be anything, really. It's usually at the end, just before the OK response, that you get some extra data, and you have to somehow distinguish it from the regular data. But yeah, it doesn't really work otherwise in practice. Yep. Oh, yeah, yeah, yeah. I'll just add to that a little bit. So the IMAP standard is quite specific regarding this, and it says you need to be able to receive any response at any time. So it's in the standard, but in practice, the thing we learned is that you should not trust anything that's in the standard, and to the best of my knowledge, most servers don't do this. There are exceptions, for example the untagged BYE response, when the server does a shutdown. Yeah, as answered, maybe you can explain a bit more, but to the best of our knowledge, most servers don't do it, because when we tested some clients, many clients, and I mean most of the clients, they crashed when we sent this. So I think there's a reason why it's not so common in the real world. Okay. Okay. I just wanted to say that if you consider the client-server interaction more like the client holds a view of the server, and the server updates that view whenever you send a command, then it starts to make a bit more sense. Yep. But it can be hard to architect a client against this IMAP concept. Sometimes you don't want this kind of thing.
But yeah, it's a good mindset for sure. All right. Any other questions? Regarding IMAP as a cache-fill protocol, where the client has a view and the server fills in the client's view, is the only way to write an IMAP client that will preserve your sanity over the years. If you try to act as though this were a web server, you will have pain, and it goes on over the years: each new server will surprise you in some way. Painful. Don't ask me how I know. Well, your code knows. All right. Thank you very much. And thanks again to the two presenters, and we come to the next talk.
[JMAP] JMAP: Getting Started
So now, after we dove a little bit into the old specs and standards and, you know, the details of IMAP, we are going to hear a lot about JMAP, which is a new set of standards that has been engineered over the last couple of years by some very engaged people who, in parallel, have also been contributing a lot to IMAP, and still do. And we are very happy to have one of these persons here, a representative from Fastmail, which has been a company very instrumental in putting a lot of effort into this new set of standards. And yeah, Rick, the stage is yours; let's learn about JMAP. All right, applaud fast, because I've got a lot of slides and a little bit of time. So we are going to talk about JMAP. Funny story: I was pitched this talk where I was going to talk about JMAP: what is it, how does it work, why is it so great, how can you use it, how does Fastmail use it. I covered everything; it was a really good talk, it was like an hour long. And then I looked at my email as I was coming here and it said: you get 15 minutes. And it had to be in PDF, so this is the absolute-minimum.pdf of my slides, and if you want to hear the whole thing and see all the builds and all the animations and everything about IMAP and JMAP, that can be arranged very easily; talk to me later. That's me, I work at Fastmail, I'm not going to talk about myself, we don't have a lot of time. Let's talk about IMAP. Who was here earlier for the talk about what I wish I'd known before writing an IMAP library? You? Okay, well, you missed a lot of horror stories, but I'm going to give you some now. This is IMAP, and I'm going to be real brief about it.
What you're seeing here is the server in white, the client in yellow. We log in, it says yeah, you're logged in, and now we select an inbox. Okay, this is the IMAP protocol, very basic, but here are the beginnings of all the parts of the grammar you need to parse, and it's a bunch. And if you were here earlier, you saw lots and lots more stuff: weird literals, weird ways the interaction with the server changes how you parse the response, synchronizing and non-synchronizing literals. It's a complicated protocol, and it's not like other protocols you're using, and there's a really simple reason for that which I'll get to. Oh yeah, right, this is the protocol to do stuff, and then the payload of the message is MIME, which is like another thing nobody wants to deal with. It works great and it pays my salary, but I mean... Say what you want about HTTP and JSON, but at least it's not this stuff, right? You probably all know how to use HTTP even if you don't know how it works under the hood, and you probably know how it works because it's really nice and simple. So I had lots and lots of slides talking about how weird IMAP is, and I would love to tell you about it, but I'm just going to tell you about this one thing, and this was touched on earlier. Blah blah blah, server and client are talking, and eventually the client says: I want to mark message 12 deleted, store the flag \Deleted onto that message. And the server says: great, you have fetched this information. And this is where people get really confused, and it comes down to something that was said earlier: the only way to understand IMAP is that IMAP is a cache invalidation protocol. It's a protocol that tells you what to do with your cache.
So you've got a server and you've got a client, and the client can send basically the commands you expect, like I want to fetch or update or create or delete messages, and the server's response is: in response to that, here is how you should update your cache. And if you don't think about IMAP that way, you're going to have a bad time. Everything works this way. If the client says I want to work with the inbox, it says SELECT INBOX, and the server says there are 172 emails and these flags exist, which is a way of saying: here's how to initialize your cache. When you say I want to look at my new mail, the client says fetch these things, and the server says you fetched these things, which means: put these in your cache. When you say I want to mark this mail read, you say store this flag, and the server says: put this in your cache. That's how it all works, and you have to start by understanding that even to understand IMAP. I want to talk much more about IMAP... okay, there is one more thing though. This is another fairly basic IMAP conversation where we're saying we want to come up to date, and coming up to date is really important. See, at the beginning we say QRESYNC; that means we want to quickly resynchronize our IMAP storage after being offline. So we say QRESYNC, and that our client state is 123. That just tells it what the state was the last time we synced. And we get told: great, your next sync is going to be 130; here are all the changes to apply, and when you're done you'll be at state 130. Without this, IMAP kind of sucks. I mean, it's better than POP, but one of the great things about it is you can synchronize, go offline, come back later and quickly get up to date no matter what else has been going on. Okay, now you understand IMAP. Good job, everybody. Yep. Who wants to go implement it? Yeah, these four freaks. Okay, the good stuff is good, but the bad stuff sucks, and there's so much bad stuff. So, good stuff: you can resynchronize from a previous session. Great.
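That resynchronization step might be sketched as follows (a toy model: the QRESYNC-style catch-up applies the server's change list and records the new state; the data shapes are invented for illustration):

```python
def resync(cache: dict, server_state: int, changes: list) -> int:
    # Hypothetical QRESYNC-style catch-up: apply the list of
    # (msgno, items) changes the server accumulated since our last
    # known state, then record the new state it told us about.
    for msgno, items in changes:
        cache.setdefault(msgno, {}).update(items)
    return server_state

cache = {}
# We were at state 123; the server says "your next state is 130,
# here are the changes to apply".
state = resync(cache, 130, [(5, {"FLAGS": ["\\Seen"]})])
```

After this, the client is fully up to date at state 130 without refetching the whole mailbox.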
You've got a domain-specific model; IMAP is built around email. Really nice. How about the bad stuff? Okay, the data format sucks, the transport layer sucks. The code that's out there is mostly not great. The key features of IMAP aren't always in the core protocol, so you need to make sure you've got the right extensions loaded, the right capabilities available, or you're implementing to the worst common denominator. And there are way, way too many parentheses. Okay, so this is why we built JMAP. JMAP is the JSON Meta Application Protocol. It's just JMAP; it's IMAP plus one, right? This is what it looks like. So already I hope people are feeling better; you know what this stuff is, right? We're posting a request to the JMAP endpoint, and we say: I want to get these emails. Great, right? So just like everything else, it's a RESTful protocol, kind of. Here's what you get back in response. You said you wanted to get emails one, two, three, four; here's one of the ones you might get. You did an Email/get, you're getting a list of messages; this one has ID one, and there's its subject, and there's more stuff. But it looks like this; you can parse this. Anybody knows what this means. Here's a bigger context of it, so you can see there's an ID and there are parts of the body and the subject, but the thing I want to call your special attention to is: it's got one simple date format. Yeah, I mean, you could stop there and it'd be a pretty good improvement on IMAP and MIME. But we're going to keep going. Here's another thing. When the server responds to you, it can say: yeah, you did just get these messages, and by the way, your email collection is at state 616. It's just like that QRESYNC thing. It's going to let you say later: I've got mail cached and it's all up to state 616; hey server, tell me what changed since then. And the server replies, and what it says is: here are the changes. You were at 616, you will be at 717.
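A minimal JMAP request body for an Email/get like the one described might look like this (the account id "A1" and the email ids are made up for the sketch; the capability URNs are the ones registered by RFC 8620/8621):

```python
import json

# Build the JSON body that would be POSTed to the server's JMAP
# API endpoint. "a" is the client-chosen method call id.
request = {
    "using": ["urn:ietf:params:jmap:core", "urn:ietf:params:jmap:mail"],
    "methodCalls": [
        ["Email/get", {"accountId": "A1", "ids": ["1", "2", "3", "4"]}, "a"],
    ],
}
body = json.dumps(request)
```

Any off-the-shelf HTTP client plus a JSON library is enough to speak this, which is the point being made.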
These two IDs were created, and this one has changed in some way. And then you can decide to do what? Update your cache. What do you do? Maybe you refetch those messages. Maybe you just invalidate the local storage. But you know how to change your cache. It's just like IMAP: JMAP is a cache management protocol. It's just easier to use. Here's another example. Email/query is basically what we call search, when you search your email over IMAP. So we're going to search for mail that's been flagged and that's from me. Really simple. And the response to that will look like this. You did an Email/query; here are the IDs that result from that. And the reason it gives back IDs: it's about managing your cache. You should have messages cached. If you don't have these, well, now you can fetch them. But if you did have them, why send you the messages back? You should have a cache with these messages. If you didn't, you would go ahead and say: great, Email/get these messages. I didn't have them but I want them, so now I get them. And it works great. It makes sense. You can think about this really easily. But we should talk about IMAP again. In IMAP it works the same way. You say I'm going to search flagged messages, and it says here they are, and then you say I'm going to fetch those. Right? Makes sense. Same thing. IMAP and JMAP look the same in a lot of ways. But this is what you don't always see in these diagrams: where the round trips come in. Right? First we search; it goes to the server; the server computes the answer, sends it back. Then we say: I need those messages, give me those messages. It goes to the server, the server finds the answer, the server sends it back. You're waiting for the speed of light back and forth twice. That's what happens here too, right? You say I want to do a query, I get the answer, I ask for those messages, it goes to the server again and it comes back. So the same waits sit here. But you don't have to let them sit there with JMAP.
Because when you write your query, you can write this: I want to do a query and a get. And what is the get going to fetch? I don't know the answer yet. That's okay. You tell the server: which IDs do you get? They came from another thing I asked you to do. So get the IDs by looking at "a"; it should be an Email/query. Get the IDs out of the response that you compute, before you send anything back to me, and do the method call with those. It's called a back reference. And you can have a whole bunch of method calls that back-reference one another, to let the server do all the work and only do one round trip back to you. So you get one wait state. Really good. Okay. A couple more things. This is a larger section of a JMAP request; I've put in some more things I've been skipping on these slides. Mostly you've been seeing this stuff, actual method calls, but what's up here is good too. This is called the using block. It tells the server what capabilities you want to use. This one's really simple. If you squint you can see we're using core, which is like: yeah, I'm speaking JMAP. And mail: again, I'm looking at mail. But you didn't have to squint, apparently; I had a build. But you can have lots of other capabilities. At Fastmail we have contacts and calendars over JMAP, and those are going through the IETF now; they'll be RFCs, and we have lots of other stuff too. What that means is, if your server supports mail and contacts and calendars and other stuff, when you come back from offline, you can synchronize everything with the same request. Not just the same protocol, but: hello, I'm back online, please get all the changes since my offline state and fetch the updates to me, all at once. You can also write your own custom data types for whatever appeals to you, whatever your business needs to use; add it to your implementation. Because even though the data types in JMAP are domain-specific, we let you build your own.
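The query-then-get chain with a back reference might be sketched like this (result-reference shape per RFC 8620; the account id and filter values are made up):

```python
# "#ids" tells the server: take the "ids" argument from the response
# to the call tagged "a" (an Email/query), at JSON pointer "/ids",
# before replying -- so both calls complete in one round trip.
request = {
    "using": ["urn:ietf:params:jmap:core", "urn:ietf:params:jmap:mail"],
    "methodCalls": [
        ["Email/query",
         {"accountId": "A1",
          "filter": {"hasKeyword": "$flagged", "from": "me@example.org"}},
         "a"],
        ["Email/get",
         {"accountId": "A1",
          "#ids": {"resultOf": "a", "name": "Email/query", "path": "/ids"}},
         "b"],
    ],
}
```

The server resolves `#ids` itself, so the client pays the speed-of-light cost once instead of twice.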
Anybody can build their own just by describing how those methods will work. I'll talk about it just a little bit. Fastmail uses this for mail filters, your preferences, your credentials, your DNS, your files and billing, all kinds of stuff. We just do it over JMAP because it's great. Okay, getting close to the last things. We also give you EventSource. EventSource is a long-running connection; I'm old enough that I still call it Comet, right? You connect to the web server and you say: tell me when things change, and you stay connected. And every once in a while, the server sends you a little blob like this saying: oh, there's an update to your email state. Oh, email and contacts have changed. And when that happens, what does your client do, sitting there connected? It invalidates the cache. It can refresh things. It can update the screen immediately. So, IMAP has this with something called IDLE, but CalDAV doesn't, CardDAV doesn't. And when you do this on your mobile phone, IDLE is not going to help you much, because Apple sure as hell is not letting your phone sit there with a live connected TCP stream to your IMAP server all the time. So people build these interstitial servers, instead of getting a web push which would just directly send your phone a message. And JMAP supports Web Push. So you can just get real-time updates from all these protocols. So this is our IMAP replacement. We get rid of just about all the bad stuff and add all this good stuff. JMAP is HTTP and JSON; anybody can use it. Avoiding round trips by combining requests. Putting lots of data types in one place, and real-time synchronization. And the cost is that not everybody's using JMAP yet; it's growing, but it's still pretty early, and there are way too many squiggly braces and double quotes. But that's a price I'll pay. Okay. So what now? You want to know how this works? The first thing you should do is go look at this repository, fastmail/JMAP-Samples.
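A "state changed" blob like the one described arrives as a server-sent-events frame; a minimal parser for one frame might look like this (field syntax per the HTML SSE spec; the payload shape here is illustrative):

```python
import json

def parse_sse(frame: str):
    # Split one EventSource frame into its event name and JSON data.
    event, data = None, []
    for line in frame.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
    payload = json.loads("\n".join(data)) if data else None
    return event, payload

frame = 'event: state\ndata: {"changed": {"A1": {"Email": "717"}}}\n\n'
```

On receiving such a frame, the connected client would invalidate its cache and, say, issue an Email/changes call for the affected account.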
It's code that just does some real basic stuff with JMAP, and you won't understand it yet, but it's going to give you an idea of what JMAP use looks like in its simplest form. Then it's time to read RFCs. Yes. Don't worry, they're actually pretty good RFCs. You should look at these if you want to play with JMAP. The first one is 8620, which is going to tell you what the basic methods are, and then 8621, which tells you the data types. So 8620 is going to tell you things like how you get, how you set, how you do changes; just what those are, and they work on any data type. 8621 is going to tell you the specific data types that we use, like mailbox, thread, email and so on. Everything else, you just learn more data types, in calendars and contacts and so on; that's basically how the protocol works: you learn the data types on top of the core methods. Some highlights from the RFCs. Yeah, okay, I've got a minute and 18 seconds before questions. Email is the most complicated data type in JMAP, for obvious reasons: emails are big and weird and complicated. JMAP does a great job of making them easy to deal with. Here's an Email/get. When you do a get, you can also say which parts of the thing you want to get. Don't get every property; just get pieces. So I might say I want the from, to, subject, preview (like the little snippet you see in your mail client), and its mailbox IDs. So what do you get back? This. I have a build. Great. The to and from come back as structured objects that have parsed the email headers for you. Nice. The subject comes back decoded; that's ASCII, so that was a poor choice of example string, right? But it comes back decoded. The preview is decoded, and mailboxIds is this weird set thing. Why is it an object instead of just one mailbox ID? Because the message can be in multiple mailboxes. And if you hit me up later, I can tell you about labels mode, which is what we use this for. It's really nice.
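Handling such a property-restricted Email/get response might look like this (the response content is made up; the envelope shape, with `methodResponses`, `state`, `list`, follows RFC 8620/8621):

```python
# A made-up Email/get response where only a few properties were
# requested; mailboxIds is an object because a message can live in
# several mailboxes at once.
response = {
    "methodResponses": [
        ["Email/get",
         {"accountId": "A1",
          "state": "616",
          "list": [{
              "id": "1",
              "from": [{"name": "Alice", "email": "alice@example.org"}],
              "subject": "Hello",
              "preview": "Hi there...",
              "mailboxIds": {"inbox-id": True, "archive-id": True},
          }]},
         "a"],
    ],
}

email = response["methodResponses"][0][1]["list"][0]
mailboxes = [mid for mid, present in email["mailboxIds"].items() if present]
```

No header decoding, no MIME walking: the structured objects are ready to render.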
So, headers: you could fetch the subject, but you could also fetch the header called Subject. And when that happens, you get back the quoted-printable, the literal thing. But if you want, you could instead say: give me the subject as the literal bytes, or give me all the Subject headers (because maybe there are multiple subjects), or all of them with the text decoded. You can get anything like that. I've got no time left, so I'll just show you this. When you fetch the body, you can get the blob ID. Don't do that; that's where you have to MIME-parse it yourself. Instead, you say you want to fetch the text bodies and all their values, and you get something like this: here are all the bodies you need to display the full text of the message. There's no MIME parsing, there's no remembering: what do you do with multipart/alternative inside multipart/related? How does that... no. Just do that. Okay. Yep. Time for Q&A. The first thing I will say is: you can ask me for more later, use Fastmail, blah, blah, blah. How about questions? All right. Same here. Hi. Thank you very much. So, one quick question about adoption. Did you reach out to... because when looking at this protocol, and I've been playing around with it for some time now, it looks fairly similar to whatever Google and Microsoft do. I'm not familiar with those companies. Yeah, yeah. So is there any chance that these guys would be interested in adopting this? Yes. I mean, I think I can just say that. You can imagine Microsoft, Apple and Google all standing around a well, like in a spaghetti Western, with their guns pointed at each other: who's going to move first? Right? Apple's client is by far the most popular mail client in use. Google's servers are the most popular servers. If either one breaks, we're in.
And I've spoken with people at these companies, and they're interested, but of course it's a huge amount of work on something that, even though it's clearly technically superior and a big win, is a gamble. It hasn't won yet. I'm pretty optimistic that we're going to see things happen, but I don't have any secret knowledge. Yeah. Thanks. Hi, thanks for the talk. What about JMTP? Yes, JMTP. Yeah. So, replacing server-to-server communication is a much more fraught problem than replacing what your client does. Yeah. So, submission: are you asking about submission? Okay. So mail, MTA to MTA, right, the full exchange of mail between different servers, the Fediverse of email if you will: that's going to be SMTP, as far as I know, forever. I'd love to see JMTP replace it, whatever the hell that is. But submission, where your mail client says I want to give you this message to be sent: JMAP supports that, and it's really, really good. It has lots of really nice features. It has the ability to tell you: oh, by the way, that mail you sent bounced. It has the ability to tell you how many people it has been sent to. And the way that you create messages as a client author is much, much simpler. You don't have to think about constructing MIME bodies yourself. You can just say: here are some attachments, here's the text and the HTML, and the server can do everything for you. So it does replace that. Also, because it's one protocol, you're never in the situation of: I can fetch mail, but I can't send mail, because one server's up and one server's down. It just always works. What do you do about encrypted messages? So, like OpenPGP or S/MIME sorts of things? Yeah. So, what do we do about encrypted messages? Punt. Well, there are some RFCs about S/MIME and handling S/MIME messages, I think all by Alexey, if not mostly by Alexey, that I would say are optimized for the server having access to your key material, right? Is that a fair way to describe it? Yes. Yeah.
And there have been discussions about how we would deal with encrypted messages when the server doesn't have your key material and only the client does. We've talked about it; it's complicated, and I think there are interesting things we can do. But generally, JMAP is built around the idea that whatever the server can see, you can see. And encryption, as usual, makes things less convenient. All right. Thank you again very much, Rick. I think he will be around.
[JMAP] OpenXPort JMAP: a PHP library for Data Portability
All right, we head on with the next talk. The floor is yours. So, let's wait for the room to cool down a bit. So, I'm one of the lucky ones, having only five minutes for my talk, so I'm going to keep it very brief. I hope you can hear me. Good. So I'm Joris, I work at audriga. We do quite a lot of work on data portability, and that's how we came to JMAP. So Ricardo already did quite a good job of presenting what it's all about. For us, the main thing we wanted it for is having a unified API. I think there was one slide where he said we add files and calendars and contacts and whatnot, with our own extensions for that. And it works really well for that, actually. Yeah. So I'm just going to skip that slide because I don't have much time. Yes. So, one thing that was not mentioned in the previous talk is that JMAP Calendars and JMAP Contacts build upon CalDAV and CardDAV, which themselves build upon iCalendar and vCard. So there is a modern replacement for iCalendar and vCard, called JSCalendar and JSContact, and a modern replacement for CalDAV and CardDAV, which is called JMAP Calendars and JMAP Contacts. And that's what we are mostly using, heavily, in addition to a bunch of other data types that we also added. So the work that we did: first of all, we have a client and we have a server; we move data from one service to another, data portability. The client is a Java client, so we collaborate with Daniel Gultsch here. We have added a lot of features to the library already. We still need to work out how to combine that well with what is already there, because we would also like to see the JMAP Java library become the go-to library for JMAP in the Java world. And on the other side, on the server side, we have our own software, called OpenXPort, which basically makes it very easy, or is supposed to make it very easy, to add a JMAP API to PHP-based systems.
We already added support for quite a lot of data types, or verticals: files, calendars, contacts, and so on. So it can also be used to lift files that are on a... it's an ongoing project, where you could attach a JMAP API to files that are somewhere on a server, and then you can migrate those away. And obviously, we support JSContact and JSCalendar. The RFC for converting between JSContact and vCard already exists, and another one is a work in progress for converting between iCalendar and JSCalendar, to make it easy for developers to start with those formats. Yeah, so basically that's what we extended. Right now we have a JMAP API for Nextcloud, Roundcube, the ancient system SquirrelMail, and Horde, which is more or less an ancient system too, I would say. Yeah, we already use it in large-scale migration projects with a lot of users. So, let's finish with the last slide; I'm out of time. There's also a JMAP Dart client from Linagora that we are currently extending, and we're building a JMAP CLI around that. Yes, and there are also other specifications that you could read up on. I didn't quite finish in time, I'm sorry for that. Oh, fine, thank you. Looking around, here's one: how many lines of code is your Java JMAP client, and what does it require in direct dependencies? We might even relay that to the next speaker, I think. Yeah, but our client is quite big, actually; the library that we're using, though, is quite lean, I would say. Now I don't feel bad at all. Any further questions? Otherwise, I think the next speaker may come up, which is actually Daniel Gultsch, the author of the aforementioned JMAP Java library and some tools.
[JMAP] Intro to Ltt.rs, a JMAP client for Android
It's fine. Anyway, good morning everyone. My name is Daniel. Today I'm going to take a few minutes to tell you a little bit about a JMAP-only client for Android that I've been working on for a while. But first, a few quick notes about myself. I usually work in instant messaging. I'm an XMPP developer; I am on the Council of the XMPP Standards Foundation; I develop an XMPP client for Android called Conversations. And yeah, JMAP is a long-term side project of mine. I checked yesterday: I registered the ltt.rs domain in 2017, and I think I've been working on this for even longer than that; somewhere on my hard drive there's an implementation for the pre-RFC JMAP thing that Fastmail wrote. And yeah, these days I develop the aforementioned Java library and the Android client, Letters. So, why JMAP? As someone who's starting from scratch: I think you already got the sales pitch for JMAP. You have a sane set of extensions. You can do send and receive over the same protocol. JSON parsers are readily available; you don't have to parse whatever IMAP is. On top of that, you don't have to do any MIME parsing. If you ever wrote a MIME parser, you know how much of a relief it is not having to do that. It has built-in push support; especially if you're targeting the web or modern mobile phone operating systems, it's good to have native push. And yeah, essentially just see Ricardo's maybe-omitted slides on how bad or how weird IMAP is, and you pretty much know why I went with JMAP. So, a little bit about the architecture. The way Android applications are developed has changed quite a lot in the last 10 years. Google has released a set of libraries they call Jetpack that make application development a lot easier, and Letters tries to use a lot of them. For example, there's Room, which is a database abstraction layer where you basically define how your UI displays the information in the database.
And then whenever you write to the database, your UI automatically gets updated, and only those things that have changed. So the way I implemented it is that my JMAP library has a generic storage backend that's then implemented with Room. We write data to Room, and then magically our UI gets updated, and we don't have to do anything. And also, because my main job, again, is developing Conversations, which by now is like 10 years old and quite legacy, Letters also acts as a sort of playground for me to work with new Android APIs, such as Material You, which is the new design language, or predictive back, things like that. So, you already heard that both IMAP and JMAP are essentially cache management protocols, and that allows us to have great offline capabilities in Letters. So all queries, whether you view a certain mailbox or even if you do a search, those are all cached. So if you retry a search, or redo a search when you're offline, you still see all the search results. And then all user actions are handled by another Jetpack library called WorkManager, which automatically retries those actions when the user comes back online. Yeah, while the app is in the foreground, we use WebSockets and EventSource to listen for server-side changes and refresh the UI. And when the app is in the background, we have a fully open-source Web Push implementation. We don't actually use the Play Services library; we talk directly, with our own open-source code, to Firebase, or the Google Play Services, to retrieve a Web Push URL. You can actually trick Firebase into giving you a Web Push URL, instead of doing the application-server thing that you might be familiar with from other Android apps. But that requires VAPID, Voluntary Application Server Identification, which JMAP currently does not support, and I'm in the process of writing an RFC for that. And yeah, because we have native Web Push, we can also hook in other push implementations that are not bound to Google.
For example, UnifiedPush. And the way that works is, for example, that the JMAP server can tell my XMPP server to tell Conversations to wake up Ltt.rs, and then Google is not involved at all, and I can self-host every part of that. We also have native, enabled-by-default Autocrypt support. No plug-in required, it just works. You see a lock icon on your compose screen if the other party supports it too. During account setup, we ask for key import if we previously detected Autocrypt Setup Messages; just refer to the Autocrypt spec on how that works. But server devs, please allow us to search for arbitrary email headers, because we need that to discover the Setup Message. That's it. Thank you for your attention. You will find the code of the JMAP library and the Android client on Codeberg. If you want, follow me on Mastodon, I'm daniel at gultsch.social. The source code for my slides is also online. Yeah, thank you. Any questions? Thank you. Any questions about Ltt.rs or JMAP? Come on. So you said there's no need for a MIME parser. Is there really never any reason to have a MIME parser yourself? Yeah, I didn't want to put that on the slides, but as soon as you do PGP encryption, you do have to do MIME parsing. That's what it was: oh, damn, now I have to deal with MIME parsing. But the MIME in most PGP messages that I encountered is a lot saner than what you might encounter on wild email servers. So yeah, that's a relief. All right. Any further questions? For the push, I wanted to know: do you use UnifiedPush to receive the notifications, or how does that work? Yes. Yeah, so JMAP has built-in web push support, which is specified in an RFC as well, and then you can either speak web push towards Google and let Google relay your messages, or use UnifiedPush. And you best go to unifiedpush.org if you want to learn more about the self-hosted version of UnifiedPush, because that's too complicated a topic for a five-minute Q&A session. All right.
Any further questions? Otherwise, thanks again to Daniel. Thank you.
[Servers] Aerogramme, a multi-region IMAP server
Hi everyone. So I will present Aerogramme, which is a multi-region IMAP server, and the goal of this talk is to discuss this multi-region thing. But before starting, some context. My name is Quentin and I have a PhD in distributed systems, and this talk will be a lot about distributed systems, because that's something I know. And I try to work as much as I can for a collective called Deuxfleurs, where we try to build a low-tech, ethical internet. If you want to know more about the things we are doing, there was a talk yesterday about Garage, where the self-hosted, geo-distributed infrastructure we have is presented. Aerogramme is part of the strategy and the projects of this collective. And also a very nice thing: it is supported by NLnet, and they are very nice, I have to mention it. So first, the problem we want to solve. I like to say that with emails we want to make communication with other people possible when it would otherwise be impossible due to distance. We can achieve this goal only if the underlying system is working, and so this talk will be about distributed systems but also about availability and reliability. And I have three main ideas that framed the decisions when developing Aerogramme. The first is that we should not trust cloud and hosting providers, because they can fail, and when they fail your service is not working. The second is that we think there is some space, when it comes to IMAP server designs, to study and try new designs, new trade-offs. There is no perfect solution, we don't have a magic solution, but we can try new ways and new designs. And in the third part, I will try to convince you that this new design can work in real life. So first, don't trust your provider. The title of this talk says multi-region, so the first step is to define what a region is when you talk about a cloud or hosting provider. So this is the Google Cloud Platform region Paris.
Its name is europe-west9 and it's made of three data centers. And last April the whole region, all three data centers, was unavailable for three weeks. Not totally, but the outage lasted for three weeks in some parts, and it was due to a fire in one data center. And due to some tight interconnection between the data centers and many software dependencies, the other data centers were unable to work, not due to hardware failure but due to software problems. Three weeks without emails: you can imagine that could be very hard when you use email for very important stuff like, I don't know, paying taxes or looking for a new job, and so on and so forth. So the idea, and it's not new, is that you should move to a reliability-first design. You should think about reliability in your service and not rely only on your provider. The book is named Cloud Native Patterns, but we could have named it Distributed Native Patterns, and it has the same kind of example with a region, this time Amazon in the US. The author of the book studies three services, Netflix, IMDb and Nest, and only Netflix took the effort to deploy in multiple regions, and it was the only one still working when this one US region was not available. I think it's the secret sauce of Google when it comes to Gmail or Google Search: it works despite data center failure, despite region failure, because they design their services reliability-first. So it's easy to say that we should design our services reliability-first, but in fact it's hard, like many things. Something which makes it hard is that when you are in the same region, latencies are very low, like one or two milliseconds, but when you consider a multi-region deployment, and I have made a test between Paris and Warsaw in Poland, you jump to 30 or 40 milliseconds. It's not a lot, but when you have distributed protocols, this latency is often amplified, and there was such an example in yesterday's presentation too.
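That amplification can be made concrete with a back-of-the-envelope sketch (the RTT numbers come from the talk; the number of round trips per operation is an assumption, since real consensus protocols vary):

```python
# Rough illustration of latency amplification in a distributed protocol:
# a strongly consistent commit typically needs a couple of round trips
# to a quorum, so the regional round-trip time gets multiplied.
def commit_latency_ms(rtt_ms, round_trips=2):
    """Very crude model: total latency = RTT x number of round trips."""
    return rtt_ms * round_trips

same_region = commit_latency_ms(2)    # ~1-2 ms RTT inside one region
multi_region = commit_latency_ms(35)  # ~30-40 ms RTT Paris-Warsaw
amplification = multi_region / same_region
```

Even with only two round trips, every UID allocation in a strongly consistent design would pay tens of milliseconds across regions, which is why the trade-off discussed next matters.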
So we know that it's hard, but it's even harder in the context of email systems, and the Apache James documentation summarizes it very well. The hard problem is, yes, well done, the hard problem is monotonic UID generation. If you were at the beginning of the dev room, UIDs in emails have been explained. And so they say you have basically two solutions: either you choose weak consistency, and you risk data loss, or you choose strong consistency, and strong consistency is very sensitive to latency, so it will be very slow. So currently the answer of the Apache James developers is: you should not deploy Apache James, or at least the Cassandra part, in a multiple data center setup; you should pay for consulting. Okay. So if we make a wider review of the existing work, and maybe I have missed something, let me know, you have some leader-follower designs, for example Cyrus or Dovecot, and you have some consensus or total-order based designs like Stalwart IMAP, Gmail, Apache James, WildDuck, and so on. This consensus or total order is often outsourced to the database, for example FoundationDB, Cassandra lightweight transactions, or MongoDB. There was also a research project named Pluto that tried to design a mailbox server on a CRDT design. It worked very well in a multi-region setup, but they have an incomplete implementation, because they do not support monotonic UIDs, only sequence identifiers. So yes, it's interesting: if we don't implement the whole IMAP protocol, we can do multi-region way more easily. Our solution: we wanted to implement the full IMAP protocol, and so it's a trade-off. It's not a magical solution, but we decided to live with conflicts. In fact, in IMAP you can have conflicts as long as you detect them and you change a value that is named the UIDVALIDITY. It's not free, it has a downside: it will trigger a full, expensive resynchronization for the clients.
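That detect-and-bump trade-off can be sketched as a toy model (this is an illustration of the principle only, not Aerogramme's proven algorithm):

```python
# Toy model: two replicas allocate UIDs independently, and on merge the
# same UID may have been given to different emails. IMAP lets us survive
# that by bumping UIDVALIDITY, which forces clients to resynchronize.
def merge(log_a, log_b, uidvalidity):
    """Each log is a list of (uid, message_id) assignment events."""
    merged = {}
    conflict = False
    for uid, msg in log_a + log_b:
        if uid in merged and merged[uid] != msg:
            conflict = True              # same UID, different emails
        merged.setdefault(uid, msg)
    if conflict:
        uidvalidity += 1                 # trigger a full client resync
        # reassign fresh, monotonic UIDs to every known message
        messages = dict.fromkeys(m for _, m in log_a + log_b)
        merged = {i + 1: m for i, m in enumerate(messages)}
    return merged, uidvalidity

# Replicas agree on UIDs 1-3, then both hand out UID 4 concurrently.
log_a = [(1, "m1"), (2, "m2"), (3, "m3"), (4, "m4")]
log_b = [(1, "m1"), (2, "m2"), (3, "m3"), (4, "m5")]
mailbox, validity = merge(log_a, log_b, uidvalidity=7)
```

The conflict costs a resynchronization, but correctness is preserved: after the merge every message has exactly one UID and clients are told their cached UIDs are stale.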
So for example, we see two processes, you can imagine two Aerogramme processes, and at the end, for UID 4, the two processes assign the same UID to different emails, and when one learns about the other, there is a conflict. In our implementation, assigning a UID is logged: we have an event log that is not totally ordered but only causally ordered, and we have a proven algorithm to solve conflicts and compute a new UIDVALIDITY. There is a proof in our documentation; if you want to read it or review it, we are interested. And we try to be as clever as possible when we synchronize this event log, to reduce the conflict window. You might say we are cheating because we are changing the problem: we don't try to have monotonic UIDs, but instead we try to handle conflicts correctly. And yes, it's true, but I have two arguments. Often people are tweaking Raft and they are doing bad things. I have two examples: in Kubernetes, an issue that was opened like six years ago is still open, because they are violating some invariants due to caching on top of Raft for performance reasons; and another one is a post-mortem from GitHub, where they also use Raft, which is a strongly consistent algorithm, and they show that they have done some optimizations that break some invariants of the protocol. And you can reduce the risk of conflicts as much as you can; the most important thing was to have a correct solution. So if you want, you can put a multiplexer in front of Aerogramme and always redirect the same user to the same server, and you will reduce even more the risk of having a conflict. So, talk is cheap, show me the mail server. I will be quick on this part, but I've tried a deployment in France, in the Netherlands, and in Poland. You have some screenshots and you can check the IP addresses there are IMAP servers listening on. And on each region, this is the deployment. This is connected to Postfix through the LMTP protocol.
We have implemented LMTP in Aerogramme. And Aerogramme is stateless software: all the data is managed by Garage, which is in fact doing the magic behind the scenes with its geo-distributed design. Yes. And I have a demo, so I will try to show you. I'm just using something like netcat to connect and show you that there is an Aerogramme server listening behind the domain name. After that, I have configured this IMAP server on my phone, and you can see that I have a mailbox. And now, this is the Gmail web UI, and I will send an email to this multi-region server. So the email is sent, and now we wait until it's received, both on the phone and on the computer behind. And that's it. So that's the conclusion. We started with three ideas, and this is the answer. Aerogramme is designed from the ground up for reliability; that was the most important thing to us. We decided to tolerate UID conflicts instead of trying to enforce monotonic UIDs, and we try to handle them correctly and minimize them. And finally, we want to prove that Aerogramme already works in real environments. But Aerogramme is still a technology preview. It's not yet deployed in production, so be very careful when using it; don't use it for real workloads. Now, I think during this year we will deploy it on our infrastructure for real users, and that's one of the future works: we will do as much user testing as we can, because we don't want to lose important information for people. We also plan to implement CalDAV and CardDAV, and maybe, in the end, envision Aerogramme as a groupware server. Something that's also important is performance measurement and improvement: I can say that many design choices we have made will result in Aerogramme using a bit more CPU or memory than your regular email server, and you have to take that into account too. So thanks for listening, and I can now take questions if you want. Thank you very much.
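The causally ordered event log described in the talk also gives each server session a cheap way to stay current: remember how far into the log you have read, and replay only the new events. A toy sketch (the event structures here are invented for illustration, not Aerogramme's actual format):

```python
# Sketch: an IMAP session keeps a cursor into the shared event log and,
# on a change notification, replays only the events after the cursor to
# update its in-memory view of the mailbox.
def replay(view, log, cursor):
    """Apply events after `cursor` to the session's mailbox view."""
    for event in log[cursor:]:
        op, uid = event["op"], event["uid"]
        if op == "add":
            view[uid] = event["flags"]          # new message appears
        elif op == "flags":
            view[uid] = event["flags"]          # flags changed elsewhere
        elif op == "expunge":
            view.pop(uid, None)                 # message deleted elsewhere
    return view, len(log)

log = [
    {"op": "add", "uid": 1, "flags": set()},
    {"op": "add", "uid": 2, "flags": set()},
]
view, cursor = replay({}, log, 0)               # initial sync: two messages
log.append({"op": "flags", "uid": 2, "flags": {"\\Seen"}})
log.append({"op": "expunge", "uid": 1})
view, cursor = replay(view, log, cursor)        # incremental: only the diff
```

Because only the suffix of the log is replayed, a session never rescans the whole mailbox to learn what changed.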
I see one question over there, the gentleman in red. So first, thank you very much for this design. I've been working on distributed email for quite a bit, and UID generation is part of the story. What is your approach to keeping the IMAP session synchronized, especially the modification-sequence-to-UID mapping, IMAP IDLE, and other things like that, with such a design? OK. So we handle the rest of the synchronization in the IMAP protocol: we have views and states that we are maintaining. As I've said, we have an event log, and each Aerogramme server session is watching the event log that is stored in Garage, and when there is a change, we compute the difference. All right. Further questions? Last call? OK. So. Ah, there's one. Can you just say again shortly what Garage is exactly, in a few words? So we say that Garage is a distributed data store. There is one API that is S3, which we often call object storage: it's like a file system, but with way, way, way fewer features, which makes efficient distributed deployments possible. Garage is inspired by a research paper entitled Dynamo, by Amazon, which is the design of a key-value store. And Garage has a second API, named K2V, which is very similar to Riak KV, if you know Basho; it was a company, and they don't exist anymore. So Garage is really about replicating your data and making it available, and you have this object storage API, but also this key-value API. So it's really the foundation of your data layer. And that's a new way, I think, and that's what we wanted to prove with Aerogramme: we can design applications a bit differently and use Garage not only for binary blobs, but also as that lightweight database. So, I think I understood from the website that you also encrypt data at rest, but you haven't mentioned that at all. You're doing it, right? Yes, we are doing it.
It's in the code and it's a choice. Maybe we are keeping it for next year, probably. But sure, yes: all data stored in Garage is encrypted with a key that is derived from your password, so the data in Garage is always encrypted, and it is in plain text only in the Aerogramme process memory. But it's not really ready; we still have to refine many things, but we have many ideas about that. All right, thank you very much again. And I think we will head over to the already mentioned Apache James.
[Servers] Apache James: Modular email server
I am working with Apache James. Basically, first a few words: I'm working at Linagora. Our mission is to promote data sovereignty, and especially to give organizations the tools to communicate together without relying on Big Tech. So we are working on a suite called Twake Workplace, with Twake Mail for e-mail, Twake Chat relying on Matrix for the chat, and also file sharing. As part of this development effort, we were looking, back in the days, for an e-mail server that is easy to scale. At the time we had not yet heard the talk about Aerogramme. We were looking for a modern e-mail protocol; hopefully you already heard about Ricardo's stuff, the JMAP protocol. And we also needed to be able to do deep integrations inside the mail server. So we started with the protocol. I am sorry, I am a bit frustrated, I did not get to speak about JMAP, so we will take one minute to do so. We started implementing JMAP into Apache James back in 2015, before even the normalization effort started within the IETF. We are big fans of JMAP. We implemented the Twake Mail client in Flutter, so it is using the Dart JMAP dependency, which can also be used to write a JMAP CLI, for instance. Basically we are able to take a mobile team that is not at all expert about e-mail and get them to implement a mail client. Things work fine, work fast, synchronization is easy; most of the pains of IMAP are lifted. So Twake Mail works on multiple platforms: iOS, Android, Web. And it is also used on top of other mail servers, like Stalwart Labs. So, about the mail server itself, because it is a track about mail servers: Apache James is part of the Apache Software Foundation. To my knowledge, it is the only e-mail server that is part of the foundation and has an open governance model.
It started back in 2003 from the Jakarta project, so it is kind of a cousin of Tomcat and projects like that. It is surprisingly influential in the Java world: the mailet that I will present later is kind of the servlet of mail, a generic way to process e-mails. Some of the important people within the Apache Software Foundation did actually contribute at some point to Apache James, and for the Netty network library, which is very influential in Java, Norman Maurer is a previous contributor of Apache James. Regarding the overall setup, what I recommend is the distributed setup for Apache James, where basically we host metadata in Cassandra, big binaries in S3, distributed search with OpenSearch, there was a little licensing problem with Elasticsearch, and last but not least, RabbitMQ for messaging, things like IMAP IDLE and stuff like that. Of course, we orchestrate everything and run it on top of Kubernetes, and we are integrated with metric systems like Grafana. So now let's look inside the code. This is more or less the classical e-mail server architecture: you've got protocols on the left, SMTP, IMAP, which call into the mailbox where the mails are being stored, and you submit emails to a mail queue and apply mail processing. What's important to notice here is that you've got green dots. I did not update the slides, but now you've also got a green dot here. They mark where you can depend on simple interfaces in Java, write Java code in a completely separate project, compile it, embed it into Apache James, and configure it. You have a set of extensions that already exist; you can use James APIs, you can inject your own components, and then basically have your code run inside the mail server without touching the mail server itself, enabling it by switching a single line within that e-mail server's configuration. So, sorry, that might be complicated to see from the back of the room.
I did not think about that when I copied and pasted those rectangles. But basically, the mailet container takes things from the mail queue, and the overall design is to have mailets, which are actions, applied conditionally by matchers. So you have two little interfaces that you work with: the matcher represents a condition, and you organize pairs of a matcher and a mailet inside a processor, which is a stream of execution. You have a specific mailet that allows switching processors, and a couple of various basic implementations. All of that is defined in XML and is fully customizable. I will give you a little example: a hello-world mailet that is kind enough to look up the language and print hello world based on it. So a mailet gets the mail and applies an action to it. You can modify the mail, you can trigger some external APIs, and so on. All I need is to depend on the mailet API; from there I compile my project, I get a jar, I register it somewhere in my XML configuration, put the jar into the external-jars folder, and go. So it's actually quite powerful, and you can connect the different sets of extensions together. We've been speaking a bit with Daniel about push. We received a contribution lately for an IMAP extension for push for an iOS application: basically you are able to plug in a mailbox listener that listens to mailbox events, register an IMAP extension that creates the registrations, and you get push working like that. So that's quite powerful. James is written in Java; everything has an interface, and we rely on inversion of control with a library called Guice, which means that basically you can assemble your Guice modules the way you want.
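The matcher/mailet pairing just described can be sketched as follows (shown in Python for brevity; real mailets are Java classes implementing the Mailet API, and all names here are invented). A processor runs each (matcher, mailet) pair in order, applying the action only when the condition holds:

```python
# Toy model of a mailet processor: matchers are conditions, mailets are
# actions, and a processor is an ordered list of (matcher, mailet) pairs.
class Mail:
    def __init__(self, sender, headers=None):
        self.sender = sender
        self.headers = headers or {}

def sender_is_local(mail):                     # a matcher: a condition
    return mail.sender.endswith("@example.org")

def add_greeting_header(mail):                 # a mailet: an action
    mail.headers["X-Hello"] = "hello world"

def run_processor(pairs, mail):
    for matcher, mailet in pairs:
        if matcher(mail):                      # apply action conditionally
            mailet(mail)
    return mail

processor = [(sender_is_local, add_greeting_header)]
mail = run_processor(processor, Mail("alice@example.org"))
other = run_processor(processor, Mail("bob@example.net"))
```

In James itself, the same pairing is declared in the XML configuration rather than in code, which is what makes the pipeline swappable without touching the server.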
And of course you can reuse existing modules, which means that you can make your own tailor-made server with Apache James. As an example: because we need to follow the Apache way, we need to be in open governance, so at Linagora we decided to clearly split the projects. There is Apache James: that's where open standards go, that's where the distributed mailbox is, that's where everything related to modularity and extensibility is. And we reuse that as a framework to bundle our own Twake Mail servers, which have a couple more extensions, things like autocomplete for email addresses and stuff like that, that are not part of the JMAP standard. So we reuse James to actually build on the JMAP standard and build our product. Here is a very nice contribution that we got back in 2020, to give you an idea of how you could use James. The idea is to validate GPG keys: basically, using the Web Key protocol, I would submit my key to that modified Apache James, which would send me an email encrypted with the public key that I've just uploaded. I would reply to that email, which would validate the key and serve it there. It's a proof of concept, it has not been merged into James, but it shows that you can really play and do interesting things with deep integrations. Who is doing POP3? There's one guy in the room doing POP3. POP3 is an awesome protocol because you don't have a UID and it's really, really, really simple. So in France, when you go and see a practitioner, you get a repayment order that is sent to the National Healthcare Insurance, and that of course transits by email, and every insurance gets a mailbox receiving millions of emails a day. And of course, you need to have the damn thing geo-replicated on three different locations and so on and so on. With IMAP, the latency would go crazy; at least here we don't need something like Aerogramme. The volumetry is big.
And of course, they have a very crappy description of homegrown custom formats, and that slide doesn't do it justice: it's actually a couple of thousands of lines of code to get all of that fitting in Apache James. The point here is that when I arrived on the project, they were actually able to write tons of mailets, matchers, listeners and so on themselves and plug it all together. We were also able to rewrite the storage engine, and we had quite a different design to be able to live with some Cassandra restrictions on tombstones and listing millions of emails. Another project was to integrate with MSSanté, the mailing system for French health practitioners. It has some specific security restrictions attached to it, so we were able to build some specific integrations for that customer too, like uploading received attachments directly into their drive. So basically, we have quite a bunch of extensions and modularity going on in there. And surprisingly, even things like banking applications, that's also email, and it's very specific: they have millions of users with very, very, very tiny mailboxes, it needs to be cheap, and they have custom SOAP APIs to access the messages. That's also the kind of thing you can do with Apache James. So I did not cover much of the technical details. I did a hands-on session back in 2019 at the Apache conference in Berlin, so if you are interested in getting more information on the code and watching some hopefully-live coding that did not go too wrong, the talk is online. Thank you very much. Do you have some questions? Thank you very much. Okay. Let's see, a first hand. Thank you. So, are there any pre-existing modules for spam filtering directly with Apache James? You need to speak louder, because I did not understand the middle of the question.
Are there any existing modules for spam filtering that you can use out of the box with Apache James? So basically we are integrated with SpamAssassin and Rspamd, and especially with Rspamd, because we have mailbox listeners, we are able to live-train your spam filters based on the way you move messages. So my answer is yes, there are already some integrations. All right. Further questions? So, here's somebody. Yeah, I have a question. You were talking about these examples from the health system and from banking, and I'm not sure if I understand it correctly: it looked to me like this is using email as sort of an API, in a certain way, right? For very specific procedures and processes. And if that's somehow right, and you may correct me anyway, do you also do special processing of these emails? I mean, is there any special MIME parsing involved, or maybe you can say a few words? So, first, your understanding is correct: Apache James is very modular, and of course it works as a regular email server, but you can use it for all the various corner cases that could be hard to handle with other technologies. Regarding MIME parsing, I'm also the maintainer of the Apache Mime4j parsing library, so of course you can do some pretty complicated MIME parsing within Apache James. Does it play a role in these use cases, in this medical or banking one? Yes. All right, let's see, two more hands, maybe first the other guy, then you. Yes, related to the previous question: are the emails handled by the healthcare system encrypted? So, they are encrypted, and it is mostly transparent to the work that we are doing with Apache James for them. Okay, so is this transport-encrypted or payload-encrypted?
It depends, but there's a lot of things going on with S/MIME. Oh, okay, thanks. Have you seen any mailets created in JVM languages like Scala, Groovy, or Clojure, those ones based on Java? So yes, we have a couple of examples of Scala mailets. We use Scala in some parts of Apache James; for example, the JMAP stack is completely written in Scala. All right, we would still have time for a quick question if there is any. One here. Oh, sorry I didn't see you. Ah, sorry. Yes, okay, a misunderstanding of mine. You mentioned POP3, it's very nice, but I suppose you have IMAP as well. Is it ready for standard IMAP usage, or do I have to do something? Sorry, it was a misunderstanding: POP3 is a horrible protocol, but for that one given use case of needing a highly available protocol that can span multiple data centers, it's so simple that it fits the bill. Okay, and IMAP is separate? We support IMAP with a big range of IMAP extensions; IMAP is fully supported, and we also implement JMAP as a protocol, so a very wide range of protocols is implemented. Okay, fine. Thank you, and thank you again, also Benoit. I hope I didn't miss anything. Thank you. And yeah, we have one more talk in the servers session, which will be Mechiel about Mox.
[Servers] Mox: a modern full-featured mail server
So, good afternoon. My name is Mechiel Lukkien. I'm a freelance software developer from the Netherlands. Last year here at FOSDEM, I first announced Mox, a modern, secure, all-in-one e-mail server. As you may know, running your own mail server has a bit of a reputation for being hard to do, but what I'm here to tell you is: running a modern mail server can be easy. All right. So, thank you. The goal of Mox is to make it really easy to run your own mail server, so that you actually do it, and then you can stay in control of your data and you can help keep e-mail decentralized. Now, Mox is an entirely new implementation, written in Go. That's a lot of work, and you might ask, why would you do that? Because we have so many open source components that you can just use, and that's true. For the past decade, I've put many of those components to good use. But a few years ago, I had to reinstall my machine, so I got a completely new one, and I just felt a bit reluctant to install the same software again that I'd been using for the past decade, for at least two reasons. One is C, the language where small mistakes have big consequences. Don't get me wrong, I liked C as well, maybe in the past, and the software written in C is of very high quality, but I wanted a new machine that would last for another decade, and I don't see C being part of that too much at some point. But the bigger problem is basically the complexity. Over time, as e-mail has grown, new protocols and new extensions have been added, and new software components have been added as well. So to make a fully modern e-mail system, you need many components and you have to make them all work together. I think many self-hosters, at least, stop halfway, so they have a semi-modern e-mail setup. You can make it easier to get all this configured with a distribution or a Docker image or something, but you still have all these components working together.
There are many integration points, a bit of friction, some data loss. Sometimes there are security issues when, you know, message headers are treated as authoritative but were added by some component. So I think what happened is that with all this complexity, some people just stopped running their own mail servers because it was too much work, and they migrated to the cloud, centralizing e-mail, and that's not a great development. So what we need is an easy-to-use mail server, and you need quite a set of features. So Mox tries to deliver many features. IMAP4 for reading your e-mail; SMTP for sending and receiving e-mail; SPF, DKIM and DMARC for message authentication, because just SMTP is not enough. But that's also not enough: you need TLS, of course, for encrypting your communications, but SMTP between servers uses unverified TLS, so you want MTA-STS and DANE to check that you're talking to the right machine. Mox implements both, for incoming and outgoing e-mail. Then there's ACME for the management of your TLS certificates; you want to make it easy, no manual TLS fumbling. Junk filtering is part of Mox: based on historic messages and their junk and non-junk classifications, Mox will reject or accept incoming mail, more about that in a moment. Then internationalization, so you can have Unicode in your e-mail addresses and your headers, both in your domains, with IDN, and in your local parts. Autoconfiguration, in its various flavors, is all supported by Mox, to make it easy for mail clients to find the right server settings for new accounts. Then we've got a webmail included in Mox; we'll have a quick look at that in a moment as well. An admin web interface: all configuration is in files, where you want the full power, but you can use the admin interface to quickly navigate and make some changes, like adding or removing an e-mail address, an account, or a domain.
A web server is included. It may sound a bit crazy, over the top, but modern e-mail basically requires an HTTP stack, with MTA-STS, autoconfig, JMAP soon; it's already part of the deal. What I've noticed is people trying to run Mox and a web server on the same machine; that's really annoying, because configuration gets complicated. Instead, I just added some web server functionality to Mox, for static file serving and reverse proxying, so that problem is also solved. Prometheus metrics and structured logging, so operations become a bit easier. Then the Mox quickstart, which makes all this stuff easy to do. Installing Mox: you take a new machine, you've got a domain, you run the quickstart and you pass it an e-mail address at your new domain. The quickstart will generate a configuration file, DKIM keys, etc., create a new account, and print all the DNS records that you copy and paste into your zone file, or you have to manually enter them in the web interface of your DNS operator; that's not so great. On Linux the quickstart also generates a systemd unit file, so you just enable that and start it, and then you've got a fully working modern e-mail system. All of this is MIT licensed, so you can do whatever you want, basically. Then, as developers, a little bit about the code. As I said, it's a new codebase, a modern, coherent codebase, all in the same style. It's very self-contained, so few dependencies. It's about 73,000 lines of Go and 21,000 lines of tests, mostly unit tests, a bit of integration tests, and some fuzzing tests. There are 11,000 lines of TypeScript, very strict TypeScript, for the webmail and the web interfaces. The code is cross-referenced with the RFCs to make it, not easy, but more maintainable; you can look back and see why you did certain things.
Of course, Mox is written in Go, so it brings a whole bunch of advantages like memory safety and standalone binaries, completely statically linked; it also includes a few assets, so it's really just one file that you need. Fast compilation times, great for developers. Dependency management is pretty much solved in Go. You get reproducible builds out of the box, and that also works with cross-compilation, which is trivial in Go. Now, there's not much to see about a server, but we have a webmail that I can show you. It's not pretty, but it looks mostly like a standard email client, I think: mailboxes, message list, message view. Let's open up a mailing list; there's some threading in there. You can select multiple messages; I'm using keyboard shortcuts as well. Select some unread messages and mark them read. Then there's HTML support, with or without external resources and tracking pixels. Then there's a little example of Unicode addresses. The search is easy to use, and we've got some quick filters on that side. We could send a message, but I'm sending a message from another mail client that should be arriving. There it is. Select some text to quote, as civilized people do, and send a response. That's the webmail. It's not pretty, but it mostly works for my needs of sending and reading email. Then I would like to say many things about lots of features, but I'll limit myself to one thing: spam filtering in Mox. Analysis of incoming messages is based on the historic messages in an account and their junk and non-junk flags. It's always per account: whatever another account does has no bearing on how an incoming message is handled for your own account. Of course, this means that in order for this to work, you need to have the proper flags on all the messages, or as many messages as possible. Email clients don't always help with this, but Mox does, because in the default setup you get an account where messages moved to the junk mailbox get the junk flag.
If you move something to an archive mailbox, it automatically gets the non-junk flag; likewise if you move it to the trash mailbox. Also, if you're in the webmail and you have a message open for five seconds, that's probably long enough for it not to be junk, so it also gets the non-junk flag. That means most of the messages in the store will have these flags set properly. There's a difference in how Mox handles known senders versus first-time senders. Known senders are recognized from the sender address, or just the domain of the sender address, since maybe it's another person at the same company. Or we look at SPF or DKIM signals in a message, or we look at the IP address of the remote server, or various subnets of the IP address. If there are recent historic messages from that same sender, we look at the junk and non-junk classifications of those messages: if the recent ones were junk, we reject the message, and otherwise we accept it. But if it's a first-time sender, we don't know enough about that sender, so we do something else. Bayesian analysis is also part of Mox. It's essentially a reputation of words: you look at the words in the message, then you look at historic messages with their words and their junk and non-junk classifications. If there are too many spammy words in the message, you reject; if there are enough hammy words, you accept. Then you can also configure a DNS blocklist in Mox, but it's off by default for a few reasons. One, these DNS blocklists are often centralized services, and we don't want to rely so much on them. And you would be sending the remote IPs of those you communicate with to some central party, which is also not great. Then, we don't want to break existing email flows, so this is also one of the reasons why it's only applied to first-time senders.
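The word-reputation idea can be sketched in a few lines of Go. This is not Mox's actual classifier, which is considerably more involved; it's a toy naive-Bayes score over historic junk and non-junk word counts, with all names and numbers invented for illustration:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// trainCounts holds word occurrence counts from historic messages
// that carried a given classification (junk or non-junk).
type trainCounts struct {
	words map[string]int
	msgs  int // number of training messages
}

// spamProbability combines per-word spamminess in log space.
// Laplace smoothing keeps unseen words from forcing 0 or 1.
func spamProbability(msg string, junk, ham trainCounts) float64 {
	logOdds := 0.0
	for _, w := range strings.Fields(strings.ToLower(msg)) {
		pJunk := (float64(junk.words[w]) + 1) / (float64(junk.msgs) + 2)
		pHam := (float64(ham.words[w]) + 1) / (float64(ham.msgs) + 2)
		logOdds += math.Log(pJunk) - math.Log(pHam)
	}
	// Convert accumulated log odds back to a probability.
	return 1 / (1 + math.Exp(-logOdds))
}

func main() {
	junk := trainCounts{words: map[string]int{"winner": 3, "free": 4, "prize": 3}, msgs: 5}
	ham := trainCounts{words: map[string]int{"meeting": 4, "agenda": 2, "lunch": 3}, msgs: 5}

	fmt.Printf("%.2f\n", spamProbability("free prize winner", junk, ham))
	fmt.Printf("%.2f\n", spamProbability("lunch meeting agenda", junk, ham))
}
```

A message full of historically junk-flagged words scores near 1, a message of hammy words near 0, and a threshold in between decides accept or reject.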
So, if you've been communicating with someone for a long time and suddenly someone puts their mail server on a blocklist, you can keep communicating with them without breakage. Only if that person really starts spamming you all of a sudden do you mark a few messages as junk, and then the mail filter will just adjust in the future. Now, Mox being an all-in-one mail server really helps with this, because during the SMTP transaction all this historic data, the messages and flags and words, is available for analysis. Then there's special handling for messages from mailing lists and forwards: essentially, most of the analysis is disabled and DMARC policies are not enforced. Now, what do you do with an incoming junk message once it's classified? Well, one does not simply deliver it to the spam mailbox. That's not friendly for recipients or senders, I think, because the sender thinks that the message has been seen and doesn't get a reply, and the recipient may be expecting some message and doesn't get it, so they wait, or they constantly check both the inbox and the spam box. I think it erodes trust in email. I understand that it's done to not give spammers feedback about their spam runs, but users should come first. So, instead, Mox rejects the message at the SMTP level while it's coming in, with a temporary error code and a very generic message. The generic message means that the spammer doesn't know for sure why it's being rejected, and the temporary error code causes the sending server to try again a few times and at some point tell the original sender that this message cannot be delivered, and then they know they can find another way to communicate. So you don't have this problem anymore of lost messages in the spam box. But, just like with the spam mailbox, Mox has kind of the same thing, but different: the rejects mailbox. Anything that's rejected is still stored in this special mailbox.
It's a fixed-size mailbox; old messages are automatically removed. So say you're waiting for some kind of transactional email, maybe you signed up to a website, and it's not coming in. Then you can check the rejects mailbox for the message, because maybe the sending website used infrastructure with a bad reputation. I can just move that message from the rejects mailbox to the inbox, mark it as non-junk, and because of the history-based filtering, the next time messages from that sender will be accepted. The important point is that you don't have to keep checking the rejects mailbox, because the sender knows you didn't get the message, and that's different from the spam mailbox. This works well for me, but if you have ideas on how to improve on this, let me know. Then a bit about the roadmap; there's still a lot to do in Mox. I want to implement a simple HTTP-based API for sending messages and also receiving some feedback, just so web apps, for example, can send some emails with a simple call. If you know of any standardized ways of doing this, let me know. I said "simple", really the dumbest thing, but I guess maybe it can be that simple. Then I want to add calendaring. It's not email, but users, myself included, expect it to come with email. I need some more SMTP and IMAP extensions. JMAP will be coming at some point; so far I've focused on IMAP because all my mail clients were using IMAP and I wanted to have a working mail system, but JMAP will be coming. I want to encrypt all data at rest; that's not currently done. I want to be able to have a second Mox as a backup MX and a backup instance. In order to do junk filtering on the second instance, I will need all the data there as well, the historic messages, so I want to synchronize everything to the other one. And once all the data is there, you can also use it as a failover machine.
So that will be nice. Forwarding to external addresses is not yet done, because it gets complicated quickly; I think modern email is not really set up for that anymore. Also on the roadmap is a way of applying rules to incoming messages in Mox. Then there's lots more on the list, too much for today. So, final slide. It's been a year since I first put out the Mox code and I've gotten quite a lot of feedback, so thanks to everyone who sent in bug reports, made feature requests, or sent in patches. Very helpful. Then also thanks to NLnet; they've been funding continued development of Mox since August last year, and that's been instrumental to being able to keep working on this. Also thanks to everyone who wrote all those RFCs about email; they're excellent and they match practice quite often. So, my call to action today: if you're not doing so already, start running your own mail server, you know, staying in control of your data and keeping email decentralized. You have many options already, and now there's just another one called Mox. So give it a try. Send me an email; it's a great way to communicate. Thank you. Thanks. Oh, I saw you first. You only have three minutes. First of all, I think it's a quite incredible project for one person, and I was wondering: how many third-party libraries do you use, and how much of the code did you write directly to implement all this? Yes. So I think the main external library is called bbolt, a BoltDB fork, which is the database layer. The messages are stored in files; the database is pretty much a key-value store. Anyway, that's the main external dependency. There's something for Prometheus, and then there are a few dependencies that I wrote myself, so those are not really all that external. Otherwise it's mostly the Go standard library and the extended Go standard library. So, very few external things. It feels a bit like not-invented-here syndrome.
So I wanted to rewrite everything, but it has been very instrumental, because sometimes I've made sweeping changes and there's no one else involved; I don't have to make pull requests and try to convince people to do something that suits my needs, so I can do whatever I want. It has really sped up development, I think. Fantastic project. I have a quick question regarding the database. Is the data stored in a sort of database, or could that be changed, and whatever it is, could we use normal Unix tools to just go through the messages? No, you cannot use normal Unix tools. What I really don't want is to have, say, a maildir that someone else also makes changes to, because then I have to do lots of work to make sure that I synchronize all those changes as well. So I've chosen a simple approach: messages are just stored individually in the file system at the moment, and there's one database per account that has the index for all the messages in that account and also stores the message flags, etc. So the database is essential, basically, for all the history and all the data. I could talk for a long time about the database library, but... Okay, quick one. What is your experience with scaling this up? How many users can a Mox instance handle? I've not tried. It could be tricky because of the Bayesian filtering per user, so I have no idea where the limitations are. I would like to see where it breaks, but I don't know at the moment. I've only run it at small scale, really targeting self-hosting, not tens of thousands of users or something. So I see many hands, which is great. We have a little more time since we have the switch of sessions. When people leave in the meantime, maybe be silent so we can use the time for a few more questions. Let's try how many we can get. I didn't see the order, so forgive me. Thank you. Do you have any plans for LMTP support?
No. But why would you use it? Why would you need it? I'm writing a small... Oh, you need the microphone. Sorry. I'm writing a small Mandrill clone, now that they've shut down, and for that I need to be able to put an email message into the server. Yeah, okay. So maybe a better solution would be to do it in the Go code and make a fork or something. From what I've seen, LMTP is almost like SMTP, just with the improvement of getting reply codes per recipient. It's just simpler. It's lightweight. It's just a dumbed-down version for mail drops. Did I get that right? You reject mails but still deliver them in the rejects mailbox? Yes. Whoa. Wow. Scary. Yes. About the reject: so it's basically like greylisting, except... will you continue to reject them, or do you do anything special if they come back? Yeah, so if they come back with the same message, I deduplicate, based on the Message-ID, or on a hash of the entire message if there's no Message-ID. But it will still be considered rejected. Yeah, it will still be rejected. And does this interact with the junk and non-junk flags from Thunderbird and other IMAP clients? Well, I think the flags are $Junk and $NotJunk, and as far as I can see Thunderbird sets them without the dollar, so that's not useful. But it would also interfere, I guess, because Thunderbird does its own client-side classification; it would work, but it's kind of a duplicate then. So I disabled the automatic classification in my Thunderbird setup and I just let the server do it. I now don't get a lot of junk; the filtering is okay. I still get a few, perhaps one a day, and I just mark them as junk and then it's okay. Okay. Thank you for your questions. There's still a Matrix chat. Mikaela will be around. Thank you, Mikaela. Thank you.
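The deduplication rule from that exchange (Message-ID if present, otherwise a hash of the whole message) is simple to sketch. This is an illustration, not Mox's actual code:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// dedupKey mirrors the rule from the Q&A: use the Message-ID header
// when present, otherwise fall back to a hash of the entire message.
func dedupKey(headers map[string]string, raw []byte) string {
	if id := strings.TrimSpace(headers["Message-ID"]); id != "" {
		return "msgid:" + id
	}
	sum := sha256.Sum256(raw)
	return "hash:" + hex.EncodeToString(sum[:])
}

func main() {
	seen := map[string]bool{}
	h := map[string]string{"Message-ID": "<abc@example.com>"}
	raw := []byte("Subject: retry\r\n\r\nsame body\r\n")

	// A retried delivery of the same message maps to the same key,
	// so it is recognized and stays rejected without a second copy.
	for i := 0; i < 2; i++ {
		k := dedupKey(h, raw)
		if seen[k] {
			fmt.Println("duplicate, still rejected")
			continue
		}
		seen[k] = true
		fmt.Println("first attempt, rejected and stored")
	}
}
```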
[Clients] Taking care of Roundcube Webmail - current status and future prospects
Welcome, my name is Anna and I've been with Nextcloud since 2020. I focus primarily on backend development, and I am responsible for the Roundcube maintenance at the moment. I'm also on the security team at Nextcloud, so a bit more about that later. So, first things first, this is the question we have gotten the most in all of the help forums, blog posts, and everywhere: no, we won't merge Roundcube and Nextcloud Mail. Both products will stay as independent as they have been, and they will receive independent development and independent loving care. So don't worry about that one. Yeah, let's get into the development aspect of things. We have hired a dedicated engineer for Roundcube, so that's a person who will be responsible for the maintenance, for issues on GitHub, for contributions on GitHub. The thing is, the project itself hasn't gotten that much love. There are like 50 open PRs and like 300 issues at the moment which haven't been, I'm not saying not triaged, but it's hard for the community contributors to look into everything, of course. I mean, it's not their main job, and we appreciate what they do, so that's what we want to take care of. What we also want to do is regular security and bug fix releases. This is really, really the main focus at the moment: to get up to date on security stuff and on bug fix releases. There is one person who has been doing a lot of development for Roundcube, and that is Alexander. He has been doing most of the feature development for Roundcube at the moment, but he is not working for Roundcube, he's working for somebody else. So we want to help him get feature development done and do feature releases in tandem with him. We really want to make sure that we're not edging out any contributors. We really, really, really appreciate what they're doing for the project.
So please don't worry if you're a contributor or somebody who wants to contribute; we really would love for you to put more energy into this project. I know a lot of you love Roundcube and have been using Roundcube, so let us know what you think, let us know what features you want to see on GitHub, and we promise we will take care of them and look into them and actually give you a response on GitHub as well, and not just leave it out there in the open. Yeah, as I said, community-driven development is always appreciated; as with every open-source project, I'm sure you all have the same kind of thing there. More care, more features, more love for Roundcube, because it is an amazing product, it is really cool, and I mean, it's been around forever, so let's keep it going. Another thing that's changed is how we handle security issues. Since I'm part of the security team at Nextcloud, we already have an existing process for this, so we're using HackerOne. We haven't discussed yet if we're going to pay a bounty for this, but it is a possibility that we will actually pay you to pen-test Roundcube, and the advisories will be published on GitHub in the future, because right now there is no established mechanism for this, so you don't really find security issues all in one place, with CVEs and everything. Yeah, that's pretty much everything from me on Roundcube. I still have two minutes to go, that is a very short presentation, so let me tell you a little bit about how it feels to take over this project. Actually, it's really scary, because I know a lot of people love the project and don't want to see it in a drawer somewhere, unmaintained. As a developer, it's also a challenge to get into a new code base, obviously, because we have different coding standards at Nextcloud than Roundcube has, and there are different expectations about how the community works with us, or we work with the community.
Of course, there's implementing the email standards, which is not easy, as everyone knows. There's IMAP, which is an old protocol, and it has its challenges, but it also has its cool stuff. It's a challenge, it's an exciting time, it's a scary time, and I'm really looking forward to working more with the project. Yeah, let me know your questions. That's basically it; that's me done. I see you have something. I cannot decide who was first, so I'll start here because you're the closest. I noticed, as the developer of SnappyMail, that more people are integrating SnappyMail in Nextcloud because of the slowness of the Nextcloud Mail app. They also want Roundcube; yes, there is a Roundcube app. Will there be better integration with Nextcloud? We haven't discussed it yet. I really can't tell you; that is for project management to decide. Personally, I have worked on the Mail app and I am partial to it, because it has seen a lot of blood and tears from the developers as well. But yeah, there is a Roundcube app for Nextcloud, and as the code base of Roundcube improves, the probability that the Roundcube app for Nextcloud gets better is very high. Does that answer your question? Okay. Okay, so this question we solved here. Are there any more questions? Yes? Sorry about that. Have you already gathered some experience with HackerOne, and what is it like? Yes, I've worked with HackerOne for two years now; we handle all internal, or rather Nextcloud, security issues via HackerOne. It has produced some good results. Obviously, it's not always easy, because of duplicates and stuff like that, but for how the reports are structured and how you can evaluate a security issue, it is actually pretty decent. Yeah. And it offers an integration with GitHub, so it's not that much work to copy a report over to GitHub and then publish it. Yeah. Is there any interest from commercial ISPs in supporting Roundcube, to use it as a webmail app for their own purposes?
As far as I know, Alexander actually works for an ISP, so I think they might be paying him for that, but you would have to ask him yourself if that is true or not. I know that a lot of ISPs have forked Roundcube and have their own kind of version of Roundcube that they maintain. Hans, I think you mentioned this project from the French government that has their own kind of Roundcube implementation. So I'm sure there is interest, because it is a powerful tool and it works really well. People like it. It's easy to install; I've tried it myself, and it was very nice, very easy to do when Docker wasn't doing its thing. So I hope there will be interest, and I hope there will be interest in the community as well, to get the product back and get it a bit more popular again. Yeah, that is my goal for this. Any more questions? One thing that came to my mind: we have seen for K-9 that there is a list of actual features somehow blocking the big renaming. So I wonder, is there anything you discovered on the roadmap or on the bucket list in Roundcube which you would particularly like to address in the near future? Not yet, no. We haven't done any sort of project management evaluation yet, because we didn't have the developer for it. Now that we have hired a person for this, I'm hoping we can actually get some project management up on GitHub as well. We're using the boards at Nextcloud, so that would be easy; we can sort issues into swimlanes and then work through them. Since it's only one person, progress will not be as fast, but on the other hand, I mean, it's not like nine people can carry a baby in one month; that's the old line from project management. So things will hopefully be getting done quicker than now. I also really hope to get through the backlog of PRs. We have 49 PRs open; the first one is from 2015. There are some bug fixes in there, but I've also seen some features.
I have seen some left-to-right and right-to-left text support, so you can switch for right-to-left languages. That would actually be a really nice feature, because a lot of people use right-to-left scripts. If that could be merged, that would be great, but there is a problem with the CSS classes for the different theme implementations. So that would need to be thoroughly checked, and that is probably something that the developer should and can do: try all the different themes and see how well it works. And also sync with Alex on this; she's had some input. So yeah, maybe you'll see some right-to-left text support soon. All right. No more questions? So yeah. Thank you. Thank you.
[Security] Thunderbird Email Security, plans and challenges.
So, welcome. My name is Kai Engert. I have been working with Mozilla and contributing to the Mozilla code since 2001, also including email, and I've been a full-time employee of Thunderbird since 2019. Today I want to talk about Thunderbird email security and some of the plans and challenges on that topic. We all know, yes, there are some creatures who could read our email: they sit on the servers, some robots scanning, some mass surveillance monsters, and cybercriminals. Okay, we don't like that. The problem is that there is no protection while emails are stored on servers. We do have some TLS transport security in the infrastructure, but it's not enforced. So I think we need more than TLS transport security; we heard about that earlier. Of course, we want and need end-to-end security, for both encryption and digital signatures. Thunderbird supports two separate technologies. There's S/MIME; I worked on that in 2001, before Thunderbird was born, and it's still supported. And we also have OpenPGP, which was previously supported using the Enigmail add-on and has been fully integrated, using integrated code, since 2020. I want to briefly mention some of the things we did in the recent past. We implemented unified status feedback, so you get similar UI for both S/MIME and PGP emails when reading an email. When you compose an email, we also have similar controls to enable or disable encryption. We have made it a bit easier to resolve the problem when you want to send an email but you're missing the recipient's PGP key: we have some interactive UI code to help you find the missing keys. We also added some reminders: when you start composing an email and Thunderbird detects that it can encrypt, it will remind you, in case you want to enable it.
And just most recently, in the new version from last summer, 115, we added a long-asked-for feature, which is that you can optionally tell Thunderbird to enable encryption automatically: if it sees that we can encrypt, it just enables it. Some people have also asked to automatically disable it, but I think it's a necessity to pay attention there, so we have the option to show some warnings to the user. Other things we did: activists, and people who share their computers with others, have asked that we support an individual passphrase for the secret keys. We did that. Some parts are still missing; we need to make it more convenient by adding a cache. We also implemented the Autocrypt-compatible key distribution mechanism, which simplifies group conversations by including the keys of all participants of an email conversation; that's called Gossip. We added that recently, and I think we will have it in the stable version soon as well. And we added support for publishing keys to keys.openpgp.org. Now, I also want to mention a few general challenges that we've just recently seen. Since some providers now add S/MIME in the server-side infrastructure, we are now seeing messages which mix two technologies. So people complain: a user has composed a PGP message to them, and now the whole thing is suddenly wrapped in another S/MIME layer. That's a challenge for the user interface presentation, how you deal with that. One idea I have is, if the outermost layer is just a signature layer, maybe you just ignore that one, but I'm not sure it's the best solution. We are still open for discussions if you have better ideas. Then there was discussion about what we should do if a message arrives with a digital signature that we cannot completely validate as being good. Currently, we say: well, this has a bad signature. But some people say maybe that's not worse than a plain text email.
Maybe we should just stop showing any bad status at all and treat it the same as a plain text email. That's also a pending thing we should do, because there was some agreement to do that in a recent PGP community meeting with other developers. Another big unresolved area is combining emails that have digital signatures with content that is nice and shiny, with HTML and CSS, which many users want to have. The problem is that HTML can be used to manipulate what's shown on screen, so the sender of the email might have seen something different when composing than what you as a reader see. That can lead to confusion; researchers have shown that. So what should we do about it? I don't have a good solution, because nobody agreed to my suggestion to just revert to plain text whenever we have signatures. Maybe we should present signatures more weakly; I'm looking for ideas here. If you have ideas, please send them in. Now let's look at a broader scale. We have the problem that only a small portion of all emails use S/MIME or PGP at all. They're not used much because there are barriers to entry, like Tobias presented: you have to get a certificate, and it's difficult. And then, when you have keys, it's complicated to manage them. And using email encryption at all can have unexpected consequences: if you just set it up on one device, you maybe have a problem accessing your encrypted email from a secondary device. Users can lose their secret keys, and with them they also lose the archive of encrypted email. So I think it's still necessary that we involve the user: the user must be willing to accept the consequences, and must be willing to take care of the secret key file, or lose their archive. So what should we do? How could we get many more people to use email encryption and signatures? I think full automation is not possible, because we have a heterogeneous ecosystem and we need the user to be involved.
That means, I think, that we must better assist users. And that leads me to the question: which technology is easier to use? In the past five years, Thunderbird's focus in that area was OpenPGP, because it was necessary: we had to integrate it to ensure it stayed usable. But now the question is, is it still a good idea to continue to focus on PGP? As we heard from Daniel, there are currently some disagreements about what the future of PGP should look like. Daniel has presented a very optimistic outlook for the future, and I agree, many of the things he said would be nice and great to do. But we have the problem that there is a split in the PGP ecosystem which is difficult to ignore, and that's the problem. Daniel suggested maybe everyone should do both, but that would also require that client applications support keys from both specifications, and I see that as a big complication for users, having to manage different keys for different recipients. I have tried to bring the groups together with many discussions, and I have even suggested introducing a common key format, but there have been no positive reactions to that. Well, from the IETF side I usually get lots of good ideas and willingness to discuss, but both sides would have to agree, and I don't see much openness to these ideas from the other PGP side right now. So I don't know what the future will bring, of course; no final word has been spoken. But at this time, I worry that the future of PGP is a little uncertain. There are conflicting specifications, there might be incompatible implementations, and I don't know how much hope there is for a unified specification. I still hope for it; I think it would be best and we really should see it, but it's not clear whether it will happen or not.
And if that's the case, I'm worried that PGP might become less interoperable and more complicated to use in the future. And with that, is PGP the right way to go right now, to invest more in PGP, when we don't know what the future will bring? My suggestion is, maybe we should wait a little and see how the developments on the PGP side go, and whether there will be more agreement in the future. Maybe Thunderbird should wait. I think what we have right now is working: both specifications have a common base, so PGP is working and you can interoperate right now. It's just that I'm not sure how quickly we should jump on these new ideas and implement them; maybe it's time to wait. And I suggest Thunderbird should continue to support both PGP and S/MIME. But here is one idea, and I'm presenting it as a suggestion, I'm not saying we will do it, and I'm looking for your feedback, so please provide feedback afterwards. We could try to make S/MIME easier to use for everyone. We could try to eliminate the barriers to entry that are currently there. We could say maybe S/MIME is an okay technology for users with a limited threat model, and OpenPGP is more targeted at users with a broad threat model, who, as a consequence, currently have to accept a slightly higher complexity. Why is that? Well, let's look at S/MIME. I think it's more widely available in email applications. And if you, as a user, trust that the certificate authorities do their job right, then S/MIME is easier to use than PGP, because you don't have to do manual checking of keys. And we don't have the transparency stuff yet that was mentioned; maybe we can do it in the future, but right now it's not there yet. And it might be appropriate for people with a limited threat model: S/MIME protects against passive reading. There is a remaining risk of falsely issued certificates; we have seen DigiNotar in the past.
But CAs are regularly audited, and of course they don't want to lose their reputation, so I think the risk of falsely issued certificates is not that big. Also, Certificate Transparency makes it even harder. So I think that remaining risk might be acceptable for many. But in order to follow that idea, we really would have to find a way to get certificates to everyone for free. We would require, like Tobias implemented in his demo, a way to automatically obtain and refresh certificates from inside the email client. And then we would also need something better for looking up the certificates of your correspondents. Maybe we could implement something Certificate Transparency-like for S/MIME certificates, where we maybe even protect against spammers. I'm not fully up to date on whether the CT specifications already redact email addresses, but maybe that would be necessary. And if we have some kind of cloaking with a hash, then we could maybe implement a certificate directory that works like a key server and consumes the information from the transparency logs. Maybe we could use that to make discovery of correspondents' certificates easier. And OpenPGP could be declared the preferred technology for those who don't want to accept that remaining risk of falsely issued S/MIME certificates. They could still do the manual key verification, at the cost of a somewhat more complex technology. So if we get a positive reaction to that idea, maybe making PGP easier to use in Thunderbird could become a slightly lower priority, and for PGP we would rather focus on the security improvements and the interoperability parts.
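The hash-cloaked certificate directory floated here can be sketched in a few lines. This is purely illustrative, not any existing specification: the class and function names are mine, and the idea is simply that the directory is keyed by a hash of the normalized address, so it never stores a harvestable list of plaintext addresses.

```python
import hashlib

def address_key(email):
    # Normalize, then hash, so the directory never holds plaintext
    # addresses that spammers could scrape in bulk.
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

class CertDirectory:
    """Toy in-memory certificate directory keyed by hashed addresses."""
    def __init__(self):
        self._certs = {}

    def publish(self, email, cert_pem):
        self._certs[address_key(email)] = cert_pem

    def lookup(self, email):
        # A client that knows the address can derive the key; listing
        # all addresses out of the directory requires guessing them.
        return self._certs.get(address_key(email))

directory = CertDirectory()
directory.publish("Alice@Example.org", "-----BEGIN CERTIFICATE-----...")
assert directory.lookup("alice@example.org") is not None
assert directory.lookup("mallory@example.org") is None
```

Note that a plain hash only raises the bar: email addresses are guessable, so a real design would also need rate limiting or a keyed hash to resist dictionary attacks.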
And rather focus on making S/MIME easier to use. I plan to post some suggestions in the near future to the Thunderbird email discussion list, where I'll present this idea in more detail, and I will be looking for your feedback. Thank you. Thank you very much. I just want to add one more thing: I somehow expect that there will be many questions, so after this is finished, I will go outside, and I'm waiting for your questions and follow-up discussions outside as well. Hi Kai. May I ask why you are still using RNP over Sequoia's Octopus version of the crypto library? Well, your question implies that I should prefer one side or the other. I don't prefer one side over the other; I don't want to give either of these conflicting specifications an advantage. In my opinion, Thunderbird should remain neutral. The conflicting parties should get together and find a unified specification, and I would like to wait for that. And switching implementations doesn't give me an advantage, because I don't know the intention of Sequoia. Will they fully support both specifications? I don't know. Okay, next question. Are you saying that if we implement v5, then you'll use Sequoia? I'm not making any promises. I'm just saying that the other alternatives are currently not viable, and if things change, we can re-evaluate our thinking. You mentioned that S/MIME can have a lower barrier to entry than OpenPGP. To my understanding, the primary problem with encryption is that the user loses the key and cannot read their email anymore. I don't see how S/MIME has any advantage over PGP in that sense, because I can just as well lose the key, and the certificate authority cannot regenerate my key, unless you want them to hold the key, which you'd rather not. I don't see the advantage. So I think the problem exists with both technologies; it's the same. But yeah, maybe we could introduce a key encryption key.
Maybe we could introduce a concept where Thunderbird generates a key encryption key for users. They back it up with a passphrase, maybe 20 or 24 words written down, which is just a randomly generated symmetric key that we back up in paper form. And then maybe Thunderbird could encrypt all the users' private keys with that single symmetric key. A possible idea that could probably be used for both technologies. Yeah, that's a general idea we could pursue which would help both technologies. Alright, any final question in the room? Have you looked at the SecureJoin standard, and do you think it might be an option for Thunderbird users to have guaranteed end-to-end encryption with verified fingerprints in a very user-friendly way? I have not seen the project you mentioned yet, so you would have to point me to it and we can have a follow-up discussion. Alright, so thank you again.
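The paper-backup idea could look roughly like this. Everything here is a hypothetical sketch: the 256-entry word list is a placeholder (a real scheme would use a standardized list with a checksum, BIP-39 style), and a real client would then encrypt the stored private keys under the recovered key with an authenticated cipher.

```python
import secrets

# Placeholder word list; 256 words means each word encodes one byte.
WORDLIST = [f"word{i:03d}" for i in range(256)]

def generate_kek():
    # 24 random bytes -> 24 words of 8 bits each = 192 bits of entropy.
    return secrets.token_bytes(24)

def kek_to_words(kek):
    return [WORDLIST[b] for b in kek]

def words_to_kek(words):
    index = {w: i for i, w in enumerate(WORDLIST)}
    return bytes(index[w] for w in words)

kek = generate_kek()
backup = kek_to_words(kek)          # the part the user writes on paper
assert len(backup) == 24
assert words_to_kek(backup) == kek  # restoring from paper round-trips
```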
[StructuredEmail] Structured Vacation Notices and Structured Email for Roundcube
All right. So this time I have the pleasure to introduce myself, and somebody else needs to take care that I don't overuse the time. So yeah, my name is Hans-Jörg. I'm from audriga. I have two hats, or two histories. My main email history is in migration and portability, so you've seen some of our JMAP work earlier that day. But actually I have an earlier history in semantic web technology: I was a semantic web researcher, and I did some work on Semantic MediaWiki, if a few of you are aware of that. And this is a new project, actually, where these things converge. Some people who read their email on the console typically don't like what it proposes at all. But yeah, I'd like any feedback on it. Some people might even like it, because it maybe fixes something that HTML email broke. The whole idea is structured email. And I'll present a reference implementation for Roundcube and a particular application, the structured vacation notice, which is probably appealing to email people in particular. So first of all, a claim: email is sort of your personal API, but you're a little bit of a mechanical Turk in there. You need to read it, you need to understand it, and you need to act upon what people, or other services, ask you to do. Second, email is underappreciated; I think everybody here in the room would probably agree. So one of the ideas to bring these things together is to make email content, maybe not in general, but for parts of emails or certain emails, more machine-readable, so that the tools you develop might help people do certain tasks more efficiently, or even do novel tasks. And the very rough idea is basically that, like you have multipart/alternative with text/plain and text/html in an email, you also embed structured data in RDF, a W3C-specified knowledge representation language, according to certain so-called data models.
So schema.org is a very popular data model which search engine vendors have set up; basically, you find it in websites: this is a movie, this is a song, this is an article. And the idea is to also allow users or tools to include that in emails, so that email clients can make sense of what is in an email. So yeah, that sounds quite abstract. How could that look in practice? Actually, this is not something I invented from scratch. Gmail, Yahoo and some other vendors, Web.de in Germany, are already doing it. So if you fly with Lufthansa or a certain airline and you have a Gmail account, and if you opted in, these airlines might already send that schema inside the email. And you might notice there is a special display within Gmail that shows you certain information on the flight, allows you certain actions, might automatically import it to your calendar or, at some point, to your Google Assistant and so on. That's nice. The problem is that this currently only works for select senders: you basically need to register with each vendor to make it happen. It's only there for very few select use cases, like traveling and maybe ordering on the web. And it's unidirectional, only from a service to you; you cannot use it yourself. So it's a little bit against the idea of email, right? I mean, obviously I would probably not send a flight to somebody, but maybe something else. Schema.org alone has 800 concepts, and what Gmail supports is like six of them or something like that. But actually there are already very nice use cases for even this travel information, and there will be a talk just after this by Volker, sitting in the back, so I won't talk too much about that. A second example would be link sharing. There is share-by-email, right? And this is how it looks in K-9 Mail (not blaming K-9 for it): you basically get a URL sent.
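Gmail's email markup works by embedding a schema.org object as JSON-LD in a script element inside the HTML part of a multipart/alternative message. A minimal sketch of such a message (the flight data itself is invented for the example):

```python
import json
from email.message import EmailMessage

# A schema.org FlightReservation; property names follow schema.org,
# the concrete values are made up.
reservation = {
    "@context": "https://schema.org",
    "@type": "FlightReservation",
    "reservationNumber": "RXJ34P",
    "reservationFor": {
        "@type": "Flight",
        "flightNumber": "110",
        "airline": {"@type": "Airline", "iataCode": "SN"},
        "departureAirport": {"@type": "Airport", "iataCode": "BRU"},
        "arrivalAirport": {"@type": "Airport", "iataCode": "STR"},
    },
}

msg = EmailMessage()
msg["Subject"] = "Your flight confirmation"
msg["From"] = "airline@example.com"
msg["To"] = "traveler@example.com"
msg.set_content("Your flight SN110 from BRU to STR is confirmed.")
html = (
    '<html><head><script type="application/ld+json">'
    + json.dumps(reservation)
    + "</script></head><body><p>Your flight SN110 is confirmed.</p></body></html>"
)
msg.add_alternative(html, subtype="html")

# A legacy client renders the text or HTML part as usual; a
# structured-email-aware client additionally parses the JSON-LD.
assert msg.get_content_type() == "multipart/alternative"
```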
And this is what you receive. In this case, you are basically stuck with Spotify: you click on it, you have that Spotify song. But K-9 doesn't know this is a song, and you are bound to Spotify. OK, you can listen to the song, but if you're on Apple Music, it's up to you to deal with that. With structured email, the idea is you could take some metadata, which in the case of Spotify is actually already embedded on the linked Spotify page, so nobody needs to do manual annotation. You could put that into the email along with the link. So your email client would not just have the link; it would know this is a song, Bruxelles je t'aime, from 2021, by Angèle. And it could even match it, for instance, with your local media player if you have that song as an MP3 or something like that. So you could basically dereference the kind of content that got shared. And you get a much better user experience, a little bit like in instant messaging when you send a link, where tools like WhatsApp extract the Twitter cards and that kind of stuff. Another use case, maybe even more fancy, is location sharing, or even live location sharing. Many instant messaging tools allow you to do this, but within their ecosystem. So you're bound to their implementations and their privacy rules, and it only works if you send to another fellow WhatsApp user. So it's also not really open and decentralized. So we built a prototype where you send a location based on a JSON-LD snippet. And we have a prototypical implementation where the client on the mobile can push updates of the location to a URL with a secret UID, which the receiving user can use to refresh it. So if your receiving email client supports it, you could get this user experience. This is an example which we did. And of course, you can also have some fallback, so you can get an HTML email.
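The Spotify case works because the linked page already carries metadata, for example Open Graph meta tags. A sharing client could scrape those with a few lines of standard library code; here is a sketch (the sample page head is invented, roughly in the shape a music service serves):

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..."> tags, the kind of page
    metadata a sharing client could copy into a structured email."""
    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        prop = a.get("property", "")
        if prop.startswith("og:") and "content" in a:
            self.properties[prop] = a["content"]

# A trimmed-down page head, as a music service might serve it.
page = """
<html><head>
  <meta property="og:type" content="music.song">
  <meta property="og:title" content="Bruxelles je t'aime">
</head><body></body></html>
"""
parser = OpenGraphParser()
parser.feed(page)
assert parser.properties["og:type"] == "music.song"
assert parser.properties["og:title"] == "Bruxelles je t'aime"
```

Instead of sending the bare URL, the sharing client could serialize these properties into the email, so the receiving client knows it is looking at a song without contacting Spotify.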
Of course, then it's not the live location, but you can do a fallback like you have in some newsletters: click this link, go to the browser. Even though this is, of course, not the best user experience. And then another very familiar use case for you: vacation notices, out-of-office messages. It's typically something you enable for your email account while you are traveling, to FOSDEM maybe. Maybe it's a weekend and not so many people will write to you, but maybe you arrive back in the office on Tuesday. So you say: I'm staying in Brussels till Monday; please contact my colleague in the meantime. It's still something the recipient needs to act upon manually. But it would be interesting if your email client could actually understand: this is an out-of-office message, with a return date, and probably this is the person I could redirect the mail to if I wanted to. And this is basically what we did. We wrote an IETF draft for this to specify the process a little. And you can leverage most of the user interface data you have from the Sieve vacation extension. This is how we implemented it in Roundcube: we just take the date fields which you fill in there anyway, and the reason, and put this into the structured data. And if the receiving email client is capable of understanding it, it may store this information for the time the user is away, and it can highlight it. And you can even choose, as the user going on vacation, to include it in emails prior to your vacation. So you could say: even if I go on vacation tomorrow, include that metadata already in just any regular email. And so recipients can already see: ah, Michel will be on vacation starting tomorrow, and he wrote me this mail now, so I might hurry up answering him, or something like that. I'm not suggesting it has to be like this; it just illustrates that you can do additional things which you could not do with regular out-of-office right now.
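A machine-readable vacation notice built from the Sieve vacation fields could be sketched like this. The vocabulary below is illustrative only: the actual IETF draft may use different types and property names, and I am simply mirroring the inputs the Sieve UI already collects (dates, reason, substitute contact).

```python
import json
from datetime import date

def vacation_notice(start, end, reason, substitute):
    # Hypothetical schema.org-flavoured payload; field names are
    # mine, not the draft's.
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "Message",
        "about": {
            "description": reason,
            "startDate": start.isoformat(),
            "endDate": end.isoformat(),
            "substituteContact": substitute,
        },
    })

notice = vacation_notice(
    date(2024, 2, 3), date(2024, 2, 5),
    "Attending FOSDEM in Brussels",
    "colleague@example.com",
)
parsed = json.loads(notice)
assert parsed["about"]["endDate"] == "2024-02-05"
```

A receiving client could compare endDate to the current date to badge the sender as away, and offer the substitute contact as a one-click redirect.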
And yeah, what is the current state? For these examples I've shown you, there is an IETF working group that formed very recently; last November was the first meeting. There is a mailing list, so even for those of you not familiar with the IETF, please join that list if you're interested in the topic. Any feedback, any questions, everything is very appreciated. There was already quite some good feedback from the community. The Thunderbird Council made a decision that if there were an RFC, they would probably be willing to implement this or merge it into their code. First drafts already got adopted in the working group; it will still take some time until they take the form of an RFC, but things are moving. We are working on a reference implementation, for which we graciously received money from NLnet and the NGI0 program. This is being published right now during FOSDEM, so you can go to Packagist (not sure if it's already on Packagist; at the latest on Monday). We will provide some guidance so that you can use our Roundcube implementation as a blueprint for your own webmail, even some reusable code, so you don't have to write everything on your own. And there are even first adopters. For instance, I got in touch with the developer of FairEmail, and he implemented the very first beta of it within a day, which was quite an awesome experience actually. If you hear this: I really appreciate it. So finally, maybe a little overview of how this currently works. These are the URLs where you'll find more information. We have one library where we do the extraction of the structured data from incoming emails; this could be reused on the server side of your application. We have two libraries which are basically template libraries. It's a little bit user-experience-ish, so we are still searching for people that are really keen on CSS and HTML design.
So if you know somebody, please help us, because we think it makes sense to have at least a simple example of how to render these cards for very popular kinds of information, so that not every client needs to decide on its own how to render a Spotify song, or a music song in general. Even so, of course, every client could opt to do so; we just want to provide some examples. And we do it both for the actual rendering and for the HTML email, which we want to send as a fallback for those that don't have the fancy client yet. And then there are two Roundcube extensions. One is for structured email as such, where you can do the Spotify thing, for instance, or receive these kinds of things. We are also working on the Nextcloud Mail side, where you can actually interact with the Nextcloud Cookbook app and import recipes that you receive by email. And there is a separate plug-in for the structured vacation notice. That's actually all for the moment. Thanks for listening, and I look forward to feedback and questions. So, yeah, maybe somebody can see. Did I see a hand? Question? Is there a concern (I mean, we've had this discussion that this is just kind of in the background of a mail message, and as long as it's not overwhelming in data size, it doesn't really matter to people), but the question would be: is this the kind of thing where maybe you have a client that's not displaying it really well, where all of a sudden you start having all these random attachments that confuse a user? Because they can't do anything with this themselves; this is all meant to be machine-readable. So, I'll repeat it to check I understood correctly: your question is, is this something that might confuse users if it's somehow mangled, and what are the ideas around preventing confusion of users if a client doesn't know how to handle it?
Two things. First of all, you can see it as multipart/alternative, so if the email client doesn't understand it, it just won't get rendered. And also, it's metadata; it will never be shown if the client just doesn't know about it. So you can use it with existing clients already. Actually, you probably receive such emails personally already, because Lufthansa might include it even in mail sent to OX; you just don't do anything with it. I assume you're reading my emails. Sorry? I'm joking. Yeah, yeah, OK. And actually, even the opposite is interesting. Because we had people coming to us that had exactly the problem where you get a PGP key or an email signature attached to an email, and the email client doesn't even know what that is. You could actually use this structured data to provide additional information about what certain email attachments are about. So you could even help email clients provide a better user experience in that case. What's the incentive for any provider to actually send structured emails? Because it seems that it's exactly the opposite: Spotify doesn't want to send what song it is; they want people to go to Spotify and nowhere else. And same with the Lufthansa thing: they want to publicize their brand, they want to upsell services. They'd rather send just a generic message without those possibilities. So the incentive is to not use this. OK, so you're asking: what is the incentive? There is no incentive for Spotify or Lufthansa to send this. Point one: Lufthansa actually does it; you can try. I'm not sure about Lufthansa in particular, but airlines do it with Gmail. And the very reason is that Gmail gives them a preferred visualization, and it might even strengthen their brand appearance, because they might get a special display.
There is research suggesting that the click rate gets even higher when you have the special presentation. So that's at least one theory; I'm not saying I'm spreading the truth here, just giving you an idea. For Spotify, I was not claiming Spotify itself would send it. What I was saying is: you share it. You are in your web browser, for instance, or within Spotify, and you say "share with", and you go to the email program. And Spotify does have that data on their website, in the metadata. Their incentive there is search engine optimization: they have it because they want to rank very high with Google. And we just piggyback on that data by using it in email. But you mentioned the share-with feature. The share-with feature is controlled by the Spotify client, which is controlled by Spotify. Oh no, it's just a URL anyway, because they want to have that working for WhatsApp; they won't change that. But with the URL, we can actually pull the metadata from the website, like the Google crawler does. So you want to hijack that thing and then put it in? In a way. Which is fair. OK. Thank you.
[StructuredEmail] When is my flight? - Semantic data extraction in KMail and Nextcloud Mail
Okay. So, yeah, we'll continue basically right where Hans-Jörg left off. I'm Volker from KDE, and I'll talk about how we do the semantic extraction in KMail, focusing specifically on the travel use case. Many of you probably traveled here, so you might see why this could be useful. If you book your flight or your train or your hotel, you get the confirmation as an HTML monstrosity full of advertisements and fine print, and somewhere in between is the information that you actually care about. So you need to find that and transfer it into your calendar or your travel app, and if you do that manually, it's tedious and error-prone. So why can't we have that automatically? That's basically the point that got me into this topic. I was on the way home from a conference and needed to find my departure gate, and it was written in light gray on white, in that style. So I did what you would do in that case: you read the email source code, because that's easier to read. And I stumbled upon a nice compact summary of the trip, and that was the schema.org JSON that Hans-Jörg mentioned. So, just showing that in our email client, that should be easy. Six and a half years later, I'm now standing here and still talking about that subject; so things usually go. So, as Hans-Jörg showed us already, it's the schema.org JSON that I think Google proposed 10 or 15 years ago for websites and for HTML email, meanwhile managed by the W3C, so it's a proper open standard. As an ontology that tries to model the complexities of the real world, it has all the fun involved with that, but generally it is sane and something we can work with. Then, however, we got in touch with the harsh realities out there, because there's not just that nice JSON format; there is also, commonly used, a microdata representation that basically embeds that tree of information in the structure of the HTML email.
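The JSON variant of this embedding typically sits in a script element of type application/ld+json inside the HTML body. A minimal extractor, sketched with the standard library (the sample body is invented):

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Pulls <script type="application/ld+json"> payloads out of an
    HTML email body."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.documents.append(json.loads(data))

html_body = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "FlightReservation", "reservationNumber": "RXJ34P"}'
    "</script></head><body>...</body></html>"
)
extractor = JsonLdExtractor()
extractor.feed(html_body)
assert extractor.documents[0]["@type"] == "FlightReservation"
```

Handling the microdata variant means walking itemscope/itemprop attributes across the whole HTML tree instead, which is where HTML parsing really enters the problem space.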
Technically, that's possible and still well-defined, but it basically puts HTML parsing into your problem space, with all the fun that entails. Well, okay, so we implemented that as well. Then we discovered a third variant of encoding that information: syntactically invalid JSON. Comma-less JSON is particularly popular, so we ended up adding workarounds to the JSON parser to deal with all of that. Then we found the actually much bigger problem, and that is semantically incorrect data. I think the most extreme case was Air Berlin. They had the arrival and departure times for flights in the local time zone of the airports, as you would usually do it, but then they added the UTC offset of what is presumably their server location. So if you travel to the US, an eight-hour difference, you probably notice that something is wrong. If you travel from here to Finland, a subtle one-hour difference: super dangerous, you're at risk of missing your flight. Another common problem: there's an address and there's a geo-coordinate, and they mismatch, and not just by a few meters. We have to deal with that as well. Then of course the other big problem: this is by far not as widely used as we would wish. You find it with some airlines; some of the hotel and event booking platforms have it. It's super rare for trains; I think in Europe it's only one train line. In general, on a scale from Silicon Valley startup to 100-plus-year-old European national railway, it's clearly biased towards the former. It seems to be even less common in Asia than in Europe. That isn't really satisfying, but at that point we were hooked and we really wanted those features, so we started to look where else we could get the data from. There's actually a lot of stuff that we can extract data from in such emails. One particularly useful thing are flight and train ticket barcodes, which then moves PDF parsing and image processing into our problem space. It gets worse.
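One kind of workaround for syntactically invalid JSON can be sketched like this: try a strict parse first, and only then apply a repair pass. The function name and the single repair (stray trailing commas) are illustrative; a production extractor needs more workarounds than this, and a naive regex like the one below could even corrupt commas inside string values.

```python
import json
import re

def parse_lenient_json(text):
    """Strict parse first; on failure, strip trailing commas before
    a closing brace or bracket and retry."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        cleaned = re.sub(r",\s*([}\]])", r"\1", text)
        return json.loads(cleaned)

broken = '{"flightNumber": "110", "airline": "SN",}'
assert parse_lenient_json(broken)["flightNumber"] == "110"
```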
That thing is an entire world of its own; I spoke a bit about that last year in the railways and open transport devroom, so I'll skip it here. Another thing commonly found in booking emails are Apple Wallet passes: ZIP files containing JSON. Parts of it are machine-readable, parts of it are visual representation, but at least for location and time, plus the barcode, that's a good starting point. Then of course there is the whole unstructured, human-readable part. For some of that we were able to build generic extractors. Something like an airline boarding pass might look very different from a visual and layout point of view, but they can all be very reliably identified using the barcode. The barcode only contains very basic information, like the day of travel, but not the year or the time, and only the airport codes, but not the gate, and so on. All of the information that is really relevant for you is in that human-readable text somewhere; it's possible to identify that and match it. For everything else we have provider-specific extractor scripts. That's usually a few lines of JavaScript with regular expressions or XPath queries on the HTML. Not pretty, but it gets the job done. With all of those ways of getting data out, we still have the problem that the data quality isn't really at a level that we can work with. In particular, we care about the exact time, including the time zone. By time zone I really mean an IANA time zone ID, not a UTC offset, because if you have a delay over a daylight saving time change, and yes, that does happen, then you really need the exact time zone to know when your new departure time is. The other aspect that is really important is the precise location, as a geo-coordinate. That in turn also helps with determining the time zone, but we want to have features like routing to your departure location or your hotel.
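The IANA-zone-versus-offset point can be made concrete with standard library code. Below, a flight scheduled just before the spring 2024 DST switch in Germany is delayed by twelve hours; only the real zone gives the correct new local time (the flight itself is invented; zoneinfo needs Python 3.9+ and tz data installed):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

berlin = ZoneInfo("Europe/Berlin")
# Scheduled departure: Sat 2024-03-30, 20:00 local (UTC+1, winter time).
scheduled = datetime(2024, 3, 30, 20, 0, tzinfo=berlin)
delay = timedelta(hours=12)

# Duration arithmetic belongs in UTC; converting back with the IANA
# zone accounts for the clocks jumping forward in the night of
# March 31, so twelve elapsed hours land on 09:00 local.
new_local = (scheduled.astimezone(timezone.utc) + delay).astimezone(berlin)
assert (new_local.hour, new_local.minute) == (9, 0)

# Storing only the +01:00 offset yields 08:00: exactly the subtle
# one-hour error that makes you miss a flight.
offset_only = timezone(timedelta(hours=1))
wrong_local = (scheduled.astimezone(timezone.utc) + delay).astimezone(offset_only)
assert wrong_local.hour == 8
```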
And in order to improve on the input data, we use some external data sources like OpenStreetMap or Wikidata to resolve airport or train station identifiers and get the exact location. And we have a few things that apply domain knowledge. For example, if an email refers to a flight from Brussels to Stuttgart and mentions a flight time of about an hour: there are two airports with Brussels in the name. They are both close to each other, or at least both of them are in Belgium, so we know the country and time zone. There are also two airports with the name Stuttgart. One is in southern Germany, the other one is somewhere in the US. But based on the flight time, we know exactly which one is possible, and maybe we have uniquely identified the other airport, and so on. And then in the end, we have some validation and plausibility checks, because there is still incomplete or nonsense data coming through. So if it would require time travel to make that trip, then it's likely wrong somehow. And this is how it looks in the integration. We run the current email through the extractor; if it finds something, it shows a summary and offers to add that to your calendar or to the travel app on your phone. This is in KMail. Originally the extractor started as a library for KMail, but it's also available as a standalone command line tool by now, and that's how we did the integration in Nextcloud. Same thing: we show a summary of what we found and you can add that to your calendar. There used to be a Thunderbird plugin, but Thunderbird changed the integration API and since then that has stalled a bit. There's a lot of demand for it, so it would be nice to resurrect that at some point.
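The flight-time plausibility trick can be sketched as a distance check: estimate a rough duration from the great-circle distance between candidate airports and pick the candidate that best matches what the email says. The cruise speed, overhead constant, and (approximate) coordinates below are my assumptions, not the project's actual heuristics.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Candidate airports named "Stuttgart" (coordinates approximate).
CANDIDATES = {
    "STR": (48.69, 9.22),    # Stuttgart, Germany
    "SGT": (34.60, -91.57),  # Stuttgart, Arkansas, US
}
BRU = (50.90, 4.48)          # Brussels

def pick_airport(origin, candidates, reported_hours):
    """Choose the candidate whose rough flight-time estimate
    (cruise ~750 km/h plus ~0.5 h overhead) best matches the email."""
    def estimate(coord):
        return haversine_km(*origin, *coord) / 750 + 0.5
    return min(candidates,
               key=lambda c: abs(estimate(candidates[c]) - reported_hours))

# The booking email says "flight time: about one hour".
assert pick_airport(BRU, CANDIDATES, reported_hours=1.0) == "STR"
```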
And then there's of course the dedicated travel app, Itinerary, that we built out of all of this, which Hans-Jörg already mentioned, where you get a timeline of your trip, and it fills the gaps with local public transport, looks up the weather forecast, and reminds you to bring a power plug converter if you're traveling to a country where you need that. And that is exactly the kind of high-level semantic feature and workflow we can build if we actually understand what you're dealing with in your emails or your documents. So if you produce any kind of transactional email, you most likely have a machine-readable representation of what it is about, so please add that to the email in some form as well, ideally in the format Hans-Jörg is working on. But as you have seen, we are not particularly picky in extracting; anything that isn't regular expressions on human-readable text would be a big help already. And finally, I haven't mentioned it yet: all of this of course runs on your device. Unlike Google, Apple or TripIt, we don't read your email for this. That, on the other hand, means we don't have as many training samples as they have, so we entirely rely on people donating us travel-related emails in some form; that is one way to help. Yeah, and that's it. Thank you. Thank you. All right, again, we have our number one question to ask today. Do you have any statistics on the signal-to-noise ratio? Essentially, how many times is the information wrong? Did you do any reviews or testing? You say that incorrect information is better than no information, but does it ever get confusing for a user, for example? I mean, we try very, very hard to detect stuff that is not plausible, or to filter out anything that we at least can detect. How much gets through that is not detectable and then confusing?
I don't actually know, because the samples we have either work or are filtered out. But at least we don't get a lot of bug reports along the lines of "I missed my flight because it showed something wrong", and usually it is individual providers, and they are consistently wrong, so we can add workarounds to filter them out and not show anything for them, for example. But there is a risk with providers that we don't know: if they send out something that we can't detect, we might show you a wrong departure time, and that is a problem. But you could, you know, log it somewhere, instead of not showing the possibly wrong information, and then make those statistics. I mean, log it in a way that we get the information. Yeah, because it's not a website; that would go against the whole privacy idea that we have. But if, I don't know, users agreed to send that kind of thing? We don't have a data donation feature built into the app right now. That might be an interesting option, but some people send this to us manually, basically. Yeah. Before I give the mic to Arndt, I might just comment on that, because we already talked to marketing people too, and there is a lot of interest among the email senders, in general, to support this in some way. So I have a strong assumption that if there is such faulty data, there might be ways to incentivize at least the big senders, the big brands, to do it right. So I'm not so concerned about that. Yeah. Asking people to send bug reports is okay, but if you ever get a mail client to send something to you, to log it, you're going to get information about people's sex life. No matter what you try to get, you're going to get that. It just happens, trust me. And then you have GDPR problems because, well, you thought it was the name of an airline, but it actually was the name of a person.
Yeah, I mean, that is one of the motivations why we are so focused on doing this locally and keeping control over it. Because your personal travel is already quite sensitive, but if you combine it with everybody else's, the amount of patterns you see... I mean, all of us traveled to Brussels on the first weekend of February. If that happens once, it could be by chance. But if it happens next year as well, and after two or three times, that is not random; then there is some relation between the people involved, and that allows you to do some scary network analysis. If you're looking for structured data that's already there: there's the OpenTravel Alliance. First it was an XML horror, now it's in JSON. So maybe that can be implemented in the final structure. OpenTravel Alliance? Yeah, I don't know that one yet. No, it's international; everything is in there: the planes, the trains, boats. Okay. Yeah, from the schema.org stuff we support flights, trains, buses, events, restaurant reservations, and ferries and boats. But there's certainly more that can be done. One quick final question. I wanted to remark that anonymization of data fields is possible without being able to trace it back to an individual human being. Because airlines are enumerable, so you can recover them by brute force, whereas user names or people's names are not. And so you could hash everything up the wazoo and still recognize whether or not you should have recognized the field differently than what you've actually rendered in the client in this case. Right. Yeah, but anonymization has turned out to be rather tricky on input data like PDFs, where we also rely on the proper structure. So as soon as you start to modify it, it's not sure that the extractor still detects it in the same way.
And we often don't know what kind of sensitive information is even in there, or what the fields in the back would mean when we start with a new format. Right. So it's very hard to predict what we need to strike out. Sure, yes. But I thought we were talking about the JSON. Once we have the JSON, sure. But the JSON alone is not really enough to fix the extractor. We need the source document in its original form, without modification, to see where it goes wrong in the extraction. So if there is proper JSON in the source, then yes, then the JSON is enough. But if our source is a PDF document attached to the email with the barcode in there, then I need the full thing to debug why the extractor failed. I'm interested, but we'll take this offline, I suppose. Yeah. Right. A short technical question: is Bogo in the room? Ah, right. There he is. Great. All right. So thank you very much for that lively discussion. Thank you, Volker, for the presentation. One round of applause again.
When Prometheus Met OpenTelemetry
So, hello everyone. I'm Pavel. I'm very excited to be here, and I will speak about Prometheus and OpenTelemetry, and especially how we can use the OpenTelemetry project to scrape Prometheus metrics and what the challenges are with this setup. Quickly about myself: I'm Pavel, a software engineer at Red Hat. I mainly work in the distributed tracing space. I'm a contributor and maintainer of the OpenTelemetry operator, the Grafana Tempo operator and the Jaeger project. If you would like to reach out to me, you can do that on Twitter or on the CNCF Slack. So, today I would like to do some introduction into the metrics ecosystem so we better understand what projects we can use, and then talk about the differences between Prometheus and OpenTelemetry from the data model perspective, how they do things. Then we'll talk about what Prometheus components we can find in the OpenTelemetry project, both from the API and SDK perspective and in the collector. The second half will be a live demo. We will deploy a very simple Golang application instrumented with the Prometheus client, and we will gather those metrics with the OpenTelemetry collector. All right, so why are we actually here? We are here because the ecosystem for collecting metrics is fragmented. There are different projects that provide different capabilities. So there is storage, some projects that can store metrics, some projects that only define a protocol for sending metric data, and some projects that can be used only as an API and SDK, something that developers use. Prometheus sits in between, so it provides kind of an end-to-end framework for collecting, sending, storing, visualizing and alerting on metrics. Prometheus is very well adopted, it's very robust, and people know how to use it. On the other hand, there is the OpenTelemetry project, which is kind of new, and for metrics it provides a kind of more limited set of capabilities compared to Prometheus.
People still want to use OpenTelemetry for collecting metrics because they can use it as well for collecting other signals like traces and logs, and it integrates better with third-party vendors, your SaaS observability solutions. So the overlap: there is the API and SDK, Prometheus has clients, OpenTelemetry has an API and SDK, and then there is a protocol. Prometheus has its own metrics protocol and OpenTelemetry has the OTLP protocol. On top of that, in OpenTelemetry there is the collector, which competes with the Prometheus agent. The agent doesn't store metrics, it can just scrape them and send them to Prometheus, not via OTLP, but via Prometheus remote write. What I would like to highlight is that OpenTelemetry as well has the auto-instrumentation libraries, which are not present in Prometheus. I think it's a great innovation in open source because those libraries, as we saw in the previous talk, help you to very quickly instrument your application without any code changes and recompilation. So I think it lowers the bar for adopting telemetry in your organization. So that's the ecosystem. Then we should think about how we can use these systems together, because we want to combine the feature sets that they offer to us. So let's take a look, before we go into the demo, at the differences between Prometheus and OpenTelemetry. First of all, the most obvious one is how the protocol works. Prometheus will pull the metrics from your process, and in OpenTelemetry you have to push the metrics into the collector. It's not that big of a deal. Some protocol might be better for some use cases. So for instance, push might be better if you have short-lived processes and you need to quickly offload the data before the process shuts down. On the other hand, pull works very well in Kubernetes. I don't think that's a blocker when using these two systems together. However, the second point, the data temporality, I think is kind of a big deal.
Prometheus uses cumulative temporality, which means that the last observation contains the previous observations. So if you have a counter in Prometheus, it will contain the sum, the aggregation of all the previous values. In OpenTelemetry, we can use cumulative temporality as well, but we can also use delta temporality, which means that the observations sent over the wire will be just deltas. So if people are coming into this room, it will just send one, one, or maybe two if two people entered at the same time. And Prometheus cannot ingest delta temporality metrics, as far as I know. So that's a problem. The third difference is the histograms, or the exponential histograms. As far as I did the research, I think they are almost compatible. However, in OpenTelemetry the histogram also contains min and max values. So in Prometheus, you can potentially lose some precision of what was observed. The next difference is the resource attributes. In OpenTelemetry, when you collect data, there is a resource object that contains information about the process that is sending the data, which is a pod. It contains the pod label, deployment label, ReplicaSet label, node label, and all those things. In Prometheus, this concept doesn't exist. All the labels usually go on the metric. There is a workaround to put these labels into the target_info metric and then do the join. However, it kind of complicates the user experience, because you need to do an additional join when querying the data. The next difference is float versus int. Prometheus uses floats, and OpenTelemetry can use float and int. I don't think it's a blocker, because with float you can represent all the metrics very well. And the last major difference is the character set that the systems support for metric names and label names. In OpenTelemetry we can use UTF-8; in Prometheus, only a limited set of characters.
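The target_info join mentioned here typically looks something like the following in PromQL; the metric and label names are illustrative, not from the talk:

```
# Attach resource attributes (stored on target_info) to a metric via a join
rate(app_requests_total[5m])
  * on (job, instance) group_left (k8s_namespace_name, k8s_pod_name)
target_info
```

The `group_left` clause copies the named resource labels from target_info onto each matching series, which is exactly the extra query step that complicates the user experience.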
So what happens is that when you are sending OTel labels, they will get corrected to a form that Prometheus can ingest. So if there are dots, they will be substituted with underscores, for instance. So as I said, I was working in the distributed tracing space for a long time, and I started doing some metrics. And when I did this research, I was even wondering if these systems work together, right? Because there are kind of a lot of things that can go wrong. And I think the delta temporality might be the biggest one. So I started looking into how I can solve this problem. And in the OpenTelemetry SDKs, the OTLP exporter that exports OTLP data can be configured to translate delta temporality metrics to cumulative with the environment variable that you can see on the slides. And then as well, you can set it to delta if your metric system supports delta, or to lowmemory, which will use even more deltas. You may as well ask the question why we have two temporalities, right? There is cumulative and delta. And as far as I understand, delta temporality can be more resource efficient when you are instrumenting your process, because the SDK doesn't have to track the cumulative sum, right? It will just quickly send the deltas to the collector or the process that is collecting the data, and doesn't have to do the processing that a cumulative metrics store is doing. Okay. So the temporality, okay, it's a problem. And then the Prometheus exporter in the OpenTelemetry ecosystem will do some delta to cumulative temporality translation for you. However, if you are using the Prometheus exporter in the OTel SDKs, they will most likely drop delta metrics. So that's something to watch for. Okay. So what are the Prometheus components in the OTel ecosystem? In the SDKs, as I mentioned, there is the Prometheus exporter. However, if your metrics are delta temporality, they will most likely be dropped.
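For reference, the environment variable on the slide is the one from the OTel SDK specification; a minimal sketch of setting it before starting an instrumented service:

```shell
# Valid values per the OTel SDK spec: cumulative (default), delta, lowmemory
export OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative
echo "$OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE"  # prints "cumulative"
```

Setting it to `cumulative` is the safe choice for a Prometheus backend, since Prometheus cannot ingest delta temporality.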
As far as I was going through the code and looking at the exporter implementation, maybe it's not the case in every language, but I was looking, I think, at Golang and Java, and that's what I saw. In the OpenTelemetry collector, there are three components. There is the Prometheus receiver that we will see in the demo. Then there is the Prometheus exporter, which will try to handle temporality correctly. And then there is remote write, which will most likely drop your delta temporality metrics. Okay. So let's try what I prepared. It's a very simple hello-world-style application written in Golang, instrumented with the Prometheus client. And then we will have an OpenTelemetry collector with the Prometheus receiver scraping those metrics and exposing the same metrics on the collector's /metrics endpoint through the Prometheus exporter. So we have a receiver and an exporter. In addition to that, we will print the metrics into the standard output of the collector. And we will compare if they're correctly propagated. So let me jump back to my console. I guess it's too small. I'm not sure I can change the color. It's better. Okay. So just for reference, this is the app. It's just a main class. Using the Prometheus client, it defines a gauge for tracking the version, a counter for counting requests, a histogram for tracking the request duration, and some HTTP endpoints. So the app is running. I will just port-forward the endpoint and refresh to make a request. It's a hello world, nothing special. We're going to see the metrics. We get a histogram, counter and gauge, and not many labels. As a next step, we're going to deploy the collector, which is again a very simple setup. We are deploying a deployment. And then we have a Prometheus receiver with a static configuration. So in a collector config, you can have multiple receivers of the same type. So I have two Prometheus receivers. One is called static, one is SD. We're going to use the static one, which will scrape the Prometheus example app service.
And as you can see, this config is very similar to what you see in Prometheus. So you can potentially copy-paste your Prometheus config into the collector config for the Prometheus receiver, and it should work. And the last step, what we're going to do is enable the receiver in the metrics pipeline to make it active. And now I'm going to deploy it. As you can see, the collector is up and running. And I will port-forward again, now to the metrics endpoint of the collector. And we see kind of the same metrics, right? Here it's 18, here it's 19, because Prometheus scraping the endpoint increased the counter. And what has changed are the labels, right? Now I see the instance label, which is the service name, and the job which I defined in the collector config, called app-job. And then, yeah, we see the same metrics: the histogram, the version gauge and the request counter. Okay, as a next step, we're going to make it a bit more automated. We're going to use the Prometheus service discovery in the second receiver. So we need to define the Prometheus SD config. And in this case, we're going to scrape all the pods that have the label that our app is using. Our pod defines this label. So we're going to enable it by just, you know, overriding the name of this receiver. It's the same functionality that Prometheus supports, right? I'm just using it in the OpenTelemetry collector. It should restart. It's up and running. We're going to port-forward. And now, again, the same metrics. What has changed are the labels. The instance is the pod, right? Which makes more sense if we are configuring the service discovery for pods. The job name changed to Kubernetes. This is what we defined. In addition to that, now we get the target_info, which contains the additional labels the receiver discovered. So here I see the namespace, the node name, the pod name. I think it's readable.
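A sketch of what such a collector config might look like, with both the static and the service-discovery receiver. The service name, port, job names and label selector are assumptions, not the exact demo values:

```yaml
receivers:
  prometheus/static:
    config:
      scrape_configs:
        - job_name: app-job
          scrape_interval: 10s
          static_configs:
            - targets: ["prometheus-example-app:8080"]
  prometheus/sd:
    config:
      scrape_configs:
        - job_name: kubernetes
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Keep only pods carrying the app's label
            - source_labels: [__meta_kubernetes_pod_label_app]
              regex: prometheus-example-app
              action: keep
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  debug:
    verbosity: detailed
service:
  pipelines:
    metrics:
      receivers: [prometheus/static]   # swap in prometheus/sd for service discovery
      exporters: [prometheus, debug]
```

The `scrape_configs` section is plain Prometheus configuration, which is what makes the copy-paste between the two systems work.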
And so what I can do right now is write a Prometheus query that will do the join and get all these labels associated with the metric. Or, in the collector, I could write a configuration that will put these labels from the target_info into the metric labels directly, which will simplify the query. However, it will create more time series in Prometheus. Okay. And as the last step, we're going to use a pod monitor for the pod that we deployed. And we're going to use the collector to get this pod monitor, configure the receiver, and scrape the metrics. The way this works in the OpenTelemetry operator is that we have an additional component called the target allocator. When you enable it, it will watch all the pod and service monitors in your cluster. And it can watch a subset of them; it depends on the label selector. It will get the scrape targets and then distribute those targets across the collectors that you deploy. So if you deploy 50 collectors, it will distribute the scrape targets across those 50 collectors so that all the collectors get the same load. How does it work? The operator will deploy the target allocator and the collector, and will change the Prometheus receiver config with the target allocator service name. And then the collector will connect to the target allocator to get its targets. Okay. So we're going to just enable the target allocator. For that, we need to change the deployment mode to statefulset. Enable the target allocator. And now we don't have to do any config in the receiver. We can just leave the scrape config empty, as an empty array. However, we need to define just a single Prometheus receiver, because there is a convention that the operator will find this receiver and change its configuration. Okay. Apply the manifest. And yeah, it's crashing. It's a demo. But it's just waiting for the target allocator to be running, and then it will start properly. Sometimes it just takes some time. Okay.
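A sketch of the OpenTelemetryCollector resource with the target allocator enabled, following the operator's v1alpha1 CRD; field names are from memory, so treat the exact shape as an assumption:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: example
spec:
  mode: statefulset          # target allocator requires statefulset mode
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true          # watch PodMonitor / ServiceMonitor objects
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs: []  # left empty; filled in by the target allocator
    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [prometheus]
```

The operator rewrites the single `prometheus` receiver to point at the target allocator service, which then hands out the discovered scrape targets.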
It's up and running. Now, if I refresh the same metrics endpoint from the collector, I see the labels changed again, because now the instance is again the pod IP. The job name is what the Prometheus receiver uses by default. And then there are labels like namespace and pod directly on the metric. However, the target_info should as well contain the metadata from Kubernetes, like what the pod name is, what the namespace name is, and so on. Okay. So what we saw is that the Prometheus receiver works pretty well. We can use it to scrape Prometheus metrics; there shouldn't be an issue, and it's as well using the Prometheus configuration. So if you are familiar with Prometheus, you can just directly copy-paste the config into the OTel collector. However, what we haven't seen is that if the process is instrumented with the OTel SDK, then the delta temporality metrics will most likely be dropped if you are using the Prometheus receiver. However, if you are using the OTLP exporter from the SDK and we set the temporality correctly to cumulative, then those metrics will be correctly propagated to the collector and then to Prometheus. So be careful with the delta temporality. The OTel SDKs should use cumulative temporality by default, so that shouldn't be an issue. But if you are using something custom, then be careful with those metrics using delta. So, to wrap up: we saw the Prometheus receiver. It essentially contains the Prometheus configuration. However, the dollar signs in the OTel config are substituted with environment variables. So you need to escape them with two dollar signs. That's one difference. In the OpenTelemetry ecosystem, or in the OpenTelemetry collector and operator, there is no support for probe and scrape configs. And in the service and pod monitors in the OTel operator, we don't support TLS. There are limitations. So where do we want to go with Prometheus and OpenTelemetry? Prometheus is planning a 3.0 release. They want to improve the OTLP ingestion endpoint.
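The dollar-sign escaping mentioned in the wrap-up, sketched on a relabel rule; the rule itself is illustrative:

```yaml
# In the collector, "$" in a Prometheus config is treated as an environment
# variable reference, so regex replacements must escape it as "$$".
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_name]
    regex: "(.+)"
    replacement: "$$1"       # would be "$1" in a plain Prometheus config
    target_label: pod
```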
So you can now ingest OTLP metrics into Prometheus, which is great. However, if you are using delta temporality, those metrics will be dropped, and they want to improve the support for it, along with other features. So yeah, feel free to help us build this thing to be more robust. On the OpenTelemetry side, there are kind of two projects where you could contribute to improve Prometheus support. In the collector, there are the Prometheus receiver that we saw, the Prometheus exporter and remote write. There are a lot of issues on GitHub where you can help. And on the operator, we are planning the next CRD version, alpha2. And we want to create a dedicated target allocator CRD that will expose more of the Prometheus config. It's as well something that we are working on, and we are very happy to accept your contributions. Okay, and this is all that I prepared for today. Thank you. Do we have any questions? No questions? Going once? Okay. Thank you once again.
Unifying Observability: The Power of a Common Schema
So, up next, we have Christos and Alex and unifying observability, the power of a common schema. Okay, thanks everyone, and welcome to our talk. In this presentation we will talk about the convergence story of two schemas: OpenTelemetry's and the Elastic Common Schema. But let's first introduce ourselves. My name is Alex. I'm leading the OpenTelemetry initiative at Elastic, and I'm a co-maintainer of the OpenTelemetry semantic conventions project. Hi, I'm Christos. I work at Elastic as well, and I'm a software engineer focusing on observability and specifically OpenTelemetry, where I am a contributor and approver on the semantic conventions project. Okay, we would like to start with a quite easy and simple question. How many of you know exactly what OpenTelemetry is? That's great. I can skip some slides later. How many of you know what semantic conventions are about? That's what I expected. And how many of you know what the Elastic Common Schema is? Okay, thanks everyone. So let's dive a bit into the history of open source tools and standards in observability, to give us a picture of where the standards come from. Let me. Okay. No. Does that work? Okay. Do you hear me? That works well. Okay. Around, or a bit more than, 10 years ago, when microservices emerged, that also changed the observability market and industry. That's when big tech companies started building their own open source tools for collecting observability data. So tools like Zipkin and Jaeger for distributed traces emerged, the ELK stack for logging, Prometheus for metrics. We heard a lot about this in previous talks. And based on these de facto standard tools, actual standards then emerged: OpenTracing, and later OpenCensus, for distributed tracing (OpenCensus also covered metrics), OpenMetrics as a derivative of the Prometheus format, and Elastic has its own ECS that defines the semantics of structured logging data.
Since we will talk a bit more about ECS, a quick introduction to what that is. So ECS stands for the Elastic Common Schema, and it's basically just a definition of a set of fields that describe the semantics of structured logging data. So for example, if you're collecting a service name with your observability data, the Common Schema tells you that you should put this value into a field that is called service.name, not app.name or application.name. So you have common names that you can later on search for, and this also allows you to correlate data across different signals. Now as you can see, we already have at least four standards here that are partially competing, partially complementary. Plus we have all the tools that also create some de facto standards for collecting data. So it's ridiculous to have so many standards, right? We need one more that covers all of them. And usually what happens is we have one more that is competing with all the others. And yes, we have one more standard for observability: OpenTelemetry. We will come back to the comic later again. This is the slide that I can skip based on the poll. So OpenTelemetry provides not just a standard but a full ecosystem and framework for observability: for collecting data, and a protocol for sending it. One thing that I want to highlight here: there is a specification in OpenTelemetry that defines what data you can collect, like traces, metrics, logs. An OpenTelemetry working group is also working on a profiling signal. And what we will talk more about in this presentation is the semantic conventions. Semantic conventions are very similar to what I've shown for ECS, and basically define, yeah, attribute names and their semantics. Let's have a concrete example of how the data structure in OpenTelemetry looks, here with some logging data. It's a very simplified view here; it's a bit more complex. But let's say we have a set of log records, right?
The OpenTelemetry protocol defines the core structure of that signal, with fields like severity text, which is basically the log level, and body, which is basically the log message. In addition, you can collect additional context information with your observability data. This is usually represented in so-called attributes, and that's where semantic conventions come into play. The semantic conventions define which attributes exist, their names, types, and also the semantics behind them. For example, if you're collecting an HTTP access log, right, and you want to capture the HTTP request method, this is the attribute name that you would use for it. Now, observability data is usually also captured in a broader context for some resource, like a concrete service, a host, or other resources. That's why OTLP wraps the actual observability data into a resource wrapper, and a resource again has a set of attributes, so-called resource attributes, that describe the resource: something like the service name, host name, and so on. So this is the structure in OpenTelemetry for collecting observability data, and semantic conventions are basically just about the attributes and their meaning in this data. Now let's come back to our timeline of standards. There's one important thing I didn't mention before. Actually, OpenTelemetry, and we heard this in the previous talk, is the result of a merger between OpenTracing and OpenCensus. OpenTelemetry also supports Prometheus metrics and OpenMetrics, which we have heard about in some of the previous talks, and just last year, Elastic also announced the donation of ECS into OpenTelemetry. So coming back to this, the question is: is it really the case that we have one more competing standard? I would say actually not. With OpenTelemetry we have fewer competing standards, and OpenTelemetry really succeeds in reducing the number of competing standards and becoming the one and single standard for observability.
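A simplified sketch of that nesting, with real semantic-conventions attribute names but invented values:

```yaml
resource:
  attributes:
    service.name: checkout          # resource attributes describe the emitter
    host.name: node-17
scope_logs:
  - log_records:
      - severity_text: INFO         # the log level
        body: "GET /cart 200"       # the log message
        attributes:
          http.request.method: GET  # semantic-conventions attribute
          http.response.status_code: 200
```

The protocol supplies the outer shape (resource, log record, severity, body); the semantic conventions only govern the attribute names and their meaning.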
Now as I said before, Elastic announced the donation of ECS into OpenTelemetry's semantic conventions project. Why? Yeah, because there are great benefits to this. First of all, there are complementary parts and strengths in both schemas that we now merge into one single schema. And second, we grow two different communities by merging them and providing a bigger network effect. So it's a huge win, I think, for the community. But there are not only benefits, there are also challenges, right? First of all, the overlap between the two schemas is a potential for schema conflicts. And resolving these conflicts might mean that we need breaking changes in either the one schema or the other. We have seen the structure of observability data in OpenTelemetry, which consists of the protocol with its nested structure plus the semantic conventions. It's quite different from how ECS defines the fields, because ECS is just a plain definition of fields, without nested structures or so. So there's some difference there, and resolving that is a bit of a challenge. Another interesting thing that we discovered when we started merging ECS is that in OpenTelemetry, before the merger, attributes have many times been defined in a concrete context. For example, we have here an HTTP server span, and the attribute http.route is basically defined under the semantic conventions for HTTP server spans. The problem is now, if I want to use the same attribute in a different context, like, let's say, HTTP access logs, I mean, there was always a means to just reference the other attribute, but it feels sort of weird, because in the one context it is a first-class attribute, right, and in the other one it is just a reference that overrides some semantics.
So, learning from ECS, what we already achieved with the merger is that we now have in OpenTelemetry a dedicated attributes registry that serves the case of just defining attributes, with their types and their meaning, and in the different semantic conventions and their use cases we are just referencing those attributes. So we have a clear separation between defining attributes and using them in a concrete context. And finally, another challenge is metrics. Metric formats in OpenTelemetry follow the TSDB model. So we have a concrete metric name, like system.disk.io in this case, with a type and a unit, and we have a set of dimensions modeled as attributes, in this case direction, for example, for disk IO read or write. In ECS, previously, metrics were basically modeled as numerical fields on documents, and you can have multiple numerical fields in the documents, so you can have multiple metrics. That's the reason why often some of these dimensions that we have in OpenTelemetry are just encoded into the metric name on the ECS side. So we have things like disk read bytes or disk write bytes. This is quite a big difference in modeling. This is a case where we are learning from OpenTelemetry and adopting this at Elastic now as well, with Elasticsearch supporting TSDB. So we see we are learning from both sides, which is a great thing, and we are coming to the best solution possible for the community. And Christos will tell you how this actual merger is happening in practice. Thank you. Can you hear me? Okay. So as Alex mentioned, there are a lot of things going on, so the question is: when is it time to celebrate that the merger and everything has been completed? And the truth is that we are not there yet. There are things that need to be done, and actually everyone believed in the beginning that once the merger was announced, that was all, I mean, there is nothing to add there, but yeah, the truth is that the actual work started right after the merger was announced.
So yeah, let's see some examples of how the merger is happening and how things are moving forward. I have some real examples here from the upstream repository on GitHub, with issues and pull requests. This one, for example, is trying to add some new resource attributes for container images, specifically the digest of the image. As we can see, that PR was filed on the 4th of July, I think, yes, and it took some time to get it in, right? It took us many review cycles, more than 20 blocker comments actually, so lots of back and forth, lots of discussions, but that one was actually merged after almost two months. Another example is about a very important attribute, the IP of the host, host.ip as we call it. And this one was really unique, really interesting actually, because this PR was filed by a non-ECS contributor. That contributor used to work for a company that is, I would say, completely unrelated to the ECS project, but it was quite nice because in that case the existence of the ECS project was taken into account, and there were very interesting conversations, and it took us almost three months to get it in. So yeah, it's quite obvious from these examples that the merger was not something trivial, not something straightforward that can happen from one day to the next by, for example, writing a script that will transfer everything from one project to the other, or something like that. So we have decided to take an approach of moving, let's say, not so fast, paying attention to the detail, and having the proper people work on specific areas, so as to leverage their expertise and be sure that what we are merging upstream, into the final project, which is actually the semantic conventions of OpenTelemetry, will stay there, and everyone will be happy with it in the future. So, these are more or less the areas of the semantic conventions.
We have areas about databases, cloud, containers, Kubernetes, HTTP, system metrics, system resource attributes, and many others. And yeah, so we have started focusing on specific areas. Some examples: there is the effort that we are doing in the system metrics area, where we have a working group focusing on the stability of the area. We are in a really good position now; we are moving towards stability really soon. And the same for the process namespace, the process area, the process resource attributes. And the same for the container area: we are close to achieving 100 percent convergence there, with a recent ongoing PR that will add the final attributes, final metrics, excuse me. Same for the HTTP and network areas: we have good coverage. The HTTP semantic conventions were declared stable really recently, so we are adding on top now, which is quite nice. And yeah, we have work in progress in the databases and mobile areas, cloud, Kubernetes, so we have working groups getting started and focusing on these areas. And over the past months we have been focusing on making the project as good as possible, in a community-driven way. So we, as the ECS contributors donating this project, are not only focusing on the merger itself; we also want to ensure that the semantic conventions project will be there and can serve us in the future. So we are also focusing on other things as well, like improving the tooling of the project and working on the guidelines. This is quite important, because there are many times that the guidelines of the one project are in conflict with the guidelines of the other project. In that case, we need to take a step back, reconsider the guidelines, and see what we want to have there as a final result. And yeah, we also worked on restructuring the project. Before, the semantic conventions within the project were grouped by signal: logs, metrics, traces, and so on. But now we have a better organization there, and we group the attributes by topic. And yeah, as Alex
mentioned already, we have introduced the global attributes registry. It's actually a very big list with all the attributes there, and then, within the actual specification, you can reference the attributes from there. So yeah, that's quite useful. We're also working on adding a new concept from ECS, which is actually attribute nesting, or reusing some namespaces. That means that if you have a namespace, for example os.*, you can nest it, attach it as it is, under the host namespace, for example, and you don't need to redefine it again. So yeah, these are some examples from the upstream. Most of them are closed; some of them are, let's say, really close to being completed, but we have some small blockers there. But the work is moving forward, that's the point. And yeah, how is the community organized around this? As I mentioned before, we want to have the proper people working on specific areas, leveraging their expertise. So we have working groups working on each area, and we're trying to first declare the areas of the semantic conventions as stable, which means that all the semantic conventions we will have there will be stable, and then we can use them in the actual implementations. So the next step is to tune the implementations accordingly, which means essentially the OpenTelemetry collector and the language SDKs. And yeah, some examples: the system metrics working group, the working group around databases, a security semantic conventions working group which is getting started now, and we also have approver areas for the mobile area, containers, Kubernetes, and many others that I don't mention here. And the process looks like this: first, once you want to create a working group or a specific project, you propose the working group area and you mention there what issues you want to work on, and then you will have people expressing their interest to join this effort. You will need to find a sponsor from the technical committee, and yeah, once
everything is decided, we have a specific project board, we have regular meetings, we have people getting assigned to the issues there, and yeah, the work happens like this. And yeah, regarding the merger itself, technically it happens like this, we follow this process. Once we want to either introduce some new fields, some new semantic conventions, or we want to move something from ECS to the semantic conventions of OpenTelemetry, we first check obviously what we have in these two projects, and we also check what the implementations have so far, essentially the OpenTelemetry Collector or the SDKs. Because there are cases where, for example, the Collector already uses some, let's say, metrics, or some semantic conventions, some resource attributes for example, but those are not yet part of the semantic conventions of OpenTelemetry. So in that case we also check what is there, so we might find something interesting that we can use. And once we have everything considered, we have a final proposal, we raise an issue or a pull request directly, and we start the discussion within the community, particularly focusing on minimizing the breaking changes, because you can imagine that we want to avoid bringing frustration to our users on both sides. So yeah, that's a really important thing to consider. And we go through the review process, and then once we have a conclusion, we merge, and then of course we need to handle the breaking changes, because they are there most of the time. And yeah, the summary for today is that the merger is happening. Feel free to join us, contributors are more than welcome, everything happens upstream, so if you are interested please join, and you will find that you will have real impact from day one there. And the goal of everyone is to make the semantic conventions of OpenTelemetry the one unique and straightforward standard for observability and security that will be there for the future. So
yeah, with that, you can find us on the CNCF Slack channels or by using our GitHub handles, and some project meetings: on Mondays we have the semantic conventions working group meeting, the same hour the next day, Tuesdays, we have the specification SIG meeting, and on Thursdays we have the system metrics working group, 5:30 Central time. And yeah, with that, I think we're out of time. Do we have any questions? Hi, thank you for the talk, this was really interesting and clarified some things for me. I have one question about what are the benefits of these semantic conventions in terms of the front-end tooling that we are using. Because I know there's this idea in the OpenTelemetry project that you have semantic conventions and you have common attributes for different signals, and then we collect all this data, in all these different signals, in some observability tools, and I imagine in the front-end we could automatically correlate different signals if we have these common attributes. I'm not up to date with the current state of this area, so yeah, this is my question: what are the main benefits of following these semantic conventions? Yeah, I would say there are two, actually. One is, I mean, OpenTelemetry is an open source standard, right, and there are many vendors adopting this, so we need common semantics of what the data represents to build features, higher-level features, on top. This is the first thing. And the other one is correlation, as you already mentioned, across different signals, to also have correlation across or through the resource attributes for example, so you can drill down basically on different signals into the same resource. And yeah, I would say these two things, and also cross-signal correlation not only through resources but through things like the trace ID, to have them, you know, both on logs and traces and later maybe in profiling data, this kind of thing. Okay, thank you. So, are you doing something like that in Elastic, like in the front-end at the moment? Is
there any work going on in this area, like correlation of different signals? Yeah, of course, I think that's the goal for every observability vendor, to bring all these different signals together. Okay, great, thank you very much. Any other questions? Going once... okay, cool.
Linux load average and other silly metrics
We'll see something very basic, the load average, the thing that you have at the top of top when you look at the performance of your server. Very basic, but with a lot of misunderstanding, and the goal is really to understand if it's useful or not, and at least how it works. I usually do that as a live demo, but I'm not sure about the Wi-Fi. I think I've lost the connection, but I have some recordings. Basically, what we will do, we will look at what we have in top. So this is not moving because I lost the connection, but we will see it later on the recordings. You can start to think about it. I have run something that you can see in the processes there. I have two CPUs. I have a load average of 32 for a long time. I don't know if you care, but I have 99% of wait I/O. Basically, my question to you is: do I have a problem or not? Am I bound on a resource or not? If I'm bound on a resource, am I bound on CPU, or I/O, or memory, or whatever? This simple question, I see a lot of people who cannot really explain it. The goal of the presentation will be to tell you that you can mostly ignore the numbers that are at the top of top, because those are about the system, the processors. What you care about for your application performance is more the tasks that are running, and this is probably more useful. Going back to the slides where I have the recordings of all the demos, so we will not try to reconnect to the Wi-Fi. Also, on that screenshot of what we have seen: people using the cloud, cloud providers like to provide nice graphs about performance, and usually they put first the load average, the CPU usage. Typically, I have two processors, I have a load average of 30, and my CPU is doing nothing. Memory is at 100%. What do they want to tell us with that? Because most systems will have memory usage at 100%, and that's probably cool. We will look at that in the next 20 minutes. First, this is the recording of what I wanted to show you. That was exactly the same thing running.
You see the load average, the number of CPUs, the wait I/O, there. What do you think about it? Who thinks I'm bound on CPU? Who thinks I'm bound on I/O? Who thinks I'm bound on I/O because I have a wait I/O? Fewer people. That's already good. Here, we see a high wait I/O, but maybe I can advance in the recording. What I show in this case, when people think that I have a problem with I/O, is just to run something else. Let me check where it is in the recording. If I have the wrong recording, I will just explain what I usually show. Sorry, maybe it's in the next recording. What we see is the load average, high wait I/O, but the most important, what I really care about, is this: the state of the tasks. Who thinks I am bound on I/O because of the D state? For me, this D state gives me a clue that most of my processes are waiting on I/O. Probably. We will see that it's not such an exact science, but that's something that can give some clues. I'm lost in my slides. This is the next one. I'm running yes. You know the yes command? It displays yes. I'm still running the same I/O there, the same throughput. I'm doing exactly the same, and my wait I/O has decreased. This is how to solve wait I/O: just run something else. I show that to explain that this wait I/O is not about what your tasks are doing. It's about the CPUs. When you do I/O, you don't need the CPU, so you wait. If no one else wants to do something on the CPU, then the CPU state just remembers that, okay, I'm idle because someone is doing some I/O. Now I'm running something else that uses this CPU. This CPU is not idle. This wait I/O just means idle, and idle because the last one did some I/O. The only information I have from wait I/O is that the CPU could be used for something more useful than waiting, but it doesn't really give me the information that I have a lot of I/O, because depending on the other workload, it will sit there or not. The state doesn't lie if my processes are all in the D state.
At least they are not in the R state, the runnable state, so they are not using CPU. In the next one, what I do to understand better the kind of I/O I'm doing, the kind of system call that puts this D state, is I just run strace on my processes, and I just did strace -c to count them, and you see that most of the system calls are pwrites. That's actually what I'm running there. I'm doing writes with the pwrite system call, with direct I/O. That's basically what I have there. If I want to understand really what is behind a state that is not the R state, the runnable state, I can trace the system calls to know exactly why. I will explain why I'm looking at that, because even if D looks like disk, you can do some I/O that is not in D state, and you can have D state that has nothing to do with I/O. So it can be misleading. The D state is for uninterruptible calls. So your process has something to do that is not on CPU, and does it in an uninterruptible state. Depending on the system call, it can do it uninterruptible or not. Often I/O like pwrite uses this, but there are some other kinds of I/O. Any questions so far? Any remarks? Okay. So, next one. I will run something else, if I remember exactly what I'm doing here. I will run fio. The difference is that I'm not calling the pwrite system call. I'm calling libaio, the asynchronous I/O library. Basically I'm doing the same, writing to the disk with direct I/O, and you can see the throughput is mostly the same. However, I'm not in D state anymore. So there are some I/Os which put the D state, but there are some I/Os which just put the sleep state, which is not uninterruptible. So, very misleading when you see those things and try to guess what happens. If you strace, there is no guessing. You know exactly the system call. And I think this is what I do just after. If I strace, I see that most of the I/O calls here are io_getevents, and there is some io_submit. This is how asynchronous I/O works.
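The task states being read off top above come from the kernel's per-process stat file. As a small sketch (the sample line is made up; on a real Linux system you would read `/proc/<pid>/stat` itself), the state letter is the third field, after the parenthesised command name:

```python
# Minimal sketch of reading a task state the way top does: the field
# after the parenthesised command name in /proc/<pid>/stat is the state
# letter: R (runnable), S (sleeping), D (uninterruptible), T (traced)...
# The sample line below is fabricated for illustration.

def task_state(stat_line: str) -> str:
    # The command name may contain spaces and parentheses, so split on
    # the *last* closing parenthesis before taking whitespace fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    return rest[0]

sample = "1234 (fio rand-write) D 1 1234 1234 0 -1 4194560"
print(task_state(sample))  # D -> uninterruptible, often (but not always) I/O
```

Splitting on the last `)` matters because a process name like `(sd-pam)` would otherwise break a naive whitespace split.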
pwrite just asks the kernel: I want these blocks, and it waits to get those blocks. With asynchronous I/O, it tells the kernel: I will need those blocks, so that's the submit, and then, can I work on something else, and come back and say: oh, do you have my I/O? If not, I will wait. The submit goes into the D state, but it's very short because it's just a submit. The io_getevents, if it waits, goes into the sleep state, the S state, and not the D state. Depending on the kind of I/O, you will see the D state or not. And the wait I/O there depends on the state, but more important, I don't know if I can go back. Well, I'm sure I can go back if I replay it. I guess that the load average was lower when I was running that, because the D state counts in the load average and the S state doesn't. It means that some I/O counts in the load average and some I/O doesn't. It means that with load average, you don't really know what happens. Okay. The next one, I'm running something else. So those were direct writes, bypassing the buffer cache, and here I'm running reads, and now I set direct=0 in fio. fio just simulates different kinds of I/O. Typically I work with databases. I'm a developer advocate for YugabyteDB, which is a distributed SQL database compatible with Postgres. I've been working also a lot with Oracle. They do those kinds of I/O: Postgres does not do direct I/O, it goes through the buffer cache; with Oracle, you have the choice. So it really depends. Here, what I would like to show you, I don't see it from here, but I'm probably in the running state. Yeah, it was not sorted. But here, I'm mostly reading from memory, from the cache, from buffers. And this is why you see that it's much faster. And a difference: I'm using more CPU there. You access memory more than you access the disks. And then this is in CPU usage, the kernel part of the read. I mean, my application is doing the same, just an I/O call, so the user-space CPU is still low. But on the system side, in the kernel, what Linux does is read from memory.
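The synchronous positioned-I/O pattern described above (the caller blocks in the system call until the data is there) can be shown with Python's thin wrappers over the same syscalls. This sketch uses plain buffered files, not O_DIRECT, so it goes through the page cache:

```python
import os
import tempfile

# Sketch of the synchronous pwrite/pread pattern traced with strace in
# the demo: each call blocks until the kernel has done the I/O (which is
# where a task can sit in D state for direct disk I/O). No O_DIRECT
# here, so these writes land in the page cache.

fd, path = tempfile.mkstemp()
try:
    os.pwrite(fd, b"hello", 0)   # write 5 bytes at offset 0
    os.pwrite(fd, b"world", 5)   # write 5 bytes at offset 5
    data = os.pread(fd, 10, 0)   # read both back in one positioned read
finally:
    os.close(fd)
    os.unlink(path)

print(data)  # b'helloworld'
```

With libaio the same work would be split into an io_submit (short, uninterruptible) and an io_getevents that sleeps in S state, which is exactly why the two fio runs look so different in top.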
And this is where you have some system CPU there. That counts in the load average also. In the meantime, I did strace to see the reads there. So I have preads, the same system call. What is different is what is behind: it reads from the buffer cache. And I don't know if you have seen it: when I was attaching with strace, the state here was T. That's the state when you attach. And of course, it has a little overhead. You do that to troubleshoot. The important thing is the runnable state. It's saying that either I'm running on CPU or I want to run on CPU, and I don't know which one from those metrics. That's the point. I have only two CPUs, so I know that I cannot have more than two tasks running on CPU. They are runnable. They are waiting in the run queue to be able to run on the CPU. Top will not show the difference. Load average will add those waiting and those running. If you want to see the difference, you need to look at the statistics from the scheduler, in the /proc scheduler statistics, or vmstat, which shows you the run queue. I'm saying that because I've seen a lot of people comparing the load average with the number of CPUs. Like, if the load average is higher than the number of CPUs, I have a problem. Maybe not, because if the load average is due to I/O, you don't really care about comparing it with the CPUs. And if the load average is high because you have a lot of processes in the run queue, then probably you have a problem, because you have tasks that need to run something on the CPU and just cannot, and are waiting behind. So we have seen different kinds of I/O, and they look different. Many times, especially on databases, I've seen different teams: the Linux team looking at the system and the DBA team looking at the database. And in many companies, they don't really talk together. So one is guessing what the other is doing, and there is a lot of misinterpretation in all that.
It's very important, if you look at the numbers from the system, to understand what the database is doing. And also it's very important for the database administrator to look at the system, because many things in the database metrics will be different if the system is overloaded. I'll give a quick example: in Oracle, you have wait events where you can know exactly how much time you spend on I/O. But it's not exactly how much time you spend on I/O: it's the time between the timestamp taken before the I/O and the one taken after the I/O. If your process is in the run queue, the database thinks that it is doing I/O, but maybe the I/O is done and it's just waiting to go back onto the CPU, just to set the counter from the timestamp. So that's also the message. I say that to database administrators, but it applies to applications too: if you run on a system that is overloaded on CPU, then probably all of their metrics, because they require CPU cycles to get the numbers, are probably wrong. So why did I call these silly metrics? I didn't come up with this. If you want to understand what load average measures, Linux is open source, so just look at the source of it. And you can look at the source, but more interesting are the comments, which can explain the intention of the function. And in Linux, the load average is defined in this file, and the comment says the source for load average contains the magic bits required to compute the global load average figure. It is a silly number, but people think it is important. So you see why you see that first in top? It is silly, but some people think it is important, so let's give them something. And we go through great pains to make it work on big machines and tickless kernels. So the load average idea comes from Unix systems, where it was really measuring the load on CPU, and where it was easier to measure, because you just counted the ticks in the scheduler. Linux works differently, which means that it is difficult to measure, and maybe it doesn't make much sense.
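As a toy model of what that loadavg code computes: roughly every five seconds the kernel blends the current count of active tasks (runnable plus uninterruptible D state) into an exponentially decayed moving average. This is an illustrative model of the documented formula, not kernel code:

```python
import math

# Toy model of the kernel's 1-minute load average: every ~5 seconds the
# current number of "active" tasks (runnable + D state) is blended into
# an exponentially weighted moving average. Constants follow the
# documented formula; this is a sketch, not the fixed-point kernel code.

TICK = 5.0                          # sampling period, seconds
EXP_1MIN = math.exp(-TICK / 60.0)   # decay factor for the 1-minute average

def next_load(load: float, active_tasks: int) -> float:
    """One 5-second update of the 1-minute load average."""
    return load * EXP_1MIN + active_tasks * (1.0 - EXP_1MIN)

# A single 5-second burst of 32 D-state writers barely moves the average:
load = next_load(0.0, 32)
print(round(load, 2))  # 2.56
```

This is also why a five-second spike of activity never shows up in the number: the averaging smears it out, and why D-state I/O and runnable CPU work are indistinguishable once they're folded in.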
So yeah, it's good to know why this metric is there: just because people coming from Unix were used to having this single graph showing the load, and comparing that with the application and what is done in the application. But if you don't look at the state of the processes, then it can be misleading. It's easy to understand exactly why we see this state, these I/O calls, in the load average: it's just the way it is calculated. There are two things that are interesting in the way it is calculated. First, it is an average, and that's also a problem. If you look at the load average, you will not see a peak of activity of five seconds, because it is averaged. The other thing is that it counts the number of active tasks, so the running state, which is really more the runnable state, because if you are in the run queue, you are not really running, and it adds the uninterruptible calls, just because they thought: if we show only the CPU load, is it really the load of the machine? For example, you run a database doing a lot of I/O. Then we would say that the load is low if everyone is waiting on the disk. So let's add the uninterruptible ones, because in many cases, as we have seen, those I/O calls are uninterruptible calls. But they are not always, so it can be quite misleading. It doesn't mean that you don't have to look at it, but if you look at it and know what is behind, then it can give you some clues, like the clue about I/O, looking at other things. But more interesting is the process state. A process can have something to run on the CPU, and then you look at the scheduler statistics to know if it is waiting for the CPU or there is CPU available, and when it has some calls to do, they can be done in D state or S state, and they will be accounted differently by the load average. Any questions so far? Okay, the next one is more about memory, just because it's another thing that is misleading in some cases.
I think it is quite clear in top that you can look at the available memory, but I see cloud providers showing the used memory or the free memory, and here I just want to explain, for those who don't know: if you do buffered I/O, like I did with direct=0... Okay, I thought we had five minutes now. Okay, perfect. So I will finish quickly on that. Do not look at the free memory. I'm just showing that if I do some I/O, it will take some free memory, but that memory is easily freed if needed. Look at the available memory. That's the memory that is available to your processes. But also think about what available means: you can use it, but if you use it, then another process doing buffered I/O may not find its data in the cache. So if it is available, it doesn't mean that it's free from any impact on the others. Okay, I'll just put the last one up while I'm talking and taking questions. The idea there was just to show a really silly program doing vfork, which has nothing to do with data, just to show that it will go into the D state and it will increase the load average. And that's a case I've seen on some systems, where the load average was in the thousands, on a database having its files on NFS, with network issues, and then those uninterruptible calls increased the load average, but without any consequence, because they were doing nothing. The only thing is that it's ugly when you look at the load average, and the other thing is that they are uninterruptible: you cannot kill them. So you want to restart the system to have nicer numbers, but of course you wait for it. So just be careful: load average accounts for some I/O and accounts for some CPU, and you have some I/O that you do not see there. Okay, do you have any questions, remarks? Thank you. What about pressure stall information? Very good question. If you have seen, in the first screenshot I was running pressure stall information, which in my opinion gives a better picture.
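The free-versus-available distinction described above can be sketched by parsing a /proc/meminfo-style snapshot. The numbers below are fabricated to illustrate the typical shape: page cache makes MemFree tiny while MemAvailable stays large.

```python
# Sketch: parse a /proc/meminfo-style snapshot (a sample string, not a
# live read) to show why MemFree misleads: memory "used" by the page
# cache keeps MemFree near zero while MemAvailable stays large.

sample = """\
MemTotal:       16384000 kB
MemFree:          204800 kB
MemAvailable:   12288000 kB
Buffers:          512000 kB
Cached:         11468800 kB
"""

meminfo = {}
for line in sample.splitlines():
    key, value = line.split(":")
    meminfo[key] = int(value.split()[0])  # values are in kB

print(f"free:      {meminfo['MemFree'] / meminfo['MemTotal']:.0%}")
print(f"available: {meminfo['MemAvailable'] / meminfo['MemTotal']:.0%}")
```

A dashboard plotting MemFree here would scream "out of memory" while the machine actually has three quarters of its RAM reclaimable on demand.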
The pressure stall information is a counter telling you, during the last 10 seconds for example, not how many, but whether there were some processes under pressure: pressure to run on CPU, to get I/O, or to get some memory. So it really gives you an idea about the pressure itself. The only thing about pressure stall information is that in most of the kernels, the distributions I've seen, it is compiled into the kernel but not enabled by default. And because it's not enabled by default, I've not seen it a lot. And I think it's a good idea: each time I used pressure stall information, it was giving me the right idea, but that's just a subset of the systems I've seen, because it's not the default. And then maybe there are some cases that I don't know of where it's not perfect, but I try to encourage people to enable pressure stall information, where instead of looking at all that, you just see that you have some processes that could be faster if they were not under pressure on RAM, I/O, or CPU. Okay, I think we are just... Another question? If it's okay? So, looking at a very generic use case, if you were to redesign the cloud providers' graphs, would you change them? What would you change them to? Could you list maybe the five most important metrics, for a generic use case, that you would put on a dashboard? On a dashboard, I think pressure stall information can be really nice, because you can show that to users. Users running on the cloud, for example, want to know if they are under pressure on CPU or on I/O, because they pay for that. So I would put those. Load average, maybe, with a clear description that it is CPU plus some I/O. And memory: available memory, not used memory, because a system doing some I/O, some buffered I/O, will always use all the memory in Linux. Maybe we have...
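The PSI files discussed in that last answer have a simple line format. As a sketch (parsing a sample string rather than reading `/proc/pressure/cpu` live, since as noted PSI is often not enabled by default): the `some` line means at least one task was stalled on the resource, and the `avgN` fields are stall percentages over the last N seconds.

```python
# Sketch of parsing one line of /proc/pressure/cpu. The sample string is
# made up; on a real system the file exists only when PSI is enabled.
# "some" = at least one task stalled; avg10/avg60/avg300 are percentages
# of time stalled over the last 10/60/300 seconds.

sample = "some avg10=12.50 avg60=3.21 avg300=0.88 total=123456789"

kind, *fields = sample.split()
psi = {k: float(v) for k, v in (f.split("=") for f in fields)}

print(kind, psi["avg10"])  # pressure over the last 10 seconds, in percent
```

Unlike load average, a non-zero `avg10` here directly answers the question the whole talk circles around: were tasks actually waiting for this resource?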
Implementing distributed traces with eBPF
Thank you so much. My name is Nikola Grcevski and I'm here with my colleague Mario Macias. I think I pronounced your name right. Yeah, you pronounced it very well. We work on an open source project at Grafana called Grafana Beyla; we're both software engineers. We didn't practice this presentation much because we live on two different continents, so you get what you get. It's always not too bad, but yeah, we'll give it a shot. Let's go. So we will first do a very quick introduction to what distributed tracing is. I know most of you already know, but just to try to get a common mindset, even for people that are new to observability or to distributed tracing. Then we will explain a bit how it is implemented, and how we implement it in Grafana Beyla using eBPF. So if you want to instrument a server, you might add an instrumentation library, like for example the OpenTelemetry SDK, and insert some instrumentation points in your server to get, on each request, a span containing data like the start and the end, or some extra information about the request, like the client ID, the path of an HTTP request, the response, etc. Then you can send that to an OpenTelemetry collector and visualize it. If we have a distributed service, in which one service calls another, gets responses and so on, you could still do the same: instrument each point and then send the spans to an OpenTelemetry collector, for example. But while the spans themselves give information, separately they may lack a lot of context. So if you get just a bunch of frontend, database, backend spans separately, it will not be as useful as, for example, getting for each span which request invoked that other request, so you can see everything in context. This is what we name distributed tracing, or context propagation. In OpenTelemetry, concretely, we use the W3C standard that is using a traceparent header in the request.
So you can insert into your request, you can insert headers with the trace ID and the parent span ID, and then the services getting these, or receiving those invocations, can read this traceparent and add it to their own requests. That way you can always track the context. This is not any real SDK, any real language, it's just an example of how you could do it. You have a service, and on each request you can read this traceparent, create your span, the part of the trace, and when you have to call other services you will add this traceparent in the headers, and then in the span. This can be done manually in code, via an SDK, or this can be injected by your instrumentation or SDK agent, like the OpenTelemetry Java or OpenTelemetry .NET agents. Beyla follows a similar approach, especially for those services that are written in a language that is not so easy to instrument via an external agent. I'm thinking of, for example, compiled languages like Go, Rust and C. In that case, Grafana Beyla can be deployed on the host, the same host as the services you want to instrument, and it will use the eBPF technology, we will talk a bit about it later, to hook into and inspect the runtimes and libraries of your application, or the functions of your application, as well as some points of the Linux kernel, then compose metrics and traces and forward them to your OpenTelemetry collector. What is eBPF? I mentioned it before. It's a just-in-time virtual machine that is shipped inside the Linux kernel. This allows you to efficiently hook programs to multiple events in the kernel, libraries and user-space programs. For example, Beyla can hook into every time an HTTP request is received in the instrumented application. Beyla can immediately execute a piece of code, a probe, and then inspect and even modify the memory, the runtime memory, of your process, or even the kernel.
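The manual traceparent handling described above (read the incoming header if present, otherwise start a new trace, then inject a fresh span ID into outgoing calls) can be sketched like this. It follows the W3C header shape `version-traceid-parentspanid-flags`; the helper names are illustrative, not any real SDK's API.

```python
import secrets

# Sketch of W3C trace-context propagation: parse the incoming
# `traceparent` header, keep the trace ID, and mint a new span ID for
# the outgoing hop. Helper names are made up for illustration.

def parse_traceparent(header: str):
    version, trace_id, span_id, flags = header.split("-")
    return trace_id, span_id

def make_traceparent(trace_id=None) -> str:
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)                # new span for this hop
    return f"00-{trace_id}-{span_id}-01"          # version 00, sampled flag 01

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
trace_id, parent_span = parse_traceparent(incoming)
outgoing = make_traceparent(trace_id)  # same trace, new parent span ID
print(outgoing)
```

This is exactly the logic Beyla performs in eBPF instead of in your application code: same header, injected from outside the process.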
This way it is able to know when a service request starts and ends, and even inspect some arguments of it. Beyla has two ways to provide span information. One is to inspect at the language level. At the language level, we currently only support Go, and it hooks uprobes into the Go runtime and the Go libraries to inspect them. To support other languages, that is compiled languages but also Python, Ruby or other interpreted languages, it hooks kprobes into several kernel functions and libraries, to know when connections are started, to read the arguments of the requests and the responses, and so on. We are able to do that in Go. We are currently inspecting HTTP, HTTPS, GRPC, HTTP/2 and soon SQL. At the kernel level, at the moment, we are inspecting HTTP and HTTPS, but other protocols will come at some point. That was about how to provide the spans, but Nikola will talk about how the context is propagated with Beyla. I think you can hear me here. You can hear me, right? Yeah, this is working. We showed a previous example where we had this done by manually adding that logic in the program: reading the trace information coming in on a request and then sending that over on outgoing requests, which is effectively what most of the OpenTelemetry SDK instrumentations do, or the agents in Java or .NET, they do that injection for you automatically. But we do it with eBPF, so you don't have to have an SDK added to your packages, for languages where that doesn't exist, or languages where maybe your library dependencies don't quite work with the SDK because of different versions, or it's not up to date, or whatever the reason. We hook into the program, like Mario mentioned, in different ways, and when a request starts we actually read the memory with eBPF and see what is in that traceparent. If there isn't one, we'll generate one according to the W3C standard.
Then what we do next is that we notice an outgoing call, and in that outgoing call, if we can find the information about the headers, we will inject the outgoing trace header just like the SDK would do. This is what happens in Go currently with Beyla. This is exactly what we do. Now internally, how does this all work? Well, we have to make sure that we can tag an incoming request on a server, say it accepted something like /ping for example, and it did an outgoing request to /pingme too, and in that case we need to track that this incoming request matches this outgoing request, because the call may be async. Maybe somebody wrote a library and said, well, I don't want to wait for this request, I just want to do it async, for whatever the reason, I'm using some reactive library. In that case, for Go, we track the goroutine creation and termination, essentially. Because the Go runtime and the standard libraries are very standardized and everybody uses them, we're able to do this kind of stuff. The context doesn't need to be the first argument, none of that stuff. We just track goroutine creation, and we're able to match it later on. That's how we propagate the context. Now for the other languages, we thought, well, how are we going to do that for other languages? People use any number of libraries. How do you do this for compiled languages, or just-in-time compiled languages? It's kind of hard. For that we wrote additional support that does something more sneaky, or if you will, something more interesting. When two servers, or two processes, talk to each other over HTTP, for example, there is a unique pair of information that identifies every connection. I have a client, it opens a remote connection to a server. It has a source port, which is typically ephemeral, and I have a destination port, which is the server port. When we see that connection pair, we use it as a unique key and we store it in an eBPF map.
Then when the server on the other side gets that request, we look up that map and say, well, I have this connection pair, does that match any client that made this connection? It does require that one single Beyla monitors both processes. If that is true, then we can actually tag these requests between servers without actually using this traceparent propagation. For languages where we haven't written additional support to inject the header information, we use this as a backup option. This context propagation correlates requests internally, through the kernel. Here's an example. We start the client call. It may read the traceparent information that was present from a previous call, but if there isn't one, it's just going to generate it right there in eBPF and then store that information. Then later on, when another server request happens based on the client call, we'll read that map, read the traceparent information, and create the spans, just as if that traceparent logic had flowed through the HTTP headers. More or less the same. There are restrictions, of course. Obviously, for this to actually work, we have to have a single node. Now, these eBPF maps can be shared on a volume, and maybe there's a way to use that, but we don't do that and don't support that right now. This is also not released yet, so we just have it in the main branch. It's one of the newer things we added. But with this... I think I'm more of an "I'll believe it when I see it" person, so I think we want to try to do a demo. Everything's running off the laptop that Mario has here. We're not going to connect to any cloud services, but what we want to demonstrate is a few HTTP services here. And GRPC also. They're using GRPC in this case. They're written in Go. We're going to have one Beyla instance look at all of them.
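The connection-pair fallback described above can be modelled in a few lines. A plain dict stands in for the eBPF map here, and the function names are illustrative, not Beyla's internals: the client side records trace context keyed by the TCP 4-tuple, and the server side, seeing the same 4-tuple, looks it up.

```python
# Toy model of context propagation by connection pair: the client probe
# stores trace context keyed by (src_ip, src_port, dst_ip, dst_port);
# the server probe, observing the same connection, looks it up. A dict
# stands in for the shared eBPF map, so this only works when one
# observer sees both sides (the single-node restriction from the talk).

conn_map = {}

def client_connect(src, sport, dst, dport, trace_id):
    conn_map[(src, sport, dst, dport)] = trace_id  # client records context

def server_accept(peer, pport, local, lport):
    # The server sees the same 4-tuple for the accepted connection.
    return conn_map.get((peer, pport, local, lport))

client_connect("10.0.0.1", 54321, "10.0.0.2", 8080, "trace-abc")
print(server_accept("10.0.0.1", 54321, "10.0.0.2", 8080))  # trace-abc
```

No header is modified at all, which is what makes this usable for protocols and languages where injecting a traceparent safely is hard.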
We're going to use this little tool that Fabian actually made, this little Docker otel-lgtm image, which has the full Grafana stack with all our open source products, with the OpenTelemetry collector set up so that it can ingest and do traces, metrics, and everything you need. Very convenient for testing. Very convenient for testing, or for spinning up your own Grafana cluster at home. So it's just one Dockerfile with all of it. I also wanted to mention, because we didn't say it: it's obvious the presentation is about distributed traces, but Beyla does support metrics too. HTTP metrics were included from the start of the product. Distributed traces are some of the newer stuff we're working on. Okay, so for this demo, we will show a simple distributed application. It is a synthetic application: just a frontend sending a request to a backend, and the backend distributing some load to the workers and then getting a response. Do you need to hold that? No, it's okay. It's okay. Thank you. Then I have added everything into a Docker Compose file, just to facilitate the demo on my laptop. So we have this OpenTelemetry collector, which is the otel-lgtm container that Fabian did. And we just drop Beyla in as a container. You can run Beyla there as a host process, but for convenience it's also a container. We need to give it access to the PID namespace of the host, because it will have to instrument all the processes on that host, and also privileged access, because loading eBPF programs requires administrative privileges. Then we set here the OpenTelemetry endpoint in a standard configuration. Beyla accepts the standard OpenTelemetry configuration for setting up many values. And we are also providing a configuration file. Basically here we say how to group the HTTP routes. For example, there is a route that calculates a factorial, and in the request you will pass factorial and the number to calculate.
We don't want a cardinality explosion, so we don't create a different route value for every number we calculate. We say: okay, just take all the URLs matching this pattern and group them under factorial/{number}. And then we tell Beyla how to discover the services to instrument. We have a frontend, a backend, and a worker container, and we pass those names. This accepts any regular expression, so if we just say a dot, it will instrument, or try to instrument, all the processes on the host. But in that case it would also instrument some parts of the Docker API and the Docker Compose API, so to not generate noise, we just provide the services we want to instrument. Let me then run this Docker Compose file. Okay, this application is a very simple application: a huge-factorial calculator. I just write a number, and it calculates the factorial. And if you need bigger numbers, okay, you calculate. Boom! This is an error introduced on purpose, because I also use this application to track errors with Beyla. But it usually works. Meanwhile, Beyla was already running, so we have been generating some traces. Let me go to the local Grafana. Let's see. I go to, for example, Explore. Here I select Tempo, and let me search for all the traces. Okay, beautiful. It's strange, because here we can see that Beyla... Oh, yeah? Okay, let me check. No data. Okay, it happens in the best families. No, but we have this... I mean, it is able to... Okay, I don't know what happened. But... for sure it's a bug in Grafana. So I have here many, many requests. Or many traces. Let me just open this submit trace, which is the one that triggers the backend and the workers. If we enter here, you will see the trace information: how the frontend invokes the backend. You can also track the internal status of the request, like how much time the request is in queue or being processed.
And you can see how, for example, the backend might invoke the worker multiple times. So we got distributed traces automatically. We can even see the node graph of all the requests — how this process invokes the others, the relation of all the traces as a graph: how the frontend acts as a server (because we instrument both server-side and client-side spans), how the frontend invokes the backend, the backend invokes different workers, and so on. I just want to add something here. If you look at the spans Beyla produces, we produce these two spans for some of the server requests: "in queue" and "processing". And for most people that's like: what are these two things? Why are you tracking two times? Well, if you have a typical application server written in Go, you accept the request, and as soon as that happens, Go will launch a goroutine for it. But how long before this goroutine gets scheduled on a physical thread — which is an M in the world of Go — and how long before this physical thread actually gets CPU time? With traditional instrumentation, you instrument the handler of the server request, so what you measure starts at the time the handler started running, not the time the runtime accepted the request coming in from the kernel. With eBPF, because we're at a low level, we can actually track that time: we can see when the request actually came in from the kernel, when the goroutine was launched, and when the handler finally got to run. So in a situation where you have a server which is overloaded and not able to serve the requests, you'll get the actual request time, much closer to what the client sees on the other end, rather than the "fake" time the application server would normally see. Okay, so that was the demo. Let's summarize: using eBPF, you can capture distributed traces, as Nikola explained, with some limitations.
The advantage is that it requires almost no effort from the developer or operator, in the sense that you don't need to reconfigure your service, you don't need to change the code, you don't need to redeploy — just drop it in and get whatever Beyla can get. Yeah, and another conclusion is that combining this black-box tracing with language-level support is what allows Beyla to get those distributed traces. So if you like it and want to give it a try, Beyla is freely available to download and test. You can go to our GitHub page, and there you will see instructions and links to the documentation and the main open source page of Beyla. On the GitHub page we also have a link to our community Slack, if you want to chat with us, and we are soon going to start organizing a community call. So once every month we'll have a call where you can just join in and chat, or yell at us, for whatever reason. But yeah, that's it. Thank you. Thanks a lot. Oh, so many questions. I'm running. You said that when you're tracing in Go, you are tracing the goroutines that are handling requests, but in Go you don't have IDs for these goroutines, and you don't have the relationship between them. And to make it worse, the Go runtime actually reuses goroutines for something completely different. So how do you do that without constantly tracking pretty much all the goroutines, all the time, in order to get your trace? Yeah, okay. So with eBPF you get superpowers. From a regular Go developer's perspective, you never actually have access to this information — for whatever reason, they won't give it to you. But with eBPF, I attach to the Go runtime. So the address in memory of the goroutine is my ID. Now I can tell when the goroutine starts and when it gets parked back; when it's reused for something else, it can be reused and that's fine.
But at that time I'll clear out all the information, because I know the goroutine is done. Because, like, superpowers. Hey, thank you for your talk. I'm one of those guys that manage a lot of infrastructure and code in general, and when you say, hey, you just have to add it and it sort of works out of the box, that kind of scares me, because potentially it can cause problems. One of the issues that we saw with these kinds of solutions is that if you inject a tracing header into a request, the request might be changed. And some protocols do request signing, like AWS Signature Version 4, for example, and they don't really like you injecting headers in the middle of a request, especially at a lower level. If you have some kind of agent in the code itself, you can work around that by disabling the tracing on those specific endpoints. But if you do it at a lower level, then you don't really have the visibility to be able to disable that, or to recognize that you are creating a request to such a backend. How do you envision working around those issues in the future? Because this is one example, but this will happen many, many times. Yeah, yeah, that's true. So if your requests are signed and that won't let you change the header information, then disable that feature: don't use what we do right now for propagating through the headers, use the black box. The black box is sort of the fallback. We've been toying with the idea that maybe in the future we'll let it work with an external storage of some kind, so we can get past the single-node restriction we have with the black box right now. But that's the very reason we designed it, because in so many environments injecting the header information is just not possible — I'm dealing with an interpreted language, no compiled methods, no dice, so I can't do anything for you. Thanks. Good question. Thank you. Thank you.
What’s possible in observability when we have frame pointers
All right, so yeah, what's possible in observability when we have frame pointers — that's kind of the talk. But let's start out with an actual use case of observability, right? So we have these workloads, we can graph the CPU cores, and we can see some things happening, and we might be wondering what's actually happening at these spikes. We can use profiling to figure out what happens at these individual spikes, just to understand: okay, in this scenario this was happening, and at another time something else happened. We can get profiles manually and compare them, or we do something called continuous profiling, where we just profile all the time, hopefully with low overhead — we can even do it in production. Not just hopefully: it's a reality, we can do it in production. So we can store all of these profiles and, over time, ask questions in retrospect whenever we want to, and we don't have to worry about missing data points. We have the security, or the ease of use, of just clicking on some spike and then getting a flame graph — or in this case an icicle graph, because it's top-down and not the other way around; we call them icicle graphs. You can see all the stack traces and introspect very nicely what's happening. I don't have a slide for this, but we can also diff these flame graphs, and then we see in red where things got worse and in green usually where things got better. And it's pretty obvious most of the time: if you have such a big spike, that's exactly the point where we need to look in such a flame graph and check out what's happening in the code. So yeah, that's a pretty good use case for observability, right?
But yeah, what are frame pointers? Before we come to that, a quick introduction. I'm Matthias Loibl, I'm a senior software engineer at Polar Signals. I work on Parca, which is the open source project doing a bunch of these things, but I also work on Thanos, Prometheus, and lots of other open source monitoring projects. Yeah, and hey everyone, I'm Jon Seager, I'm VP of Engineering at Canonical. I had a kind of interesting journey to open source, but at the moment I am leading the development of Juju and a whole suite of kind of enterprise apps which we call Charms. So if you want to get access to, like, the best Postgres on your infrastructure, or the best MySQL, or the Grafana stack, or Parca, or you want to build an identity stack with Ory and with OpenFGA and products like that — that's the effort that I'm leading. The orchestrator is called Juju, it's been around a really long time, Charms are all written in Python, and we're building out a big catalogue of operators that allow you to not just deploy those things but actually compose them all together and integrate them in a really common way, irrespective of whether your infrastructure happens to be bare metal or Kubernetes or VMs or on EC2 or on Azure, or some combination of the whole lot. So that's what I'm up to at the moment.
Awesome. Yeah, I'm looking forward to hearing more from you, but before we do that, let's talk about profiling again — or what profiling data is made up of. You can see these points in time: T1, T2, T3. At each point in time, we basically want to look at the current stack trace, what the program state looks like. And we can see that at T1 we had A, B, C, D; at T2 we had A, B, C, E — so slightly different; and then at T3 we had the same thing as at T1 again. So, just for the sake of the example, one stack was sampled twice — maybe it was executing for 20 milliseconds in total — and the other one for 10 milliseconds. We count how often we see these stacks and can then make assumptions about how much each is running. This is a sampling profiler: it only looks at these stack traces every so often, but over time we can really nicely see the big picture of what's happening. The good thing is, because it only happens every so often, the overhead is pretty low, which, as I touched on earlier, is pretty nice for our use case of figuring out what's going on. So how do we get these stack traces? How can we see the stacks that we then get all the memory addresses for, so that we can nicely format them using the function names, for example in the icicle graphs? The best case — and that's kind of the whole point of the talk — is frame pointers. Looking at this bit of C code, it's hopefully not too daunting in a monitoring and observability room: we have the main function at the bottom, and that calls a function, and so on — the functions call each other — and then at the very top it just goes into an endless loop.
And the important part in all of this: looking at the assembly on the right-hand side — I omitted the main function and a1 — we can see b1, and we can see that at the very beginning we are pushing and moving some registers around. Those are the instructions that push the frame pointer onto the stack, and then we call the next function. We push those registers so that we know, once the next function is done executing, we can come back to exactly that previous function and continue executing. One thing I want to mention here: in the past there were a couple of discussions about the overhead of using frame pointers. We have the push and move instructions, and once the function is done, it needs to pop that frame pointer, so there are a couple of extra assembly steps involved. Especially on 32-bit systems it wasn't great performance-wise, but unless you are a really, really special case, it should be fine for almost all workloads, even in production — and that's kind of the point of this. So basically, in our binary on the left-hand side, we can see the frame pointer setup: that's the first thing our assembly executes, putting the frame pointer onto the stack before doing the actual call to the next function. And before doing that call, we add the return address to our stack, so that once the function we are calling is done, we know where to continue in our current function — we need to know where the code that runs after the call continues. That's why we have the return address. We then actually run the function preamble and the function itself, and eventually we return: we are at the pointer that tells us where to go back to, so the function that we called eventually returns and we go back to the original function.
However, we then continue executing after that function call. So previously — can you see my mouse? no — we were over here, and now we returned one step after that, because we don't want to call that function again and go into an endless loop; we want to continue afterwards. But we want to know what called us. Basically, whenever we have a stack, we want to know which function called us, and do that all the way up until we end up in the main function, so that we know all the functions we have on the stack up to the point where we are now. That's walking the stack. And the really, really cool thing is that we can do this in eBPF. I don't know how many of you attended the previous talk — eBPF is kind of a hot topic right now. For us it's really cool, because we can write a small program in a C dialect, compile it into eBPF code, get it through the verifier, and load it into the Linux kernel. The way it works is — we actually don't use syscalls like the slide originally says — we tell the Linux kernel to run this snippet of eBPF code every so often, and in it we do the same stack unwinding, or stack walking, that I told you about two slides ago. So essentially, we start in eBPF, we get the context, we get the current stack pointer, and we look at the leaf of the stack — the very top, the currently executing function. We can use that to read the instruction pointer, and from there get the frame pointer. The special case here is that the instruction pointer has to be the return address minus one, because of the thing I told you about two slides ago — that's how we know where we were called from. And we do that the whole time, until at the end we read a saved frame pointer that is zero. This one
then means basically that we reached the end of the stack, and we know we can terminate — we reached the end of that stack trace. In between, for profiling, you can see over here that we do something with the stack, with the frame: we get the memory address of the executed function, so at the end we basically have an array of all the frames that were executed, with their memory addresses, and those memory addresses we can then use to get the function names. So having frame pointers makes regular profiling in eBPF super easy; we don't have to worry about special compiler configurations, because we can just assume that frame pointers are there for us to use to figure out the entire stack of the currently executing function. There are ways to do exactly this without frame pointers — shout out, I think it was in this very room one year ago, there was a talk by Javier and by Charlie about stack unwinding without frame pointers, using DWARF. I highly recommend it, it's really interesting, but that's something for another time. And then, obviously, it's not only the profiling use case: if we have frame pointers in the executables and those executing stacks, we can also use all the other debugging tools, not only for profiling — the BCC tools, bpftrace, perf, etc. — and they get the same kind of benefits.
So essentially what that means is that the possibilities become a lot broader — we can do a lot more things, because we only have these two memory reads per frame. For example, in bpftrace we can use the one-liner here to build a really simple but working profiler that uses `ustack` to get the user-space stack and counts how often it sees each stack, and that's super cheap then. The Go execution tracer, likewise, actually traces everything that's happening, and because unwinding has so little overhead, we can do things like that too. And once we have continuous profiles — the performance aspect — we can do something called profile-guided optimization, and just making profiling this cheap is somewhere I think a lot of innovation is going to happen in the future. As an outlook, there are some super new papers on context-sensitive, sample-based profile-guided optimization — something we are super excited about, because it will allow a lot more things to happen as well. But maybe another FOSDEM talk about that is going to happen in a year or two. So, bringing frame pointers to the masses — I'm super excited to hand over to John. Hey, all right. So, now that we've seen all the cool stuff you can do when you have frame pointers, I'm here to tell you how we at Canonical are going to make this available to all of you much more easily. If you didn't see this on our blog a couple of months ago: we have decided that from the 24.04 LTS, we are going to enable frame pointers for the entire Ubuntu archive on 64-bit platforms.
The caveat "on 64-bit" is because, back in the day, 32-bit CPUs obviously had far fewer registers, and so sacrificing a register to hold the frame pointer came with a much higher performance overhead. In reality, these days, with 64-bit, you're looking at on average less than 1%, unless you're in a very specific group. So if you're doing turbo, pants-on-head HPC stuff, or high-frequency trading, or real-time things where that 1% could really, really matter, perhaps this isn't for you, and we can make exceptions in the archive for those packages. But in general, for 24.04 you can expect to see frame pointers for the entire archive, through main and universe, etc. This is pretty exciting, because the LTS — I probably don't need to tell you — is going to be installed on many, many millions of machines, and then supported for at least 10 years by Canonical, so this is going to make a big impact for people who need these things. This stuff is often already enabled by the hyperscalers — people like Amazon, people like Netflix, people like Microsoft are already doing this in production — and now you get it for free as well, just by using Ubuntu. So, I mentioned there will be some — pretty much negligible, barely noticeable for nearly all use cases — performance impact. We're willing to wear that, because of what it actually enables in the medium term: for us to do a lot of work on our distribution. We're in the process now of running benchmarks on a pre-frame-pointer Ubuntu and a post-frame-pointer Ubuntu, ready for the release, and that will hopefully help us identify any outliers. So if we hit certain packages where we feel the performance hit is too much, then we will disable it for the first release, for 24.04, or we will try to work out what other optimizations we might make to that package to make it work better with frame pointers enabled.
So this will really, really help downstreams, I think, to gain the benefit of frame pointers and optimize their own workloads. If you are someone who just uses Ubuntu as a platform and you build your own code — let's say you use Python, or Go, or Node.js, or whatever — suddenly those big holes in your flame graphs are just going to disappear when you move to 24.04, without you having to do anything. This is really just the start, with which we want to make 24.04 a really focused release on performance engineering and performance itself. So what does that actually mean? Having the frame pointers is one thing, but you also need the tooling to actually utilize the frame pointers and inspect the stack. The folks at Polar Signals with Parca are one part of that, but we are also looking to include tools like bpftrace and sysstat and the perf tools installed by default in Ubuntu. Not in every single image — so those of you that are about to scream at me because you use the minimal image, or you ship 100,000 container images a month and you don't want to ship bpftrace in all of them: don't panic. We are essentially going to enable all of these tools by default anywhere we ship a kernel. So an Ubuntu server image, a full-size server image; that doesn't include LXD images, it doesn't include OCI images, but if you install Ubuntu on a server or in a VM, you will have bpftrace by default, you will have sysstat by default. Essentially a huge majority of the tools that Brendan Gregg describes as crisis tools will be there by default. And the reason that is super important is that if your system is in crisis, it doesn't matter whether the tools are in the archive: if your system is right on the edge, and then you hit it with a whole bunch of network I/O and disk I/O to go and get a package from the archives, that is potentially going to push the system over the edge.
It may not even work: in production, the system may not have access to the package archives, and so you just need those tools to be there — and we are going to make sure that happens. For places where we don't ship a kernel, all of these tools will get wrapped up in a new metapackage, so if you do want them in your LXD containers, in your container images, in your debug images, you will be able to get them really easily with a single metapackage. We are looking at what other compiler optimizations we can make across the archive as well — this might look like rolling out GCC -O3 for a huge part of the archive; we are not going to do that in one big-bang go, because there are some trade-offs there — and we are also looking at essentially not maintaining both a low-latency kernel and a generic kernel, and just shipping the low-latency package by default. None of these are firm, 100% definitely-going-to-happen in 24.04; these are the goals we are working towards before the release in April. Finally, some of you may have seen we have been doing some work on how to get Ubuntu and the archive to take advantage of the newer instruction sets: amd64v3, amd64v4, and so on. We actually have a build of the entire archive that uses amd64v3 — you can get it in a PPA, test it and benchmark it; it is faster, TL;DR — but we need to do a bunch of upstream work in apt to work out how we can essentially multiplex that, so that you still just go to ubuntu.com/download, download an amd64 ISO, and it does the right thing, without you having a massive long list of different instruction sets to choose from for amd64. So that work is coming, but probably won't land for 24.04. We also continue to introduce new patches into things like GNOME — we are still trying to get the GNOME triple-buffering stuff landed ready for 24.04, which gives a much smoother experience on the desktop as well.
This really runs from Ubuntu Server right up through to Ubuntu Desktop, and these tools will be available to desktop users too: you as a developer on Ubuntu should, in our opinion, have access to the same debugging tools that you find in your production workloads. On a side note, we are trying to do this at a really big scale at Canonical. We are hiring practice leads that will sit in a central team to build processes and tools and essentially give advice across our 40 or so products, and we are also hiring dedicated performance engineers for every single team, whether that team is doing Go, Python, Node.js, C, or whatever. If you are interested in that, talk to me afterwards, or check out canonical.com/careers — there are a couple of Canonical folks in here as well who you can talk to. If performance is your thing and you want to come and make use of frame pointers and make Ubuntu blazing fast, that is always an option for you. Finally, from my side: we have done a bit of work with Polar Signals, who have been helping us along the way. We have snap packages and charms available for Parca, both for the agent and the server. On any Ubuntu machine — you can seed this in a cloud-init file with a single line — you can snap install the Parca agent, give it a single config with a token, and start continuous profiling out into Polar Signals cloud. Or you can host this on infrastructure yourself — on machines, on Kubernetes, on containers, whatever it is — with Juju. We will continue to make improvements to that over time. It is a super easy way to get hold of this nice continuous-profiling hotness in Ubuntu. That is it — get in touch. Thank you very much for that. Looking forward to the Ubuntu release. Are there any questions? Questions, anyone? Once, twice, nobody? Okay, then thanks again, and next up we have Quickwit, I think, in 20 minutes. Thank you, bye. Cheers.
iputils project introduction
I wonder whether you have used ping or traceroute or tracepath — some of those implementations. Does anybody use arping? Okay, you are network administrators, I guess. And clockdiff — has anyone used clockdiff recently? No? That's a nice answer, thank you. iputils is a very old project. It was started by Alexey Kuznetsov in 1999. He was a Linux kernel network upstream developer; he was also the iproute2 upstream developer at the time. He ported some BSD sources to Linux, and he wrote some other tools for iputils. He maintained the project till 2002, and he used the netdev Linux kernel mailing list. Hideaki Yoshifuji was the next maintainer. He was also a Linux kernel network developer; he was doing IPv6 at the time. Hideaki improved the project a lot. He started to use Git, so we have some history now. He moved the project to SourceForge.net, which was popular at the time, and he continued to use the netdev mailing list. He introduced uClibc support, so it was not just for glibc. Although he made his last release in 2015, the last widely adopted release was probably the previous one, from 2012. Because iputils development slowed down, David Heidelberg forked iputils and moved development to GitHub in 2014. The initial goal was to upstream various patches from Linux distributions. At that time, he also did musl libc support and other things, because the tools were very old. A license cleanup was done, which people from the Linux distributions approved of, or were happy about. There were other people at the time, for example Jan Synáček and Pavel Šimerda; they were both from Red Hat. Pavel improved a lot and modernized the code: he started to use the newer C functions — getaddrinfo instead of the old ones, which were either for IPv4 or for IPv6 — and there were other improvements. Sami Kerola was the next maintainer, starting in 2017. He modernized the code a lot, and he also introduced the Meson build system.
There were other people at the time — Noah Meyerhans, and Yuri Chornoivan, who still maintains the localization. There could be another question: who needs localization for tools like ping? Really? I guess not many people, but I have been approached by people who really like the localization, so I kept it. I came in 2017, and there are obviously many people in the Git history — nearly 140 contributors — and there was history before that too. So, the current tools. iputils currently has ping, arping, tracepath, and clockdiff. ping sends ICMP echo requests to a network host. It's very old code, from 1983; I think it's the most important iputils tool. And it supports both sockets: the raw socket, and the ICMP datagram socket, which is more secure. Unfortunately, not all distros use that — I see some of the people from Debian here, so I would recommend to stop using the raw socket. But the reason it's used is systemd, which is not used on all systems: Debian supports other init systems, and that is the reason ping wouldn't work there by default. Below we have an example of pinging suse.com — a very basic example, I'm sorry. ping obviously supports a lot of functionality, so there are loads of switches; this is just a simple example. arping sends ARP requests to a network host. It was written by Alexey Kuznetsov, and it supports IPv4 only, because the ARP protocol itself is for IPv4 only. So, again, a basic example. tracepath traces the path to a network host, discovering the MTU along the way. Again, it was written by Alexey Kuznetsov. There's a small example, tracing the path to suse.com. And clockdiff: that is again very old code, from 1985, from an unknown author, and it supports IPv4 only. We removed some obsolete tools in 2021. Those tools were using some experimental protocols which were not relevant any more, or there were much better implementations in other tools, so there was no point maintaining something which is not really used, or is kind of buggy.
Because the tools we have in iputils are basic network tools, written a long time ago, there are obviously other projects implementing similar tools. Just to highlight some of them: fping is a very enhanced ping. It's written in modern C. It allows you to ping any number of targets, and its output is designed to be parsed, so it's good for use in scripts. It also doesn't perform reverse DNS lookups by default, which is in some cases faster. mtr — "my traceroute" — is a tool which combines traceroute and ping. It has GUI and ncurses interfaces, and it's also available for FreeBSD. Very nice tool. The next two projects are collections of tools. BusyBox is for low-power embedded devices; it has many tools, and among them are ping, ping6, and traceroute. It's somewhat compatible with the tools from iputils, but it implements just part of the functionality. inetutils is an old GNU project which also has rsh and stuff like that — a very old project, not that active nowadays — and it also has ping and traceroute. So, the future — iputils' future, what we should do. We should rewrite the code in modern C. We concentrate mainly on ping, so the other tools are neglected. I wonder whether we should keep clockdiff. Tracepath is also questionable, because mtr is much better, and there is traceroute, the original project, which is also better than tracepath — so it's a question whether to keep it. The project would need reviewers and real network developers. We should write tests, because we have CI tests but we don't have functional tests, so sometimes regressions slip in. The tools could have JSON output and color output. So that's it — do you have any questions? Sorry, I didn't quite understand how systemd, or the lack of it, can force the use of raw sockets. There is the sysctl tool, which handles kernel parameters for networking. The ICMP datagram socket is by default allowed just for root. So if you want ping to work for normal users and you want to use the safer ICMP datagram socket, you need to set something.
You set that in /etc/sysctl.conf, or however that file is called, and this works differently with systemd and with other init systems. So if you were to use BusyBox's init system, you would lose this configuration. I would say the solution should mainly be just not to block this, and there is a Debian bug report, but no one is working on it. Any other questions? Hello, I have one question: what is the future of iputils? What's the next feature, or the roadmap you're actually working toward, say five or ten years out? So, those tools are very old, so one would say the work has been done. But the problem is there are bugs, and there are improvements which can, you know, introduce regressions. My motivation for joining the development was to keep ping working, because I need that for kernel network testing. So I would say there is no big future, unless someone finds it interesting to rewrite the tools into modern C as an exercise, because the code is terrible; it's 40 years old or something. So, no real future, but I think JSON output would be a good feature, and color output would also be good. So some of those, but mainly maintenance mode.
ZeekJS: JavaScript support in Zeek
Hello. I hope you can hear me well. Thanks. My name is Arne, I work for a company called Corelight, and I work on the Zeek project. Quick question: who of you is using Zeek? Anyone? Three, maybe. I want to talk about JavaScript support in Zeek, but first, since there are maybe not many people who have heard of Zeek: it's a passive network security monitor. It has existed for a long time; development started in 1995. It's open source, BSD-licensed. It was called Bro until 2018; Bro isn't really a name that you should use for a project anymore, so it was changed. If you look at it from a high level, you feed it packets at the bottom, either from live network traffic, like a live interface, or from a PCAP file, and what you get out at the top is a set of logs that describes what kind of activity is in your network traffic. If you look under the hood, there are a few more details. It's an event-driven system, it has a custom scripting language, and we have something we call Broker, a messaging interface to talk between separate processes. To give you a flavor of the logs that come out at the top, these are single entries for single connections. On the right-hand side there's the conn log, which is the most central log, with the identifying five-tuple. We also support IPv6, but that's an IPv4 example. The service field indicates what kind of protocols Zeek was able to discover within that connection, and at the bottom is statistical information, like packet counts and duration. On the left-hand side you see a more protocol-specific log, in this case the QUIC log, which has been added recently. There, for example, you can see the server name from the client hello. And if Zeek is able to decrypt the initial packet of a QUIC connection, it forwards the crypto payload to the TLS analyzer, which can then extract that kind of information, and we put it in a log field as you see.
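Log entries like the conn record are commonly emitted as JSON, one object per line. A sketch of consuming one in Python: the field names follow Zeek's conn.log schema, but the values here are made up for illustration.

```python
import json

# One trimmed conn.log entry in JSON form (values invented,
# field names follow Zeek's conn.log schema).
line = '''{"ts": 1706862000.0, "uid": "CxT1z2",
  "id.orig_h": "10.0.0.5", "id.orig_p": 52100,
  "id.resp_h": "93.184.216.34", "id.resp_p": 443,
  "proto": "tcp", "service": "ssl",
  "duration": 0.42, "orig_pkts": 12, "resp_pkts": 10}'''

entry = json.loads(line)

# The identifying five-tuple the talk mentions, plus the discovered service.
five_tuple = (entry["id.orig_h"], entry["id.orig_p"],
              entry["id.resp_h"], entry["id.resp_p"], entry["proto"])
print(five_tuple, entry["service"])
```

This is the shape of data a downstream pipeline would index and query; Zeek itself stops at producing the log lines.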
That is the kind of data that you would push to Elasticsearch or Splunk and then do your analysis there. That's not Zeek's job; we just produce logs. Okay. It's a fairly old system with a custom scripting language, and it looks sort of like this. This is just a sketch, it's not actually going to work like this, but it sketches how the QUIC log entry is created. There are two event handlers, one for initial packets: whenever there's an initial packet, that event is raised, and we create an info record, which in the end represents the QUIC log entry. Then there's another event, ssl_extension_server_name, that is raised whenever there's an SNI extension in the client hello. You can handle it and basically enrich that QUIC log entry with the server name, or with the first name; that's just a heuristic here. At the bottom is a Log::write call where we actually produce that JSON entry. So yeah, it might look a bit unusual in the beginning, but it's a fairly powerful language that has some network-domain-specific features, which also allow you to write detections with Zeek and build advanced analysis within that scripting language. What's not so great is the interaction with the outside world. That Log::write, for example, is a thin layer above the whole C++ logging framework; it is not implemented in Zeek script, you have to do that in C++. And usually, for any extension that you want to do, you have to resort to writing a plugin in C++. If you don't want to go the C++ route, we do have support for asynchronous HTTP requests, but if you look a bit under the hood, the thing is spawning threads, launching things that write stuff into a temp directory and into a file, and then it reads them back and gives them to the script. So it's a really scary implementation of an HTTP request.
So the idea was: why don't we use a language that maybe does provide all that stuff, has a rich ecosystem, and is well known as well. In particular Node.js, because of the libraries and the npm ecosystem; that was the idea. And as a twist, we are doing this as a plugin, not by patching the Zeek source code base. We just want to build something external that adds support to Zeek for also using JavaScript. Quickly about plugins: they're basically shared libraries that Zeek loads at startup, and within a plugin you can access Zeek's C++ API or also hook into certain execution paths. For example, whenever new connection state is created, you can implement the SetupAnalyzerTree hook and attach something to that connection, usually analyzers, a protocol analyzer we would say. There are also ready-made components, where you basically implement against an interface, but there's no component for a whole scripting language, so we resort to the first two mechanisms to implement the JavaScript support. Okay. The top hopefully doesn't look too unfamiliar if you know some JavaScript. There's an object that is called zeek, sort of a global object, with a well-known "on" function where you register an additional function for a certain event name, so it looks more usual than the Zeek script example. And as an addition, there's the http or https module from Node, and there's also an example of how you could post the connection UID and those SNI server names mentioned before to an HTTP endpoint, just from within Zeek. So we want to get there. And the first step is to prevent Zeek from interpreting .js files as Zeek script, which it would do by default.
You can implement the HookLoadFile hook and basically check whether the file name that Zeek is attempting to load ends with .js; returning one basically says, don't bother about it, I'm taking over, and we stash away those JavaScript files. That works for files on the command line and also for those loaded with the @load directive. Step two is to initialize the whole JavaScript engine, the V8 engine and the Node.js environment. There's documentation about that; there's a link here. This is just a sketch, it's a bit complicated, but there is good documentation about it. What also happens at that point is that we load the JavaScript files, so the top-level zeek.on calls are actually executed, which means we need to provide this zeek.on call already. Sorry, this is actually step three. I need to slow down a bit, just for myself. So, step three: the call to zeek.on basically gets an event handler name and a listener function. With that event handler name, we can use C++ APIs to look up the event handler object, which is a Zeek-specific object belonging to that event name. From that we can get a script function, which usually has a list of bodies, and each of the bodies contains a statement list, and then there are further statements. Usually the script execution is interpreted: it just runs down all those statements and executes them. What the plugin can do is add another body into that list of bodies and provide a custom Statement subclass which, when executed, really just calls into JavaScript and executes a V8 function. When this first worked, it was really exciting: you see a hello printed from Zeek and a hello printed from console.log. It was nice to get that done. What was not so nice is that you need to map types between the two languages; there are different types on the Zeek side, and JavaScript has other types.
For example, the address and subnet types on the Zeek side we currently just map to strings in readable form. It's not the most performant, but it was nice to have JSON.stringify work and see IP addresses like that. I'm not going to talk much more about this. The last step was to integrate both IO loops. Zeek has its own IO loop, which is kqueue-based, and Node.js also has an IO loop, which is libuv-based. Usually the Zeek IO loop is blocking on an event call, waiting for a packet to be served, or a block of packets, or for a timer to expire or something else to happen, and then it acts on it. What the plugin can do is register something called an IO source, and in the case of libuv, the plugin takes the backend file descriptor of the libuv IO loop and installs it into the Zeek IO loop. That means that whenever something has to be done on the Node.js side, like a client connecting on a listening socket, the backend file descriptor of the libuv loop becomes ready and the Zeek IO loop wakes up, recognizes that it was the Node.js file descriptor that became ready, and transfers control over to that loop; the plugin runs the Node.js loop non-blocking until there's nothing left to be done, and control is then transferred back to Zeek. Yeah, that was the most tricky part of the whole plugin. I didn't talk much about the architecture picture from before, but where I would position this is, and it's not completely technically correct: we have extended the event engine a bit with the Node.js event engine down there, and then also the Zeek script language, so we have extended everything to be able to also use JavaScript instead of the Zeek script language. As a summary, I find it really impressive that we could do this without actually patching Zeek. Everything was in place to pull this off, which is a testament to how Zeek was built over the years, really. We're not going to replace the existing Zeek scripts with JavaScript; that is not the plan.
But for the integrations you want to build, or maybe just proofs of concept, things for which you previously needed to quickly use C++ and find some C++ library to do whatever, you can now tap into the npm ecosystem and JavaScript and try it with that. The plugin comes with Zeek 6.0 by default, so if you have libnode installed and you compile Zeek, it will just be supported, and our container images also have it built in by default. Any questions about that? Hi, Arne. Have you evaluated the performance of this? Does it impact performance a lot? I would say it runs slower than just Zeek and the interpreted scripting, mostly because we need to translate between the two type systems. I would also currently position it to not necessarily run JavaScript in the packet path, unless you are really adventurous. We also have Zeek processes, like the proxy and the manager, that don't do packet processing; they have a lot more cycles there. If you run JavaScript there and do things like pulling in IOC information, that's one use case you can run on a node that is not in the packet path. But we would be interested in performance numbers. Thanks. Have you explored other languages as well, apart from JavaScript? Not really explored; I have Python in my mind as a proof of concept, but JavaScript is asynchronous, it's non-blocking, that's the paradigm there, and that's what we needed as a replacement for Zeek script. Thanks. Any more questions? Thank you very much.
Multi-network in Kubernetes: No batteries included
Check one two, check one two, all right. Thank you everybody for coming to our talk about multi-network in Kubernetes and how there are no batteries included. My name is Doug Smith. I'm a maintainer of Multus CNI, which is a multi-networking plugin, I'm also part of a working group related to it, and I'm joined by Miguel. I'm Miguel Duarte, a software engineer working for Red Hat, particularly in the OpenShift Virtualization networking team. I'm also a member of the Network Plumbing Working Group, and yeah, I sometimes work with Doug on this kind of stuff. Awesome. So, we've got to rip through this pretty rapidly, and it's a pretty complex problem space, but we're going to run you through it as quickly as we can. We're going to look at what exactly multi-networking is in Kubernetes, show you the problem we're looking at, the current set of solutions, and also the future solutions that we're looking at. Even if you're not necessarily interested in the multi-networking problems in Kubernetes, we hope you'll be interested in the problems we've identified, which we think are common to a lot of engineering problems in general, and especially to open source communities. We also have a demo for you to watch at home, because we're short on time. So the first question we should be asking is: what exactly is multi-networking in Kubernetes? The thing is, it kind of isn't anything, because it's not something that Kubernetes is actually interested in solving. What do I mean by this? The Kubernetes networking model pretty much says that any pod on the cluster can reach any other pod in the system. Cool. How does it do it? One interface on each pod, connected to the same network. One interface. If you need more, well, that's outside of Kubernetes. The community pitched in together and implemented that out-of-tree. But first, why would you want multiple networks in Kubernetes?
For instance, network isolation: let's say you need to meet compliance requirements where you have to separate traffic not only in software but also physically in the network. This kind of thing happens every day, and for that you need multiple interfaces. Or, for instance, you want to implement a firewall or something like it; well, you'll need at least two interfaces. So this is a reality, there's a need for it, and Kubernetes does not do it on its own. That's the problem: you don't have batteries for this. You can do it, the community has provided ways for it to happen, but it's out-of-tree, and you need to deploy a bunch of stuff for it: controllers on the nodes, more and more pieces. So it's solvable, but it's not in-tree, it's not native to Kubernetes. Furthermore, while it works, its user experience is challenging, to say the least. It's cumbersome to use, it feels clumsy, and there are a lot of ways to get it wrong. If you use an attribute that doesn't exist, or make a typo in it, well, what happens depends on the implementation. And at the end of the day, if something is error-prone, a lot of people are going to make errors with it. In a word, this is pretty much arcane knowledge that you need in order to use it. So, the current solution is Multus CNI. Multus CNI is a CNI multiplexer. CNI is the Container Network Interface, an API that lets you specify how to run plugins that talk to this API in order to plumb your networks: how you're going to connect the network interfaces in your pod to the rest of your network in Kubernetes. What Multus is designed to do is multiplex CNI plugins. You use custom resources, which are extensions of the Kubernetes API; they're not natively part of the API, they're a way to extend it, and they give you a way to, quote-unquote, kind of trick the platform. So you add Multus CNI into your network.
You populate these custom resources with CNI configurations, but CNI configurations are JSON, and Kubernetes resources are YAML from a user perspective, so you kind of mix both of those; I'll give an example of that in a moment. We also have an ongoing effort for Kubernetes-native multi-networking. What this would do is take this concept that we have out-of-tree and get these pieces natively into the Kubernetes API; we would actually extend the API. As a building block we may actually implement them as custom resources, but that's a detail. The one thing to keep in mind, though, is that this will be an extension of the API without a native implementation, so it will still require an add-on itself, which is also a bit of a challenge. But we really like the idea of extending the API. If you take a look here, you'll see a Kubernetes pod spec. You've probably seen them before, but we use an annotation, and annotations are freebies: anyone can add an annotation. We have a specification for how it should look, and we validate against that specification, but it's got JSON in there. So if you're walking through this object in the API and you hit this, what do you have to do? You have to parse it, which is no fun, no fun in the least. With the Kubernetes-native approach, it's all going to be YAML, so if you're writing a Kubernetes controller and you're using client-go, you're just going to walk through this easy as pie. It should be a lot easier. But we have to ask ourselves: what does the future look like? It's kind of a complicated scenario. Number one, we'll probably still have Multus CNI. It's out there, people use it, and they're going to continue to use it. And then we might also have Kubernetes-native multi-networking. So we've got these two things, but there are a bunch of other projects in this space that may be up and coming as well. CNI 1.0 has been around for quite a while.
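To make the JSON-inside-YAML mix concrete, here is a sketch of a NetworkAttachmentDefinition custom resource, the CRD used with Multus, carrying a macvlan CNI config as an embedded JSON string, plus a pod requesting that network via the annotation. Names, the master interface, and addresses are illustrative.

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-conf
spec:
  # CNI configuration is JSON, embedded as a string inside the YAML resource.
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "ipam": {
        "type": "static",
        "addresses": [ { "address": "192.0.2.10/24" } ]
      }
    }'
---
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    # The free-form annotation that asks Multus for the extra interface.
    k8s.v1.cni.cncf.io/networks: macvlan-conf
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
```

A controller walking this pod object has to parse the annotation (and the JSON config behind it) by hand, which is exactly the ergonomic problem the native-API effort wants to remove.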
CNI plugins run on disk on the host; they're not actually containerized. CNI 2.0 would be a step forward, making it possible to containerize them, and would also give us an opportunity to update the API. We also have in the works the Kubernetes Network Interface, KNI, a proposal which would bring CNI and Kubernetes potentially closer. CNI itself is container-orchestration agnostic; it doesn't relate specifically to Kubernetes. It was invented in parallel with Kubernetes: before Kubernetes won the container orchestration engine war, there were a bunch of different container orchestration engines, and CNI tried to fit the needs of all of them. But Kubernetes is the winner, and we kind of need a way to get a little bit closer to it. So, let's go over the lessons we have learned. The first one: sometimes you have political problems. We want to extend the pod network and the pod object to get these items natively in there. And maybe it's not so much a political problem as a people-and-intentions kind of problem: what are we trying to solve here? Not everyone sees this exactly the same way, and this is a very core part of Kubernetes. If you've ever used Kubernetes, you've definitely spun up a pod, you've definitely touched a pod object before, or an abstraction of it like a Deployment or whatnot. So extending this, adding networks to it, is hotly contested. And there's more than that. First, as Doug said before, APIs are forever. Not only in the sense that you have to maintain stuff backwards-compatible, but also in the sense that Multus exists and solves the problem. If you want to have multiple interfaces in a pod, Multus is already doing that. Are you actually going to update all the manifests of your deployments, stuff that is running in production, to comply with this new API? Well, maybe not. So there's that to keep in mind. Next, scope creep.
Everybody wants to solve a different problem, and it's very hard to stay focused on, let's say, the least common denominator of the problem space. Just doing that has been extremely challenging over these last six, seven, eight months, a year; I don't know, I lost track of it. And finally, handling a technological problem is a lot simpler than dealing with people and opinions. It's very, very easy to clash on those. It's hard enough to choose a restaurant between four friends going out tonight; it's even harder to agree on what the API should look like for something so critical, so central and paramount, as the pod spec, for instance. And here, we would really like you to take a look at this demo, but again, better to do that at home: just scan this, or hit the link there. It's short, a couple of minutes, and you'll see how the current effort for native multi-networking looks from the user's perspective. And yeah, that's pretty much it. So, any questions you have, fire away. Any questions? Can these additional interfaces be used to connect devices that are outside of the data center, via VPN for example? That's a problem I've been trying to deal with, and I couldn't find manageable solutions. Yeah, thank you. Okay, the question is whether these interfaces can be used to connect the pod, for example via VPN, to external networks. Oh yeah, absolutely. Something you can do is use these to connect to existing resources that are already in your network, so a VPN absolutely could be an example. Oftentimes the reason people use these additional networks is that they have existing infrastructure, and they deploy Kubernetes to move to more of a cloud-native approach, but they have legacy systems that they need to integrate with.
So if you've got existing networks, this would be a reason to do that, kind of as a sidecar that you could use to go out to a network like that. Great question. Can you talk a bit more about KNI and how it relates to the multi-network problem? That is an excellent question. I would say one thing we're trying to solve is that the way Multus CNI works, for example, is somewhat inefficient. You have this flow to create a pod, and in that flow there's a call to CNI, which today would usually go through Multus. When Multus is called, it stops that creation of the pod and goes and queries the Kubernetes API itself. KNI may be one opportunity to make this flow linear, passing information directly to some type of multi-network solution that already has the information from the API, instead of having to interrupt the flow to call the API. That's one possibility. Another possibility is that, at least from my perspective, and as Miguel said, APIs are forever. So Multus, and I'm a maintainer of it, will certainly be around for a while, but as we get Kubernetes-native multi-networking, it may also become a compatibility layer between the new way of thinking and the old way of thinking. Is that helpful? Thank you.
Declarative Networking in Declarative World
So welcome to the next one in this track. My name is Mateusz, and I will be talking about declarative networking now. Yes, that's very good. Yeah. We have already spent quite some time talking about Kubernetes and how networking is done there. I'm very glad the people from Multus took the hard part of explaining multi-networking at the level of containers. I'm also glad they didn't say anything about host networking, because that is what they don't do, and it is what I do. So we are smoothly moving lower in the stack. I work at Red Hat, like they do. I'm based in Switzerland, and when I'm not touching computers I do farming. I actually like it much more, but it doesn't pay my bills so well. Here we are. I don't do AI; everyone does it, but no. I will skip the why of multi-networking, because Federico was talking about this, and if there are reasons for you to do multi-networking, you know that you need to do it; if you don't, then you don't. It all started because clouds never cared about multi-networking. You go to AWS, GCP, Azure, you pick your three letters. You get a VM, it has a single network interface, and that's it. But at some point you realize you need more network bandwidth and all that kind of stuff, and you're going to start doing bare metal; it won't fly anywhere else. And once you start doing bare metal and network configuration, you have probably more than once seen the very ugly NetworkManager config file. It's just a static file, and the syntax is somewhat opinionated. It's okay once you learn it, but it's still a static file, and it flies if you have one server. It flies if you have three servers. But does it still fly if you have 300 servers? I'm not sure. And one problem is that those are all files, and they don't apply changes automatically.
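For readers who haven't seen it, the static file in question is a NetworkManager keyfile under /etc/NetworkManager/system-connections. A small illustrative fragment (interface name and addresses are made up):

```ini
# /etc/NetworkManager/system-connections/eth1-static.nmconnection
[connection]
id=eth1-static
type=ethernet
interface-name=eth1

[ipv4]
method=manual
# address1 takes "IP/prefix,gateway"
address1=192.0.2.10/24,192.0.2.1
dns=192.0.2.53;
```

Nothing validates that the gateway is actually reachable from that subnet, and editing the file does nothing until the connection is reactivated, which is exactly the failure mode described next.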
So you modify your file, and until you restart the interface or the machine, you may not even notice that you've made a mistake. You may have a configuration that has flown for the last five years, but in reality it shouldn't, and the only reason it does is that you've never rebooted. There was another talk about this before: you shouldn't have your servers running for two years at a time, but that's another story. So what has been done to change this? So that you don't need to modify this file manually, NetworkManager gives you a command, nmcli, and you can modify those configurations using a somewhat nicer syntax. You can say, you know, modify connection, IP address, yada yada, and it has slightly better error handling. As you can see in this example, I never distinguish slash and backslash; sometimes I will write what I think is slash 24, but it's not really a slash, it's a backslash, and I will see an error, invalid IP address. That's super easy. But then I fix that, well, I think I fix it, but I'm putting in an IP gateway which is not in the subnet of my IP address. It cannot fly; this configuration is utterly wrong, but syntax-wise it's perfectly fine, and the system will allow me to do it. So, is that really the state we want to be in? Well, we could discuss that. We have some basic protection against some basic bugs, but we could do better. So we got this tool, nmstatectl. We still live in the realm of NetworkManager, but we want to try to be a bit more declarative now. We want to change the syntax so that, in the end, we can do this for Kubernetes, and Kubernetes has this very nice notion of APIs: everything is well defined, everything is declarative. So let's try making host networking declarative too. How about we create an API which would look almost like a Kubernetes CR and allows changing this?
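An nmstate desired-state file of the kind the talk describes looks roughly like this; the schema follows the nmstate project, while the interface name and addresses are illustrative. You would apply it with `nmstatectl apply`.

```yaml
# state.yaml -- a desired state, not a list of commands.
interfaces:
  - name: eth1
    type: ethernet
    state: up
    ipv4:
      enabled: true
      dhcp: false
      address:
        - ip: 192.0.2.10
          prefix-length: 24
routes:
  config:
    - destination: 0.0.0.0/0
      next-hop-address: 192.0.2.1
      next-hop-interface: eth1
dns-resolver:
  config:
    server:
      - 192.0.2.53
```

The point is that this declares what the network should look like, IP address, routes, DNS in one place, and the tool is responsible for making reality match it, rather than you issuing imperative per-field edits.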
So let's write a YAML in which you define your state, and I think this is the biggest improvement over the previous file: you define how you want your network to be configured, and afterwards we invent a tool which takes care that this configuration really works. I don't want to dig into the details of this example here, because it shows some basic configuration: IP address, IP routes, DNS server, so in general something that you always need. I claim that this syntax is much nicer than the syntax of that file; we can argue afterwards, but I will still claim it's nicer. And at this moment there are no containers in the game. We are talking about vanilla Linux; you can do this without knowing anything about containers. But now, how about we wrap it in an API and a kind, and take it to Kubernetes? So let's make a CRD out of this and use everything that we built in the previous three minutes to have something that is declarative and that Kubernetes will be reconciling. In this scenario, and I think that's a pretty descriptive use case, you have multiple network interfaces and you want to bond two of them. Doing this using all the static NetworkManager yada yada is ugly. So how about you just write that you want to create a bond, and let something else make it happen, and let something else make sure that this bond is there all the time, no matter what you do, without you SSHing to your Kubernetes nodes and all that kind of yada yada; let it be the safeguard that once you define this configuration, it is there. When you define a pod, you may delete the pod, but if you have a Deployment, a DaemonSet, all that kind of stuff, something is there to recreate the pod. Why can't we have something similar for networking? Well, we can, so let me do a very short demo of that. So, here is what I have now, what I created, and we will go through the examples in a moment.
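Wrapped in a CRD, the bond use case can be sketched as a kubernetes-nmstate NodeNetworkConfigurationPolicy like the following; policy name, selector and interface names are illustrative.

```yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bond0-workers
spec:
  # Apply only on worker nodes.
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
      - name: bond0
        type: bond
        state: up
        link-aggregation:
          mode: active-backup
          port:
            - eth1
            - eth2
        ipv4:
          enabled: true
          dhcp: true
```

Once this CR exists, the operator keeps reconciling the hosts against it, the same way a Deployment recreates a deleted pod, which is what the demo then shows.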
So first of all, and this is something I didn't mention: Kubernetes CRs, and the Kubernetes API in general, try to protect you from doing stuff that won't fly. You have very extensive validations at the level of the Kubernetes API, and it's super amazing. I would like to have something like this here too. For example, I will try to configure, on one of the workers of my Kubernetes cluster, some stupid DNS server that simply doesn't exist. For people not familiar with IPv6: I defined here a link-local IPv6 address, so there is a very, very low probability that something like this actually exists in your network. On the other hand, I have this worker; let's just look at the DNS configuration. I'm running on my default setup, it's there, and I will now try to apply this configuration, which I told you is wrong, and you should trust me that there is no DNS server behind this address. Okay, so we created this CR, and okay, for a moment it's not what I said, because we see in /etc/resolv.conf, which we are watching all the time, that this address appeared here. But this is only because we are doing a kind of reconciliation of it, and I have a timeout of 20 seconds on this controller now. So you can see that this CR is in the state progressing; yeah, 20 seconds have passed, so it failed to configure, and my configuration is reverted. I won't go into the logs of what happened, but you need to trust me, this server doesn't exist, so it makes no sense for a real server in my cluster to get this configuration. So that's it: we revert that, and you get the feedback that, sorry, we cannot apply this configuration because it's nonsense. Apart from this, what I can also do: I have another file in which I will simply take some additional network interface and add an IP address. Very simple; we do this very often when we are provisioning servers, but maybe you just got some additional network interfaces installed, or whatever; it doesn't really matter.
At some point you want to configure the next network interface. So, this server, we don't need it anymore. The output is big, but you want to look at this part: on this interface we don't have an IPv4 address, we only have the IPv6 one, because you always get that one. So I'm going to apply this configuration now. That should not be magic: the address appeared, but that's boring, you apply some configuration and it's there. But what I will do right now is manually delete this IP address on the server, and then make my controller, which is behind every CRD in Kubernetes, reapply this so that the IP address is back, because if I define something via a CRD, I don't want some admin going around my servers and changing this configuration. If we agree that we are configuring host networking via the Kubernetes API, let it stay like this. So I'm deleting this. We don't have it; we have the previous state. Now I will do a small hack for the purpose of this demo, because I realized that the timeout is set to five minutes, so we would need to sit here for five minutes and let the controller realize that something changed. I will just kick the controller. We were on worker two, which is this one, so I will just kill it; the only thing I did is delete this pod, so it's not like I somehow magically applied this configuration again. And we see that the IP address is back. Again, if we just sat here and waited for five minutes, drinking or whatever, this would be the same. So that's it. Also, for the sake of completeness, I have a setup with a proper DNS server. Well, I already applied this one, so there is no point in doing it again, but you've seen the wrong one, so you have to trust me that the good one would also be configured there. And the slide deck is here. That concludes the demo. So, some final remarks, because, yeah, that was really a lightning talk.
So, all this stuff that I showed you: the backend, which is nmstate, is written in Rust, because why not, because we can. It uses NetworkManager as a backend, which is pretty obvious, but we could discuss that, and this is something that could come afterwards. Today it works using NetworkManager, because this is what I do and this is what the customers I have behind this want. But if there is someone without NetworkManager, with a strong reason not to use NetworkManager, who would like to have this, we can discuss, and I would be very happy to discuss. Of course, there is a Kubernetes operator behind this, because this is what I just demoed. And you are not bound to using this as a CLI tool and that kind of stuff: there is a usable API, so you can get a Rust crate for it, you can get a Golang library, you can get Python, probably something else, but those are the most popular, and I assume those three make everyone in this audience happy. And yeah, we have a moment for questions. If you want to talk more about this, you can find me on the Kubernetes Slack, and yeah, that's it. My personal wish would be that, you know, Kubernetes, and we know it from the previous two talks, never really cared about managing host networking. No one really wanted to take this into the realm of Kubernetes. Well, it's not like I expect that we'll get this API into Kubernetes upstream now, but I wish. So yeah. Maybe we have time for just one question. With networking, you can do the worst things and bring the whole network down. So what if, for example, you misconfigure the IP address of a node and the node becomes unreachable from the controller? Can all of that be fixed? Yeah, so this is exactly what I showed with the example of DNS.
I could have shown it with the example of an IP address. If you created a CR that configures, for example, an IP address and the gateway, and applying this configuration would make your node unreachable, then we would revert this configuration exactly like we reverted the DNS, because that's the purpose of the operator: it has safety checks, so after it applies a configuration, it checks that all the connectivity works as before. In this case we had DNS, so it applied the new DNS and was checking in the background: can I still resolve names? After 20 seconds it realized: no, I cannot. I'm reverting this change, and the CR is marked as degraded. Exactly the same would happen if you set an IP address and you don't get connectivity there. All right, thank you. Great, thanks.
Remediating thousands of untracked security vulnerabilities in nixpkgs
Okay, up next we have delroth. He's going to be talking about improving security in nixpkgs. Thank you. Is the microphone actually working? Yes, good. So I'm delroth. I've been working on nixpkgs for a few years now. I've been involved in some security remediation efforts, and recently I've been working on NixOS infrastructure. So it all comes from a story; let's start with a story. Sometime last year, this vulnerability dropped kind of silently. Chrome released an update saying: hey, you should really update today, because people have actually been exploiting this in the wild and it gives code execution. So the interesting thing is that we actually patched that in Chrome really quickly, but also, we in nixpkgs and NixOS, and some other distros, started realizing that this is not actually a Chrome vulnerability. It was reported everywhere as a Chrome vulnerability at the time, but it was actually a vulnerability in a dependency of Chrome, which is libwebp, an image parsing library. So we patch libwebp and not just Chrome, and everything is solved, right? Everyone depends on libwebp, so whenever the version is updated in nixpkgs, they'll pick up the update, it gets rebuilt, everything is magical, and we don't have to do anything else. Well, then it took about a month of work to actually make this happen. So this is the tracking issue, which I've linked at the bottom, for trying to actually fix this vulnerability in nixpkgs: not just in Chrome, not just in libwebp itself, but in everything else that's in nixpkgs. I've highlighted part of it here, which is that some applications bundle their own version of libwebp. Each of these needs to be updated separately by nixpkgs maintainers. And that's not just Chrome; that includes some other web browsers, for example. It did not include Firefox, but it did include Thunderbird, because the packaging was slightly different, et cetera.
"See below for a list of all the known applications that need an update and their status." So this is big. And yeah, as you can see, this was about a month of work. I'm not going to go through the whole list, but there's a lot of stuff. This list is probably not even complete, because, as I'm going to get to, we lack tooling on this, we lack statistics, we lack data. So this talk is trying to give an overview of this problem, to bring awareness to it, and to bring up a few solutions for how we could actually do things better. So why is this happening? Why did we have to fix so many things? This is a phenomenon known as vendoring. Vendoring is when a piece of software decides that, instead of depending on a library that it gets through, like, pkg-config or through the general build environment provided to it, the software is just going to copy the source code of that library, put it somewhere in its own source directory, and then, you know, it's easier to build because people don't have to install dependencies. The problem with that is, since they've pinned the version of the dependency, whenever an update needs to happen to it, the update doesn't just need to happen to the dependency; it needs to happen to everyone who has pinned the version of the dependency by copying it into their source repository. To some extent, vendoring also happens with lock files. Lock files aren't exactly the same thing, because technically you're not copying the source code, you're just enforcing that your software can only build with one specific version of a library, sometimes even providing the hash of the source code or the binaries that must be used to build the software. Which means that, in practice, you're not copying the source code, you're just copying the hash of the source code, but you're still basically pinning it and making it impossible to do any kind of update.
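To make the blast radius concrete: with a shared library, one update fixes everyone on rebuild; with vendored copies, every copier needs its own fix. A toy illustration (the package metadata shapes here are invented for the example):

```python
def update_plan(packages, lib):
    """Split packages into those that pick up a fix automatically on
    rebuild (they link the shared lib) and those that each need a
    separate patch (they vendor their own pinned copy)."""
    auto, manual = [], []
    for name, meta in sorted(packages.items()):
        if lib in meta.get("vendored", ()):
            manual.append(name)        # pinned copy: must be patched itself
        elif lib in meta.get("linked", ()):
            auto.append(name)          # rebuilds against the fixed shared lib
    return auto, manual
```

The `manual` list is what turned one libwebp patch into a month-long tracking issue.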
So, for this specific libwebp vulnerability, we spent about a month on this, and we did not actually fix everything. That's the sad part: tens of people, I don't actually have a full list here, but tens of people spent probably hundreds of hours combined on this, trying to fix software that we have in nixpkgs but whose maintainers weren't super active, or chasing upstreams. Because if upstream has copied a version of libwebp into their source code, then you need them to actually go and fix the problem, or you need to apply patches, but patches are fragile: they'll just break on the next update, et cetera. And yeah, even though we spent hundreds of hours on this, we did not actually fix everything. We fixed, I think, about 50% by count of the number of packages. What we did do is spend some time categorizing and saying: okay, these are the actual high-risk things that are likely to get exploited, that are connected to the internet, parsing untrusted input; and then there is the rest, which is like, you know, maybe we can get away with not updating it now, and sometime in the future upstream will realize they are actually vulnerable to something and will maybe fix it. Even if you look only at the stuff we categorized as high risk, there were some packages in there that we did not actually get fixed. We had to mark them as insecure in nixpkgs, because even though they are internet-facing software that parses untrusted image files, email clients for example, they did not get an update within a month for a critical vulnerability that the Chrome people were saying was exploited in the wild. So, let's play a little game. Well, I don't know if we'll do audience participation: guess the number of libwebp copies in nixpkgs.
So I've counted. I've actually been building some tooling as part of the remediation for this libwebp vulnerability, and we have a better idea now of how many packages are copying libraries that we have in nixpkgs. For libwebp, we had about 116 different packages. And by "packages" here, you know, we had a whole talk about what a package is; I'm talking about something that is built by Hydra. So, nixpkgs stuff that is not unfree and not marked as insecure. And I'm counting only one architecture, grouping by package name. So we have about 116 libwebp copies. But libwebp is actually fairly recent; WebP is a modern image format. What about libpng, which is significantly older? 237. libjpeg, which is maybe even more common than libpng? 253. zlib: zlib is a really small C library that people have been using for maybe 30 years to decompress gz files and zip files and stuff like this. We have about 761 of those, copied throughout nixpkgs. So let's say, for example, there was a vulnerability in libpng. How do we go and fix it? Well, given that it took about a month for 116 packages with libwebp, and we got about 50% of them, I guess libpng would take about two months maybe, and we'd also get about 50% of them. It's not really a great outcome. So is this actually a problem? How often do these libraries actually have vulnerabilities? Also, okay, we have copies of them, but maybe they're actually being kept up to date and it's not actually that bad. So here, for libpng, is a grouping by version. It turns out that we actually have enough information to figure out which version of libpng is being embedded in all the packages in nixpkgs. And you'll see that the top of the distribution is very much recent versions: 1.6.37, 1.6.39, 1.6.40. That's actually pretty good. nixpkgs, unsurprisingly, is at 1.6.40 right now.
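The counting methodology just described (only Hydra-built, non-unfree, non-insecure packages, one architecture, grouped by package name) might look roughly like this sketch; the record fields are assumptions, not the actual tooling's data model:

```python
def count_copies(scan_hits, arch="x86_64-linux"):
    """Count distinct package names that vendor the library, using the
    filters described in the talk: skip unfree and insecure packages,
    keep one architecture, and group by name so multiple outputs or
    rebuilds of one package count once."""
    names = {
        hit["pname"]
        for hit in scan_hits
        if hit["arch"] == arch
        and not hit["unfree"]
        and not hit["insecure"]
    }
    return len(names)
```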
We have... well, I don't know if it's a git version, because you have 1.7.0 in there, and I don't actually know; it's definitely not as used as the others. What I've also looked at is the release date for some of these versions, and some of these versions were released more than 10 years ago. We actually have two packages in nixpkgs right now using a libpng version from 2004, which is kind of impressive. It's like, you know, was x86-64 even a thing at the time? I don't know. But somehow it works. Actually, I've not tested it, maybe it doesn't work, but it's in there: you can nix-build it and you'll get a binary that has a libpng from 2004. Does it have vulnerabilities? Yes. There are about 12 different critical CVEs that give code execution, buffer overflows. Some of this might be mitigated these days, because we have vulnerability mitigations as part of the operating system, as part of compilers, and stuff like that. So it's not exactly clear how many vulnerabilities apply to these old versions. Another thing is that there isn't really anyone right now who finds a new vulnerability in libpng and goes and says: oh, I'm going to test it on this version from 2004, just to see if it actually applies. So a lot of the vulnerability database is out of date and doesn't really contain the right information to even check against that. I've mentioned lock files. Lock files are kind of a new problem. It turns out that if you go back 10 years, we didn't have software in Rust and Go and JavaScript; at least we didn't have as much as we do now. Java kind of did lock files a bit with Maven even at the time, but it's mostly a new phenomenon. And the good thing with lock files is that it's actually really easy to get the full transitive list of all the dependencies, because they're in the lock file. That doesn't mean that people are any better at actually managing their dependencies, unfortunately, even though there is good tooling to do so.
For example, for Rust, there is this tool called cargo-audit. cargo-audit is a tool that takes a Cargo.lock file and tells you all of the vulnerabilities that apply to the dependencies locked in that Cargo.lock file. So I used some tools that I wrote to go through every single Rust package currently in nixpkgs. That means looking at every single derivation that has a cargoDeps in it and extracting the lock file from that. And what we find by doing this is that 62% of all Rust packages in nixpkgs right now have at least one vulnerable dependency locked in a lock file. I'm describing this as a nixpkgs problem, but it's not entirely a nixpkgs problem: we're just fetching it from upstream. People are locking dependencies; we don't really have control over that, the way we did for Python, C, and C++ dependencies. And upstream is just not doing as good a job as distributions were doing. Of the vulnerable dependencies I mentioned, about 750 are actually high or critical severity based on CVSS score, which is a terrible metric, but it's about as good as we have. So yeah, if you pick a Rust package in nixpkgs at random, you have a 40% chance that one of its dependencies has at least one known high or critical vulnerability. That doesn't mean it's exploitable, but let's say that even one percent of these are exploitable: that's seven packages in nixpkgs with exploitable high or critical vulnerabilities. That's still not good. And one percent is just a random number I picked. Yeah, so as I mentioned, it's an ecosystem problem, a general open source ecosystem problem. I don't know that this specific lock file thing is something we can fix on the nixpkgs side. nixpkgs has some fault: we have some Rust software that's just out of date, for example, and when we do that, well, the lock file is also out of date.
But from the ones that I've manually inspected, this is not the majority of cases. In the majority of cases, nixpkgs is packaging the latest version from upstream, and it just contains insecure dependencies. What is causing vendoring in nixpkgs? A few things. We don't actually try to prevent it. I've checked; I was really surprised: nixpkgs does not have any documentation, any policy against vendoring. There's nothing that says, you know, if a piece of software has an option to use the system libpng instead of its own bundled version, there's actually no documentation saying that we should prefer using that option. A lot of people do it because it's good practice, but not everyone does. And we don't really have a way to prevent it for the newer language ecosystems, like Go, Rust, JavaScript, as I mentioned. You don't have a choice; you just have to vendor stuff, because we don't have Rust libraries in nixpkgs. We just have the leaf software. Same for Go, same for JavaScript. Well, now the same for JavaScript; we used to have node2nix for a while, which kind of added Nix derivations for libraries, but it was automatically generated anyway, so it's not like we could really do much about it. And finally, until recently, we didn't actually have any tooling to try to detect and measure this problem. So it was just hidden below the waterline, and we couldn't really go and say: oh hey, there's a new derivation being proposed by someone, a new package being added to nixpkgs; is it actually vendoring anything? People would have to go and manually check, and nobody was doing this when reviewing packages, because it's just a lot of effort. This is potentially stuff we could now do automatically with some of the newer tooling that I've been writing. As I mentioned, we don't have policies against vendoring, but it's even worse than that in nixpkgs.
We don't really have policies even about building from source. It's preferred, but there's actually no preference expressed anywhere. I've checked again today and I could not find it. So people just go and fetch things from AppImages, for example. Upstream ships an AppImage; it's too complicated to build; I'm just going to fetch the AppImage and run patchelf to fix the paths to dependencies. And then the problem is, well, you don't really know which libraries, which dependencies, upstream used to build these AppImages. It's usually not great. This is something we had with libwebp, for example, where I think Anki, the flashcard software, was just using the AppImage. And it was vulnerable, because it was built in some build environment from 2018 or something, which was, of course, not receiving any security updates. We fetch .deb files. We fetch... people are very creative about how to get binaries. We fetch tar.gz files. We fetch static Go binaries. Let's not even talk about JavaScript, because you can just fetch a tar.gz and unpack it somewhere, and it's fine, because when would JavaScript software ever have vulnerabilities? And, yeah. Some of the distros famously have strong preferences for building from source. Debian has really good policies regarding vendoring, which have always been kind of the gold standard, I think, in the distribution world. We should probably do some of that. How do we address Rust, Go, npm, et cetera? I don't think we can. I think it's an upstream problem. But what we could probably do is make it clearer to users that they're actually using insecure software. It's not really a problem that the other big distros have been hitting much, just because nixpkgs is much bigger in scope. We put everything in nixpkgs. We don't have an AUR. I mean, there is the NUR, which some people use, but the bar for what goes into nixpkgs is very low, right?
We don't actually have many policies saying, you know, let's keep this out of nixpkgs because it's not well maintained by upstream. We don't really do that much. So, by being a huge package set, we have the problem that we have some pretty bad software that's not really being kept up to date; we have stuff that's just not maintained anymore by upstream. And I feel like the way we should fix this lock-file insecurity problem in nixpkgs is by making sure that, if upstream is not actually maintaining the lock files, we inform the users and make them aware of the risks. We currently have this knownVulnerabilities bit that we can put on a package. The problem is that it's extremely crude and extremely annoying to work around. It stops evaluation; it's not a warning, it's a critical error: you're using an insecure package. And so what people do is just allow every insecure package, because that's the easiest way to work around the error. So, yeah. Tooling: as I mentioned, until recently we didn't really have any way to detect this. I have actually written a bunch of things to try to detect vendoring. One tool is called grep-nixos-cache, and it does exactly what the name says: it takes a list of store paths, which we get from Hydra, and it will go and fetch every single store path that Hydra has built, which is usually a few hundred thousand, and it just runs some signatures on them. It looks for strings that you find as part of the implementation of certain libraries; if the library has been vendored or statically linked, you will find the string in there. And sometimes you can even get version numbers and stuff like this. And another project I've been working on is...
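The string-signature idea can be sketched like this. libpng, for instance, embeds a human-readable banner of the form "libpng version X.Y.Z" in its copyright string, so a byte-level regex over a built store path's files can recover vendored copies and often the exact version. The signature here is illustrative, not grep-nixos-cache's actual rule set:

```python
import re

# Byte signature: the version banner libpng compiles into its
# binaries, which survives even in stripped executables.
LIBPNG_SIG = re.compile(rb"libpng version (\d+\.\d+\.\d+)")

def scan_blob(blob: bytes):
    """Return the sorted, de-duplicated libpng versions whose banner
    appears somewhere in the raw bytes of a built artifact."""
    return sorted({m.group(1).decode() for m in LIBPNG_SIG.finditer(blob)})
```

Grouping the recovered version strings across all store paths gives exactly the per-version distribution shown earlier in the talk.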
I've called it something like the nixpkgs vendored-vulnerability scan, because I wanted it to also include the vendoring detection above. But this one is currently specifically doing lock file analysis: finding all of the lock files for Rust and JavaScript and doing automatic vulnerability detection based on those lock files. Conclusion: we have new tooling. We have a better idea of how vendoring goes in nixpkgs, and it's not great. And it's a problem, because we cannot actually fix security vulnerabilities in base libraries right now. We tell ourselves that we did, by fixing the library itself, while we have 100 different instances of the library left unpatched. How do we fix it? Well, awareness: now all of you probably know about this, and when you review new packages, maybe look at whether this is happening. More discipline: I think we should have policies about this. I have not thought about the exact policy we should have yet, but we should probably have one. And better reporting for the cases that we cannot fix ourselves, which is the insecure lock files, most of the time. Yeah. Here we go. If you have any thoughts or comments, please ask questions now. Otherwise, this is my contact info and some links to the tooling. Thank you. Thank you for that. That was terrifying for someone who's come from the Debian world. Exciting as well. Is there a really simple social approach we could take to this: add another tick box to the default pull request template saying, have you checked that there's no vendored crap where you could be using a system library? I think it would help. At some point, if we just continue adding checkboxes, people are just going to... people are already ignoring a lot of them. Has anyone actually checked the sandbox checkbox in the pull request template anytime recently? I see two people raise their hands; the rest of us have never touched that checkbox.
Yeah, I'd prefer if it was automated through tooling, if we could detect some of it automatically. And I'd prefer if we at least fixed the policy first and, you know, figured out the actual edge cases before we start asking people to look for stuff, without being accurate about what to look for. But yes, we probably should be doing some variant of this. One of my favorite things to do is to package, archive, and preserve old software, and some of that work has been done as pull requests to nixpkgs. Sometimes it doesn't get merged because, you know, it's got an old dependency on Qt4 or something, so they say: oh no, we can't merge this. And that does prevent some software from getting into nixpkgs, but there is still a lot of software in there that managed to sneak in. And since we don't have a policy, it's kind of ad hoc: some things get in, some things don't. Some people launch crusades against, like, old Python, and they don't want that there. And yeah, it's kind of messy. So what do you think about it? Because I think there's a real value proposition for archiving old software, because tarballs.nixos.org will archive the source code; it will be around forever, and you'll be able to reproduce it in 20 years. So what do you think about, you know, banning old stuff and striving only for perfection, versus keeping everything in nixpkgs and just accepting everything? Yeah, I don't think we should necessarily be striving for perfection.
I don't think we can anyway, but the problem right now is this: in the case of old software, usually one of the things blocking it is that it uses library versions, it has so many dependencies in nixpkgs, that having this old software induces costs on other maintainers, because they have to care about this old software that will never actually be updated to use a new API or something like that in the library. So I think we should figure out a way to include this old software, or this less-maintained software, in a way that doesn't use up the bandwidth of all the maintainers. Right now we don't have a way to distinguish this software from the stuff more people care about and more people use, which means that whenever security remediation needs to happen, the people doing it don't have a way to distinguish these things, and we use our bandwidth on stuff that maybe is, like, your old software. That's why I think we should have better categorization, better ways to inform the users about which category a given piece of software falls into. We don't really have any of this right now in nixpkgs, and I don't know how we've managed to survive this long without such a system. I think we just burn a bunch of maintainer time on stuff that, really, we shouldn't, and should just accept as being broken. Hi. As you went around interacting with upstreams to get these sorts of issues fixed, I'm pretty sure some of these were things that other distros were also dealing with. As you encountered them, what did you see in terms of those interactions with upstream, where you had requests coming from other people in a similar position to yours, but from other distros? Yeah.
So a lot of the cases where I've actually had to contact upstream myself were things that were not actually packaged in other distros, just because, you know, Debian doesn't actually package .NET software, for example, and surprisingly doesn't package much Go software. Like, if you want to get Grafana from Debian, I think they still don't have that in their repositories. I mean, it's not free anymore, so they have a good reason now, but, you know, they never did, right? Do they even package Prometheus? Some pretty basic software that people would expect to be able to apt-get; you have to use external repositories, because they don't have the right tools to package Go code. So I think, because nixpkgs is much broader in scope, we have a lot more things to care about, and we've had to do a lot more of the talking to upstreams. There is stuff that has been useful to all the distros; other distros have contacted upstreams before us in some cases; and usually, when we do contact upstreams, they are receptive to this. The problem is when they just don't reply. For example, for libwebp, we had the issue that the main library people use for WebP in Go was just unmaintained. We filed the bug, and the maintainer has still not replied to it to this day. So you have 500 users of this library that indirectly have a vendored, vulnerable WebP version. What do we do? We had to go and manually contact some other users of it and say: hey, you use a library that's not actually maintained anymore, you should fix that. And suddenly the tree of things you need to contact grows and grows. It's a complicated problem. But does that add value? It does. It does add value, yes. I mean, in general, it adds value to the whole software ecosystem; it's just not a nixpkgs thing. But it's tiring, right?
You know, it's not feasible that we would be the only people caring about this for every single vulnerability. Great, let's have another round of applause for delroth. Thank you. Thank you.
Nix for genetics: powering a bioinformatics pipeline
Up next we have Alexi, talking about Nix for bioinformatics pipelines. So thank you everyone for coming. For five minutes I will try to make a kind of different presentation and explain how Nix can help save patients. It's not a clickbait title, I promise. So, I am a doctor in training, but I also have a background in computer science, so it's a kind of mixed presentation, and I'm working in France, at Besançon Hospital. When we are dealing with patients, we want basically three things. First, we want to give accurate results, because for these patients a diagnosis can be life-changing. Second, we need to be reproducible, because all the doctors trust us to give accurate results every time. Finally, we want to be as fast as possible, because there is a high demand for results. I'm working in a rare-disease setup, where obviously things are rare, so they're hard to find. And how do we do it? Well, it's a mix of computer science, expertise, and state-of-the-art technology. So here is a very rough scheme of how everything works. We start from a blood sample of a patient, and we extract the DNA and sequence it on this machine thing. Unfortunately, the machine doesn't do everything, and we need some bioinformatics in there. And the bioinformatics doesn't do everything either: we need a human at the end of the pipeline, which is why there is a CSV file that a human has to read. Basically, what the bioinformatics setup does is figure out a list of candidates for the diagnosis and try to filter the results down. For example, it can go from one million candidates to a thousand. If it filters too much, we can miss the diagnosis. If it doesn't filter enough, well, the human will have a really hard time trying to parse the CSV. When I say pipeline, it's a really fancy word for just a set of command-line utility tools, but we also have databases in there, which in our setup are just compressed text files.
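The filtering stage described above, narrowing, say, a million raw candidates down to a reviewable shortlist, is conceptually just successive predicates applied to a candidate list. A toy sketch (the field names are invented; real pipelines filter on much richer annotations):

```python
def filter_candidates(variants, max_population_freq=0.01):
    """Keep only variants that pass quality control and are rare in
    the general population: common variants are unlikely to explain a
    rare disease. Each step shrinks the list the human must review."""
    passing = [v for v in variants if v["qc_pass"]]
    rare = [v for v in passing if v["pop_freq"] <= max_population_freq]
    return rare
```

The trade-off the talk mentions lives in thresholds like `max_population_freq`: too strict and the diagnosis is filtered out, too loose and the CSV becomes unreadable.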
And when I say pipeline, we just feed data from one CLI tool to another. Now, how can Nix help with this? Well, as a medical lab we have to be reproducible; it's literally in the law. So Nix is a perfect fit, because we can pin the software and its dependencies, like byte-by-byte. So that's done. Then it would be great if we could run on the high-performance computing cluster, and in our region the folks running our cluster agreed to install Nix, so now we can run our current production there with Nix. Two things we didn't do with Nix. The first was to manage the whole workflow: there is actually a tool for that in Nix, but it's more of a niche thing, so we preferred to use a more common tool. The second thing we could do in Nix but didn't is to manage the large databases, because in our setup they live in a different folder from the Nix store, so we cannot install them that way. But the support is there in Nix. One last thing: I really enjoyed the community; it was a really nice interaction, as I'm sure everyone knows. But it's also kind of a slow process, because I tried to package something myself, which is not easy at the beginning, and as you know there are like 5,000 open pull requests on GitHub, so feedback can sometimes be a bit slow, and I'm working on this in my spare time too, so it can also take some time. But, for example, the support for large databases was added after a few conversations on Matrix; that was really fast. I hope you take away some key points from this, but if you want to know more, you can send me an email and I'll be glad to answer. Thank you.
Automatic boot assessment with boot counting
Hi, can you hear me? Up next we have Julien with automatic boot assessment. Okay, hello everyone. My name is Julien Malka and I'm a PhD student at Télécom Paris, and today I'm going to talk about automatic boot assessment with boot counting. I will talk about why we need automatic boot assessment, what automatic boot assessment is, and one implementation of it, which is systemd-boot's boot counting, and I'll show a demo. So why do we need automatic boot assessment? Because we are using NixOS, we have something I call the NixOS benediction: it's very difficult to break your system. You really have to want to break your system. Even if you mess up your NixOS configuration, you can just roll back to a past generation and be saved by the NixOS magic. But sometimes this benediction has limits. Let's say you are the administrator of a remote server and you perform some kind of server update, say a kernel update, and you mess up: you choose a kernel that cannot boot your root partition. At the next boot, what is going to happen is that it's going to fail to boot, and if you don't have a BMC or any remote management, you will need physical intervention to revive the server. This is the kind of problem we solve with automatic boot assessment. Boot assessment is any kind of technology, really, that can automatically assess whether a boot entry is bootable or not. And we have one example, which is systemd-boot's boot counting. Boot counting is a feature of, as I said, systemd-boot, and the idea is the following: each boot entry has a counter when created. Each time systemd-boot tries an entry, the counter for this entry is decreased by one. If the entry is booted successfully, and I will define what "booted successfully" means, then the counters are removed permanently.
But if the counter for an entry ever reaches zero, then the entry is marked as bad, and it is sorted to the end of the boot menu. Let me go just a little bit more in depth into how this works. The counters are embedded in the entries' file names. You have the file name, then the plus separator, then the number of remaining trials, then the number of failed trials. So this is generation nine: it has four remaining trials and one failed, and it had five trials set at the beginning. Counters are decreased by systemd-boot when it boots the entries, by simply renaming the file. And you define what a successful boot means by scheduling whatever units you want: they need to be started successfully before the boot-complete target. So when the boot-complete target is reached, the entry is renamed by the systemd-bless-boot unit, which removes the counters, and we are done with this entry; we consider it good forever. Okay, let me show you a demo. Right, so here I am in a VM, I am booted, and I'll show you that in the configuration.nix I have enabled the feature and set the number of trials to two for any entry. The VM is booted successfully, but now I will make a massive mistake; I am emulating a mistake. I have a bcachefs file system, and I will relabel it as ext4. It means that this partition will definitely not get mounted, and when I rebuild, it will even change the kernel to a kernel not supporting bcachefs. So now it's rebuilding my configuration. You see, when I am done rebuilding, I get no error, nothing; I think everything is good. I show you the boot entries: there are five boot entries, and the last one has a counter, you see the -5+2, two trials for this entry. And now I will reboot this VM. So what happens when I reboot? At the beginning, everything is fine. My generation five is sorted first. It will try to boot it. The kernel crashes. It reboots.
Now it's still sorted first, because we have two trials for this entry. Again, the kernel crashes and it reboots. And now you see it's sorted last, and we are going to boot generation number four. And of course we boot it successfully, and that's it, that's the feature. It's currently available as a PR; it will be merged very soon and be available in the next stable release. Thank you.
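The file-name counter bookkeeping described in the talk can be sketched in a few lines. This is only an illustration of the rename rule (`<name>+<tries-left>-<tries-done>.conf`); the real logic lives in systemd-boot itself, and the entry names are invented examples.

```python
def record_failed_try(entry: str) -> str:
    """Return the name systemd-boot would rename `entry` to after one
    more failed boot attempt: tries-left goes down, tries-done goes up."""
    stem, _, counters = entry.removesuffix(".conf").rpartition("+")
    left, _, done = counters.partition("-")      # fresh entries have no "-N" part yet
    failed = int(done) if done else 0
    return f"{stem}+{int(left) - 1}-{failed + 1}.conf"

# generation nine from the talk: four tries left, one failed
print(record_failed_try("nixos-generation-9+4-1.conf"))
# nixos-generation-9+3-2.conf

# a freshly created entry with two tries, as in the demo
print(record_failed_try("nixos-generation-5+2.conf"))
# nixos-generation-5+1-1.conf
```

Once the boot-complete target is reached, the `+tries-left-tries-done` suffix is simply dropped from the file name, marking the entry good permanently.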
Typhon: Nix-based continuous integration
Hi everyone. So today we're going to speak about Typhon, our software for Nix-based continuous integration. Let's say, for the sake of the argument, you're a Nix enthusiast, and you're asked to set up CI at work. So what do you do? You convince your boss to use Nix, because that's great, and you install Hydra, the de facto software for CI with Nix. So your job is fantastic using Nix, but soon you realize that not everything is perfect. First you need to install the thing, and it's not easy; it's painful. Then you need to configure the plugins, and each time you change the configuration, you need to redeploy the whole thing. It's also hard, because when you want to change a plugin, you actually need to write Perl scripts, and you need to redeploy it again. Last thing, when you want to do deployment, all you get is this runcommand thing, which is a bit hard to use and which you don't really like. So you start to dream about something much simpler, something declarative maybe. Maybe you want your plugins to be user defined, basically, with Nix maybe, and you would like some better deployment story, more in line with the Nix philosophy, with declarativity and reproducibility. Okay, so in this dream, what does it look like to configure CI for a project? Well, at first it looks a lot like it does in Hydra: you set up an attribute set of derivations which are going to constitute your jobs. But then you write a Nix expression for your project that looks a lot like this one. Here the makeGithubProject function takes all the information needed for a GitHub workflow: the repository, of course, and some arbitrary deployment rules. And of course you're going to need secrets, like GitHub tokens and SSH keys, to set GitHub statuses and do remote deployment. This expression is fed to Typhon through the flake URL.
And once Typhon has spawned your jobs, it's going to use the project expression to build actions. Actions are scripts which are user defined and Nix built. They are run in a sandbox and triggered by Typhon on various occasions, to provide the features that would be provided by Hydra's plugins. For instance, the most important hooks triggered by Typhon are before and after every job, to set statuses, of course, or do any kind of deployment. In a little more detail, an action is sandboxed with only access to the store and to the Internet. It does not have access to the local machine, so for instance it does not have access to secrets for other projects. It takes JSON as input, containing the decrypted secrets and of course contextual information about your job, and it outputs JSON to communicate with Typhon. Thanks to actions, Typhon is completely forge-agnostic. Actually, all the communication between Typhon and the forge is done through actions, meaning Typhon can fit a lot of different workflows. But how do you write actions? Well, of course, you use Typhon's Nix library that lives in Typhon's flake. It would be quite frugal at the beginning, but soon it would grow to fit a lot of different forges and various kinds of deployments. And the goal would be, of course, to have an ecosystem of actions like we do for GitHub Actions, but much better, and using Nix instead of YAML. A few words about how you would code something like this. Of course, you would use Rust, with technologies like Actix and Diesel for the back end, and a nice web app using Leptos. And so you would start coding, and soon you would have a prototype. Soon the prototype would run CI for itself. So it would be time to present the project to the Nix community at FOSDEM and tell people to try it. You would still warn them, though: it's still a prototype.
Everything you've heard about today is maybe not yet fully implemented, but still, it's ready for beta, and you're waiting for feedback, for issues, a lot of issues, maybe even a contribution to the actions library. And all that would be left for you to do is to thank everyone for listening to you.
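The action contract described in the talk — JSON with decrypted secrets and job context on stdin, JSON back on stdout — can be sketched roughly like this. The field names and the status-setting logic are invented for illustration; they are not Typhon's actual schema.

```python
import json

def run_action(stdin_json: str) -> str:
    """Hypothetical Typhon-style action: parse the JSON handed in by the
    CI, pretend to set a commit status on the forge, and answer with JSON."""
    ctx = json.loads(stdin_json)
    # a real action would use ctx["secrets"] (e.g. a GitHub token) here
    # to call the forge API about the job named in ctx["job"]
    return json.dumps({"job": ctx["job"], "status": "success"})

# Typhon would pipe this in; the action never sees other projects' secrets.
print(run_action('{"job": "build", "secrets": {"github_token": "..."}}'))
# {"job": "build", "status": "success"}
```

Because the CI core only ever exchanges JSON with these sandboxed scripts, the forge-specific knowledge lives entirely in the actions, which is what makes the tool forge-agnostic.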
rix: an R package for reproducible dev environments with Nix
Alright, hello everyone. My name is Bruno. I'm a statistician and data scientist, data janitor, whatever you call it, in Luxembourg. Are there some people that use the R programming language here? Statistics? I see some of you. Okay, cool. Maybe this will interest you then. So what is R, very quickly? R is a programming language that's been around for 30 years. It's a FLOSS implementation of S, and it's mainly used for statistics, machine learning, data science and all that kind of thing. And it comes with all these built-in objects that we like very much when we work with these things: data frames, matrices, formulas, models, etc. That's all built into the language. Here is a little hello world. With the base language you can do linear regressions, you can load data frames or CSV files very easily, you have formulas that define your model very easily, and you can do that with the base language. But you can also extend the language with packages, and these are really called packages. So you have dplyr, you have tidyr; these are very popular packages for data manipulation, but there are many others. And this here is a typical data manipulation pipeline in R: you start with your data frame and you keep piping functions onto it with arguments, and you do your aggregations, you do whatever you want. And so we have, as of writing, around 23,000 packages that are available through the two biggest package sets, CRAN and Bioconductor. I wrote here that all of them are available through nixpkgs; I don't think that's entirely accurate, not all packages are available, but most of them are. Personally, I've never found a package that wasn't available through nixpkgs. So what this means is that we could use Nix to set up an environment with R, with the packages that we need, etc., and use that to work.
But that's not really a thing in the R ecosystem, this per-project environment. If you use Python for data science, very typically you will see people start with a virtual environment, with a specific version of Python and specific versions of packages. In R that's not really a thing. At most, what R users do is per-project libraries of packages; that's a thing. And if you need more, people would typically use Docker, and there's been the Rocker project for that, which really popularized the use of Docker in the R ecosystem. That being said, with a colleague called Philipp Baumann, we wrote the rix package. So rix is itself an R package which provides a really familiar interface to R users. It's a standard function: you can specify the R version that you want, and you can specify the packages that you want. These packages can come from CRAN, they can come from Bioconductor, and they can come from GitHub as well, if they are hosted on GitHub only. You can set up TeX packages as well, typically something R programmers want too. And system packages, we called it like this, maybe it's not the best name, but these are other tools: if you need Git, if you need whatever, you can add it there as well. And you can specify IDEs, because for RStudio, which is a popular IDE for R programming, there's a wrapper that needs to be installed as well, so this takes care of that. And it generates an expression that I'm not going to show you, but it's a Nix expression that will install all of these things. It will look automatically for the right revision, and if you put in Git packages as well, it will also generate the hash for you, because there's a little server that we set up that downloads the package, computes the hash and then sends it back to the user. You can also use the with_nix function within R, so you can execute any function or any R script inside a sub-shell with a specific version of R.
And within the interactive session that you are currently running, you can then get that result back and continue working with it. This is useful if you are doing a reproducibility study and you just want to execute one particular function from a paper, for example, and get that result back. So you can do that quite transparently as well. If you're interested, there's this website that you can check out. It's not yet released on CRAN, but we are aiming to do that in a couple of weeks. Thank you for your attention.
Preparing a 30 year-long project with Nix and NixOS
Hello everyone, my name is Rémi Nicole, I'm this dude on the internet, and I work for the CEA, which is the French Commissariat for Atomic Energy and Alternative Energies. But the CEA is quite big, so I should say technically that I'm CEA, DRF, IRFU, DISC, and all the way at the bottom. What do we do? Well, we do control systems for big physics experiments like particle accelerators. So what is a particle accelerator? Basically, it's a bunch of hardware. There is a plasma chamber that produces protons, and then you need to give the protons some energy, you need to steer them, and you need to do some diagnostics. For example, if you want to make the protons turn, you need an industrial power supply and an electromagnet, and so you need to control the power supply to control the strength of the magnet. And so we use a framework which is called EPICS, a well-known acronym in this field: it means Experimental Physics and Industrial Control System. It's quite old software. I'm showing you the old logo because it explains quite well what it does: a single protocol, which is represented by the line, and some clients and servers. So we have, for example, the input-output controller, which does the control of the power supply, and we also have some graphical clients, an alarm system and an archiver. And so what do we do with EPICS? Well, we package it with Nix. And so you can see the logo of Nix kind of eating the EPICS logo. I'm not going to talk too much about that, because chances are you don't have a particle accelerator at home, so you won't really need this project. To be fair, someone did use EPICS to control a beer brewing system. Yeah, beer people are weird. So what about the network? You need a network as isolated as possible, so you don't exactly need to do that many updates. And usually you don't want to update anything.
If something works, you don't want to touch it, because it costs a lot of money to restart the accelerator. So what you usually need is good resilience of the system, and you have a lot of assumptions to rethink. We could be asked to modify some software ten years after it went into production. So what I'm going to present is how we use Nix and EPICS for this kind of resilience. The first thing is that we use flakes for pinning projects, which is good because anyone can just pick the project back up and it should compile and work. There are some exceptions when you have such a large time scale. For example, some software might not be available in ten years; maybe GitHub went down, because Microsoft or something. The solution we have is to do a lot of CI and use our own cache server extensively. And by caching, I mean caching really everything. Usually what you want to cache is the runtime dependencies, but what we want here is to also cache every build-time dependency. What we should have then is a system where, even ten years after it was deployed, we could modify anything down the stack and pick any project back up. We also need to cache flake inputs, which is a bit weird to do. We also need to cache Nix itself, because maybe a future Nix will have some deprecation and won't evaluate the old Nix code. And so the system that we have, thank you Maurice for working on this, is that we have a CI server, in our case GitLab CI, which builds our derivation, and we also build a build-time derivation, which depends on all the build dependencies of the software. Then the CI calls a webhook on the cache server, and the cache server pulls all of those dependencies.
And why do we have a separate cache server? With this system we can use profiles, because over time the cache server will fill up, and so we need to figure out which old versions of the software we can clean up. Yeah, I have hopes that Nix can be used for building resilient systems. And if you're curious, here are some links, and if you want the build-time derivation, there's some example code here. Thank you.
Running NLnet on NixOS
Alright, thank you everyone who moved and made some space. We can now start the next talk. Jos is going to talk about using NixOS at NLnet. Hello everyone. So, yes, my name is Jos van den Oever. I'm an employee at NLnet. NLnet is a Dutch foundation. And who here has heard of NLnet, by the way? Are there any hands down? Wow, this is amazing. That's very cool. Yeah, so it's an honor to work at NLnet, and this talk is going to be about how we use NixOS there. There were so many hands. NLnet is the organization which here at FOSDEM might be known for, you know, spamming stickers everywhere. We have the stand in the K building with so many stickers. And each of these stickers here is a project we have supported. But not all of the projects that we have supported have a sticker, because, you know, command line tools might not have a logo all the time. As you can see, NixOS is up there as well, as well as many other projects. I'm wondering, who here has ever had funding from NLnet? See, that's fewer hands. But we have funding for open source projects. So if you have good ideas, if you're part of a community that has these tenacious bugs that nobody ever gets around to funding a fix for, or if you have a protocol which has not been implemented in your particular library, or whatever good idea you have, just look on our website at what other projects we've been funding and, you know, write your own proposal. Proposals to NLnet are not difficult to write. It's one form. You say who you are, you say what your plan is, what the outcome is, what it's going to cost, or what you think it's going to cost, and then you press send. And every two months, there's a new call. So this is the tagline that we use since this week, actually. We have a PR person now, and she says the message should be simple, it should be clear, and it should be to the point, and so she tried to fit it into one line.
We fund those who contribute to the open internet, you know, because that's what it's all about; why else are you here at FOSDEM? And, yeah, we're just very happy that we can help there. So what do we mean when we talk about the open internet? Well, we should be able to communicate directly, right? Get rid of big tech, which sits in between our communications. No dependencies, no lock-in, just get the source, compile it yourself, and that way we can have a good democracy; we can be independent and not have to live in fear that some service is going to be taken away from us, because, you know, we can run it ourselves. So self-hosting is a thing that we very much promote. Yeah, free software, free society. And this logo here, Next Generation Internet, is the thing that has me standing here, because that's the fund by the European Union that is providing over 90% of the funding that NLnet is able to give out. We have been giving out money for decades now, but we were always a very minor operation until the EC decided that, you know, there's so much software in this world, we're running on it, we're depending on it, we should also be owners of it and invest in it. So that's what the EC is doing now, and we are one of the facilitators that seek out the right projects to be supported. So we fund open software, hardware, standards, documentation. When you submit a proposal to us, it has to be something that you can deliver, that you can push or publish somewhere; not, for example, server maintenance, or having a meeting, for that you have to go elsewhere. We like to check what the money is being spent on, and that's also what we have to report to the people that give us the money, but we try to keep the bureaucracy very low. Yeah, self-hosting. So self-hosting, of course, means system administration. Who here likes system administration? 50-50. Yeah.
So, yeah, it doesn't always go well with system administration. You're in the basement in some organizations. In the Netherlands, we're only small, so I get to sit with the other people; it's not all that bad. Once a year, you know, you have System Administrator Appreciation Day, which is awesome, right, if people remember it, and if they're not on holiday, because it happens to be in the middle of summer. So, yeah, not everything's perfect. Okay. How do you use NixOS in a small organization? That's what this talk is about. In the Netherlands, we're currently ten people; when I started, we were four, so we're growing. Also, when we started, we were running a bunch of different systems, with backups sometimes, no commits of the configuration, so no history of what was running. Mail, for example, was running on a BSD system with ZFS, so it had snapshots; that was pretty good. And our requirements are really not that crazy. We need mail, website, telephone, you would think. But then if you drill down, there's quite a lot of stuff that you need to keep running, actually. So here's what we have that is free and open source software. A website, obviously; it's run by nginx. Our email server is self-hosted, mailing lists, we have our own code forge. Well, what makes us tick? Our grant management system. That is running using open source components, and chat, video, and since a short while micro-blogging we are also hosting ourselves. But not everything. For example, our router, which we could do, of course; we haven't gotten around to that. The printer: open hardware for printers, that's not worth it right now. We have some people using Apple devices, so it's not completely open there either. BIOSes and chips: I mean, we support people designing chips; we're not yet at the stage that we can also dogfood those. But we have quite some components that we do ourselves.
So when we chose a system to get rid of the whole collection that we had before, what options were there? Well, there's NixOS, there's Guix. We could go to a closed cloud, but obviously that would be very bad for our image. Or we could go to an open cloud hoster, of which there are more and more now. But we said, well, we are funding projects, projects are sending us their code; it would be great if we could also try to keep our knowledge about all these systems up. So let's try to do it all ourselves. And NixOS has quite a lot of advantages, and also some disadvantages. The declarative part, yeah, it takes some getting used to, but it's really useful, right? It's just nice static files. It's mostly reproducible, and by mostly I mean 99.99% for the stuff that we use, at least. Extremely many packages, as you've seen in the talks just before now. You can mix versions of stuff; I'll show you a bit later how we actually need to do that. The Nix language, well, there's always a lot of discussion about it, but personally, I really like it. You have to get it, but then it's great. It's familiar to us because, before we decided to switch all the systems to it, we were already using it on our laptops, so there's a bias there. The flake lock is very important to us, because we can lock down the dependencies and be sure that whenever we update, it's a conscious choice to do so. Proprietary packages are packaged, but they're disabled by default, so we don't have to worry that by accident we start to depend on closed software. Yeah, there are some downsides as well, from our perspective. The community is organized on a proprietary system; a lot of open source projects these days are, and we really promote self-hosting, so if a project is self-hosting, that's a plus in our book. Another thing: not everything is as polished as it could be. I'll show you that we are using an officially unstable feature. And there's no storage handling.
And what that means, I'll get back to as well. So there are a lot of green flags there. Full disclosure, NixOS is a partner of ours: when people get funded at NLnet, they also get services, so they get free packaging, and NixOS is providing that. So we are a bit prejudiced when choosing NixOS. As for me, I've been using NixOS a long time, but I always found it very difficult to write the packages, until one day I had to explain to a colleague of mine how these files work. And I was sitting there and suddenly it clicked: yeah, everything is a function. I mean, it's called a purely functional package manager, but still, somehow it didn't click. But then I had to explain to him what these brackets at the top with the colons are. And yeah, that's the arguments to the function, and the rest of the file is what comes out of the function. There are many NixOS developers who are thinking, wow, this is a newbie here, and I feel a bit embarrassed to say it, but once that clicks, it's really a very nice system, because, like Jsonnet or Haskell or other functional languages, it's very predictable in what it does once you get it to do what you want. So is it just Nix? Is that enough? How do you deploy it to many systems? There was a talk by Solène Rapenne a few years ago on all the possible options that there are to deploy NixOS to a number of systems. So there's a whole list here, and in her talk she explained what the pros and cons of each of these systems were, and that was very helpful to us; that's why I wanted to highlight it here. That was really amazing work that she did. And in the end, what we chose is to keep it simple and do everything with nixos-rebuild. That's the basic command that everybody's using when you're using NixOS, and it turns out you can just manage your servers with that. So all of our systems are defined in one Git repository; they're all defined in one flake.nix file.
Each machine has a configuration.nix and a hardware-configuration.nix, but there are a lot of placeholders there for stuff that we import from another directory, where most of the services are configured. And we try to keep it simple and readable for everyone. We use a JSON file that has sort of the structure of our setup in it, and that's imported and readable as variables further on in the system. So if you do a nix flake show, and flakes are the not-yet-completely-stable part of Nix that we are using, then you will see that nixosConfigurations has five servers in our case. And what we do to deploy is we type nixos-rebuild switch, and then we say here's the flake for the server, and it should go to that server. So that's how our deployment system works; it's just built into Nix. And this is our machines JSON, so it tells us what the IP number should be for the different machines, which name servers they should talk to, where the secrets are. Secrets management is really done with rsync by us: when the machine reboots, we don't store the secrets in the Nix store, we just copy them into the /run directory with rsync. And yeah, here's the flake. So we are mixing an old version of nixpkgs, because we haven't completely switched yet, I'll explain later why, with the current nixpkgs. I mean, you can just do that, you can put it together. So these are the inputs, and then here is the function that defines the outputs, where these things come in. And this is a very simplified version of how we define each of our machines. We have a function called makeSystem which takes the hostname and the definition, and we define our systems by looping that function over all the machine definitions. It's a bit more complicated because it has to know which inputs to use on which machines, but this is sort of the magic that makes us able to just use nixos-rebuild.
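The shape of that setup — one machines JSON describing every host, looped into a plain nixos-rebuild invocation per host — can be sketched as follows. The JSON layout and host names are invented placeholders, not NLnet's actual file; the nixos-rebuild flags are the standard ones.

```python
import json

# An invented machines.json, with the kind of structure the talk
# mentions: per-host IP, name servers, and so on.
machines_json = """
{
  "web1":  {"ip": "192.0.2.10", "nameservers": ["192.0.2.1"]},
  "mail1": {"ip": "192.0.2.11", "nameservers": ["192.0.2.1"]}
}
"""

machines = json.loads(machines_json)

# One deployment command per host, using only what NixOS itself ships:
# the flake attribute picks the host's nixosConfiguration, and
# --target-host sends the result to the remote machine over SSH.
commands = [
    f"nixos-rebuild switch --flake .#{host} --target-host root@{cfg['ip']}"
    for host, cfg in machines.items()
]
for cmd in commands:
    print(cmd)
```

Keeping the per-host facts in JSON and the logic in the flake means non-Nix colleagues can still read (and review) what the fleet looks like.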
Now, when you're setting up your system, this is the thing I think is most important: the alerts. The computer has to do stuff automatically for you, and you would like to make sure that it continues to do so even while you're sleeping or while you're giving a talk at FOSDEM. So I'm very happy that this box in my mail folder has not had any unread messages for a very long time now, so that's great. Our alert board is green most of the time. We have a very particular alert here which is called "nixos flake committed": if somebody deploys without committing first, it goes red, because then it's undocumented what our system is doing. This was a zoom-in, but I think it was good enough to read. Yeah, so backups, that's the second most important thing for your system. We use Borg for backups and btrbk to do snapshots every hour. And here's a small point of critique for NixOS, or actually a feature which is not really there at the moment. When you do anything with software, it also needs data; you have to say where the data is, and everything is declared in Nix, except the folders have to be written by hand, or they're set by defaults in the services. Doing backups, there's no enforcement that there is a backup, or an easy way to do the backup. In the setup of your backup system, you have to repeat all the directories again, or you define them at the top level and then use the variables for those directories everywhere. This is a thing that might be a bit more polished; it's an opportunity for a new system, a new extension. So, mail. Who here is hosting their own mail? Wow, that's not enough. We need more people hosting their own mail. It's so important; it's still the backbone of all your communication, email. We really want to self-host, we were self-hosting, so when setting up a new system it would have felt like a defeat to stop doing that, so we continue doing it.
And NixOS has a project which is called Simple NixOS Mailserver, which ties together Dovecot, Postfix, LDAP and Rspamd. It didn't use to tie in LDAP, but we needed that, so we paid a contractor to add this support and upstream it. So that's what we're using right now. However, we're NLnet, so we're funding a lot of projects. We're also funding Stalwart, a simpler, all-included Rust implementation of a mail server, and we're also supporting Mox, a Go implementation of a mail server. And we're soon going to try out Stalwart on a less important mail domain of ours. Yeah, and then you get these wonderful 100% scores, if you fiddle around long enough. Well, actually we didn't have to fiddle that long, because the Simple NixOS Mailserver really configures your mail properly, and this wonderful website, internet.nl, is what you can use to check if your mail server is actually configured correctly. One highlight of NixOS that we really value is the testing. Testing two computers working together is made very easy in NixOS, because there are Python scripts that you can call: you set up both computers, and you tell them how to talk to each other and what the expected outcome is. Many of these scripts are just part of nixpkgs, so you can read how this testing is done, and for your own setup you can also write those scripts, and that's great. We run that in CI via flake checks. Well, sometimes something can go wrong. You don't have to be a genius to see what's going wrong here: we are sending the configuration of server one to server two. And this is where the system that we saw earlier comes in handy, the automatic boot assessment, because this really killed a deployment one time.
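The Python test scripts mentioned above have a very readable shape. In a real NixOS VM test, the machine objects are provided by the NixOS test driver; here they are stubbed with print statements so the flavour of such a script is visible on its own. The service and command are illustrative, not NLnet's actual tests.

```python
# Stub standing in for the machine objects the NixOS test driver provides.
# wait_for_unit and succeed are real driver methods; everything else here
# is a simplification for illustration.
class Machine:
    def __init__(self, name: str):
        self.name = name

    def wait_for_unit(self, unit: str) -> None:
        print(f"{self.name}: waiting until {unit} is active")

    def succeed(self, cmd: str) -> None:
        print(f"{self.name}: running {cmd!r}, expecting exit code 0")

server, client = Machine("server"), Machine("client")

# The test scenario: bring up the mail service on one VM, then check
# that the other VM can reach it over the virtual network.
server.wait_for_unit("postfix.service")
client.succeed("nc -z server 25")
```

Because these scenarios are plain Python driving declaratively defined VMs, they slot naturally into `nix flake check`, which is how the talk's setup runs them in CI.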
So, while I say we keep our system simple and try not to build on top of stuff, here we decided that it would be a good idea to make a small alias script that takes only one argument, so you don't confuse the two servers with each other anymore. We recovered from this in five minutes, so it wasn't that bad, but I did get a big fright. How do we do updates? I'm just putting this command here. It's not that interesting, but I just want to have it documented somewhere because it's a bit long. We have a number of inputs to our flake, and if you want to update just one input, which is often something we need to do, for example when one of the software packages that we write ourselves updates, then you can update only that flake input with this command. So, conclusions. We like to keep it simple. We just use the basic tools of NixOS, and we put most of the configuration, or we try to move as much as we can, into JSON files so that it's easier to read. So technically, NixOS is really great for NLnet. However, for the average office, it's probably quite complicated to do this. So I think there's an opportunity here for open cloud providers to use a system like this and make it more user-friendly. And in fact, there is a project currently called NGI Fediversity, where the EU is funding us to help create a new hosting stack that will be using Nix. That has just started; we are in the planning phase for this. So if you're interested, look it up, or talk to that guy over there; this will probably be a talk next year. And with that, I'm done, and I'm open for questions or tips, because there are many people here who are more expert than I am. Thank you. Do we have any questions? Hello. Thank you very much. I'm just wondering, you said that you are rsyncing your secrets to the run directory; why are you not using something like agenix or sops-nix for that, which will do it for you so you don't need to do it manually?
So the reason we're not doing that is there were so many options, which made me confused. And also, some of them were putting the keys encrypted, but nevertheless in the Nix store. And I just felt more comfortable doing it with rsync. That's the whole explanation for it. Hi. You said something about Nix not being aware of storage locations. I didn't really understand that. Could you explain a bit more what that means? Yes, so Nix defines where all of your software is coming from and how to compile it, and it puts it all in the Nix store. But of course the software is interacting with data, and there's no sort of type or class which defines where the storage is. So there's no way to say: I'm doing a backup now, just back up all of my systems. Or if you pass a directory into a service, that directory is an object which has been defined elsewhere: it needs to be in the file system, the file system needs to have a type, it needs to be mounted. All of those things are something that you have to take care of. And because Nix is declarative, once you hammer it down, it's fine. But it would be great if you got an error for that at compile time. Any other question? There's a question in the back. I just wanted to react to the storage location thing, because it's interesting. So in NixOS there is a problem: you want to declare things, you want to be declarative. But when you deploy software, the software often comes with automatic migrations, so it performs operations on your state, on your files, at every new deployment. And this breaks the rollback system, because if you roll back to a previous version, you don't roll back the data; you just roll back the configuration. And what could be done here is that the NixOS modules themselves could learn about where the state is. What does it mean to back up an application? What are its dependencies, the PostgreSQL database or whatever?
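As an illustration of keeping key material out of the Nix store, a service can be pointed at a file path instead of an inline secret; the interface and path below are assumptions for the sketch, not the speaker's actual layout:

```nix
# Sketch: the configuration only records *where* the secret lives.
# The file itself is copied to the machine out of band (e.g. with rsync)
# and never enters the world-readable /nix/store.
{
  networking.wireguard.interfaces.wg0 = {
    ips = [ "10.0.0.1/24" ];               # illustrative address
    privateKeyFile = "/run/keys/wg-private";  # illustrative path
  };
}
```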
And it would start to provide a solution for the problem you're mentioning. Yes, exactly. Databases are a whole extra level of complexity and possible data corruption. I think we can take one last question. Hi, thanks for the talk. I might have missed it because I joined a little bit later, but in the configuration, do you have a way that you're happy with to pass secrets? Yes, so the way we pass secrets is we have a top-level JSON file, and there we declare all our secrets. So for root, it needs these secrets: WireGuard needs a private key, the mail needs a password. These are files that have to be under /run/root. And when Nix evaluates it, where do they get stored? Nix doesn't do anything with that; it just writes in the configuration where that file is supposed to be. And then when the machine starts, some service will say, hey, I'm missing my password. So I copy the password in there and then I restart that service. But that doesn't happen very often. We are a fairly small office, we don't have 100 machines, so automating it more seemed like complexity and overkill for our situation. Okay. Thanks. Thank you very much.
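The top-level JSON approach described in this answer might look roughly like the following; the file name, attribute names and paths are invented for the sketch:

```nix
# secrets.json (illustrative) contains only *paths*, not the secrets:
#   { "wireguardKey": "/run/root/wg-private",
#     "mailPassword": "/run/root/mail-password" }
let
  secrets = builtins.fromJSON (builtins.readFile ./secrets.json);
in
{
  networking.wireguard.interfaces.wg0.privateKeyFile = secrets.wireguardKey;
  # A mail module would similarly reference secrets.mailPassword here.
}
```

Because the JSON holds only file locations, evaluating it doesn't leak anything into the store; the secret files themselves still arrive on the machine out of band.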
Dune 3D - the making of a maker's tool
Hi, I'm Lukas. When I'm not writing CAD software of some kind, I usually do hardware projects, some of which I've shown here. As you may see, they're pretty much all the same thing: a circuit board in a 3D printed case. So for designing them, one needs basically two pieces of software: CAD software for the printed circuit board, and CAD software for the 3D printed case. What both of these have in common is that CAD is pretty important there, since what you draw in CAD is what you're going to get. When you're doing, for example, woodwork or metalworking, if you need an extra hole, well, you just drill it. But that obviously doesn't work for PCBs or 3D printing. So yeah, it's pretty important to have proper CAD software there. The first thing, PCBs, I solved for myself a couple of years ago by writing Horizon EDA, but that's not what I'm going to talk about today. For the 3D stuff, I found myself oscillating between FreeCAD and SolveSpace, since both do some things great, but neither of them covered everything I needed. So let me elaborate on that. FreeCAD has pretty much all the features I needed, some of which are STEP import and export and support for chamfers and fillets to make things look prettier with little effort, but it falls short due to the peculiarities of referencing stuff, the sketcher being modal, and not being able to easily make constraints in 3D. For SolveSpace, it's pretty much the other way around. It has significantly fewer features, but these features work really well, and I found it really pleasant to use. So at first I dismissed it, since it doesn't do STEP import and export, but everything else works really well. So is there anything that does all of that? Unfortunately, I didn't find anything, so I thought, well, it's not the first time I've written CAD software, so maybe let's try writing a 3D CAD. So, what do we need to make a 3D CAD?
So first of all, we need to show something to the user. For that we need a 3D viewport with all of the usual stuff like shading, navigation and selection, but fortunately I already did more or less that for Horizon EDA, since Horizon EDA has a 3D preview, so basically all of the OpenGL boilerplate is already done. So we have that. Next up, we need a geometry kernel that takes care of all of the Boolean operations, extrusion and stuff like that, and for that there is, as some of you might know, OpenCascade, also from the talk before. It has some warts, but I had some experience with it from Horizon EDA, and it works okay. And it's also pretty much the only game in town if you want to have chamfers and fillets and proper STEP interaction. So that one's there as well. Next up, we need a solver that takes care of solving all of the constraints and entities and stuff. And for that, there's also something that we can use, in particular the solver from SolveSpace. The solver from SolveSpace is available as a library, but with a small asterisk: the library itself is a C wrapper around the C++ internals of SolveSpace, but the wrapper is pretty limited, so I ended up not using the wrapper and using the internals of SolveSpace directly, and they are pretty easy to use, actually. So we have that one as well. And last but not least, we need a user interface of some sort with all of the boring stuff, like a preferences dialog, a way to select tools, the general tool handling, and all of that little but important stuff, such as the axis lollipop that shows which axis goes in which direction. But fortunately, I already had all of that in some way or another from Horizon EDA. It's a 2D CAD, but well, undo, redo and stuff like that pretty much doesn't care if it's a 2D or 3D CAD.
So yeah, then I realized, well, I have all of the building blocks to make a 3D CAD, so I started with it, and that was back in August last year, and now I'm here to talk to you about Dune 3D, a parametric 3D CAD. As I said, it took about six months to get from basically a blank window in GTK to where we are right now. As probably expected, it's written in C++20, it's about 33,000 lines of code, and it uses gtkmm 4 as a GUI toolkit. Using gtkmm 3 would probably have been a slight bit faster, since I'd already used that for Horizon EDA, so I would have been able to directly copy-paste code. But yeah, gtkmm 4 was the latest version at the time I started, so I went with that; that's probably a topic I could write a book about, since there are quite a few things that were a bit annoying about gtkmm 4. And same as Horizon EDA, it uses UUIDs for everything, and JSON as a data storage format. So yeah, I pretty much reused all of the concepts that worked well in Horizon EDA for Dune 3D. And yeah, just a couple of days ago I released version 1.0, and it's already packaged as a Flatpak, and for the Windows folks there's an MSI installer. The good thing was, well, it wasn't the first time that I had to take care of all of the packaging, so the packaging stuff was pretty much just copy-paste from Horizon EDA again. So what does it do? It has a parametric 2D sketcher that has all of the usual stuff like lines, arcs, circles, and constraints to draw these lines and arcs. There's a convenient all-in-one tool that handles lines and arcs, so one can draw arbitrary outlines in one tool, and there are also some convenience tools for drawing an axis-aligned rectangle or regular polygons, as they're needed, for example, for hex nuts and inserts. To make things 3D, there's extrude and lathe; lathe is basically a 360-degree revolution, and revolutions that are not 360 degrees aren't supported yet.
And to repeat things, there's linear and polar array, and to combine multiple solids, there are the usual operations from OpenCascade: union, difference and intersection. For that, I basically just had to expose to the user what OpenCascade offers. There are also constraints such as distance, angle, or point-to-plane distance. That's useful, for example, when you want to make a hole that stops 3 mm from the last edge: you can just use a point-to-plane distance of 3 mm, and that's it. For the STEP import, I basically copy-pasted the code from Horizon EDA that turns the STEP model into a set of triangles, and I also reused the code for extracting the reference points, since the idea is that you import your circuit board, add some reference points, and then you can reference these points in the geometry, for example if you want to fit your case around the circuit board or make cutouts for connectors. The last important point are fillets and chamfers. These are basically just calling the OpenCascade functions to add a chamfer or a fillet to an edge, but unfortunately the way it's implemented right now is subject to the topological naming problem, since all of these edges are just referenced by index, so if one changes the geometry in a way that adds extra edges or so, it breaks. But well, I was used to that from FreeCAD, so it was okay. So how does it all fit together? In the middle there's the document, which consists of all of the application-specific data structures like groups, entities and constraints. These are presented to the user with the renderer and canvas, where they are turned into primitives that can be rendered with OpenGL, and then the user uses the tools, same as in Horizon EDA, to interact with the document. To take care of the solid model, all of the entities get transformed into something that OpenCascade can understand, and that's again triangulated and rendered.
And to take care of solving things, there's the interface to the solver in SolveSpace. And probably, as is to be expected, the hardest part of implementing all of this were these interfaces between the OpenCascade and SolveSpace parts, since that's where the impedance mismatches are: I had my data model and the data models from OpenCascade and SolveSpace, and it all somehow had to fit together. So what's next? Of course I have some plans, mostly some basic things like measurements, revolutions that are not 360 degrees, or stuff like copy-paste. But the big distinction, from a project point of view, between Dune 3D and Horizon EDA is that with Horizon EDA I at least have the aspiration that one might eventually be able to do really big and complex parts, but I want Dune 3D to be and stay a small and easy to use CAD program that doesn't try to cover everything. It should just be a tool to make simple 3D printed, laser cut or CNC machined cases for PCBs, or maybe something else, but it already does pretty much everything I need for my use case, so it'll mostly stay as is, with of course some bug fixes and UI enhancements. But yeah, don't expect anything big to happen there in the future. And I think that's the end of the presentation, so now for questions, I think. So, questions? Thanks for the talk. Very impressive for this time scale. You were talking about having 3D constraints, and then you just showed an extrusion size, but that's something you can also do in FreeCAD, right? Do you have any other possibilities to do more complex constraints in 3D space? Yeah, sure. Okay, so the question was whether there are any more complex constraints in 3D space. There are some, such as angles or point-to-point distances. And what one can also do, since 2D and 3D can work together by means of work planes:
One could, for example, construct a work plane in the same group as the extrusion that's perpendicular to the extrusion, then do whatever one needs there, and then constrain the extrusion to that. Or one can also constrain the extrusion to another sketch, so one can put the extrusion in a work plane and then do stuff there, and then it's all projected into the work plane itself. So there really is no limit to what one can do, but yeah, that's also the way that SolveSpace works. Thank you for your talk and impressive effort. Do you think that CAD suites with this level of complexity could be a good stepping stone for beginners, and maybe even children, from very simple drag and drop programs like Tinkercad towards something more parametric that they can manage to use once they start to grasp the basics of these kinds of suites? So I think, yeah, it definitely has a learning curve, since one needs to grasp the concepts of constraints, degrees of freedom and such, but I think that's pretty much the same in every parametric CAD. So yeah, there are some idiosyncrasies in terms of the user interface, and it's driven by a global menu that unfortunately has some discoverability issues, but I think it's something that one could also try with children. But yeah, I don't have any experience in the education space. Yeah. Great work indeed, especially for the time you spent on it. In the beginning you showed these tables with check marks, but you didn't explicitly conclude that you had all the check marks for your software. Yeah, so let's go over it. STEP import and export is pretty much done by OpenCascade, since OpenCascade does the import, i.e.
the triangulation and extracting reference points, and export is just calling a couple of methods to take the TopoDS shapes and write them to a STEP file. Chamfers and fillets are just methods to call in OpenCascade, and the three bottom things are basically the same as in SolveSpace, since overall Dune 3D is pretty similar to SolveSpace in terms of operation. There are groups, constraints, entities, and if one knows and likes SolveSpace, they'll probably also like Dune 3D. Right, and another question, thanks. If you had spent the same time on either SolveSpace or FreeCAD, could you have improved them to meet your needs? Yeah, I was pretty sure that question would come up. So, let's go over it. For FreeCAD, I've looked at the code sometimes, and I find that there's really a lot of code, and I think especially changes like having a non-modal sketcher would probably have been way more work. And SolveSpace has its own geometry kernel, probably for good reasons, and from a project conceptual point of view, I think OpenCascade and SolveSpace are pretty much diametrically opposed: SolveSpace has this really nice self-contained thing without that big OpenCascade dependency hanging off the side. So yeah, that's why I concluded, well, it's probably easier to write my own, and I also noticed that I really like writing CAD software. Okay, we have time for one more question. I use CAD software to create 3D models to render on PCBs, and I felt that programs like SolveSpace are missing color support for faces. Does your Dune 3D support this? Right now, it doesn't support colored faces. Yeah, I'd have to look into how to accomplish that with OpenCascade. These are always the topics that are a bit tedious, and yeah, well, it's OpenCascade, and as mentioned in the talk before, it has a rather cryptic API. But the good thing is there's FreeCAD, and FreeCAD is pretty much the best OpenCascade documentation there is. Okay, thank you.
Okay, thank you very much, Lukas.
Comprehensible Open Hardware: Building the Open Book
Good morning everyone. As said, my name is Joey Castillo, and I'm here to give a brief talk on, I guess, comprehensible open hardware. For the record, I'm not a maker of open hardware tools; I'm just a humble user of them. But yeah, this talk comes from the perspective of someone using the tools made by the folks in this room to learn from open hardware designs and make some of my own. One of the first things that I wanted to build when I got into open hardware was called the Open Book. This is an open hardware e-reader, more or less. I wanted to make this for a long time; way back in 2018, 2019, when I really wanted to make something like this, I didn't actually have the skills to make it. So to get there, I went online to steal as much as I could from folks like Adafruit, who make open source hardware. In opening up their designs for things like this e-paper driver board, they let me copy a lot of what they did for their gadget into my gadget. But I'm getting ahead of myself here. The Open Book is the thing that I wanted to make, and I had some goals for the device. Those goals were pretty simple, or simply stated: I wanted to use it to read books. As I pitch it to new acquaintances, it's like a Kindle that you build from scratch. I wanted it to support reading text in all the languages of the world. And I also wanted it to be affordable, accessible, and, for lack of a better term, kind of DIY-able. So just to give you an idea of what the device is and what it does, here is a short video of it in use. Here's a listing of books and short stories on the device, and I can launch this short story by Leo Tolstoy, which of course renders in Russian. The center button goes back home, where we can select a different work, like here the Tao Te Ching, rendered in Chinese. So I think it's pretty fun as projects go, but the fact is, that's only half of it.
The other half is, I wanted the Open Book to be comprehensible to the person who builds and uses it. It's through open hardware that I learned to build open hardware, and I really want to pay that forward to the people who have their own Open Book. To explain some of how I tried to do that, I have to flip it over to the back side. There are a lot of issues with this first revision of the Open Book, but I'm showing it first because of this sort of dense silkscreen text that kind of became my trademark. Back when people were on Twitter, multiple folks at various times called me the Dr. Bronner's of PCB design, for this habit of filling every millimeter of my board with text. Up here, I'm narrating the entire soap opera of an ideal diode circuit giving five volts to a regulator, which is interesting, I guess. But why? Why should I pack my board's silkscreen full of this kind of stuff? To answer that, I need to briefly get into my ideology of why I got into open hardware. The problem, as I see it, is that closed tech, especially as shipped by the big tech companies, fails to serve users of the technology. I tend to look at technology through the lens of power. Take this Kindle, for example. Who is this technology designed to empower? And while, yes, it does allow you to read books, I'd argue it is designed to empower Amazon. It's designed to push you into dark patterns that make you spend more money with Amazon. It's designed to surveil your taps and profile your habits for Amazon. It's designed to steal your attention and monetize it by selling ads for makeup or toasters. Meanwhile, the end user just wants to use the device on their terms, without ads for toasters, and is prevented from doing so by the platform owner. The big question for me: why does big tech get away with this? And the answer that keeps coming back to me is that the technology is fundamentally incomprehensible to the end user.
A device like this arrives fully formed as a slab of glass and plastic, and it's meant to be used in the ways the platform owner sets forth. It's not meant to be understood or hacked or made to better serve the user. So what can we do about it? Well, I don't have all the answers, but in my practice at least, my goal is to make tech that folks can understand. My theory is, if we can design well-documented open hardware that people can build on their own and understand, at least in the broad strokes, we can teach them that they don't have to accept technology that wasn't made with their best interests at heart. There is this fantastic quote from bunnie Huang in a blog post about hacking his Garmin smartwatch. He writes: the point of open hardware is not the ritual of compiling our stuff from source. It's the awareness that technology is not magic, that there is a trail of breadcrumbs any of us could follow to liberate our digital lives. So with that in mind, what are the breadcrumbs? What trail am I laying down for users of my objects to follow? Over the course of a few years, I've had the opportunity to design several different versions of the Open Book, and I think I've found three different sets of breadcrumbs for three different contexts. The first has to do with helping the user understand how the gadget works, the second with helping them understand how to build the gadget themselves, and the third with explaining how to make use of the gadget. Let's take the first one first. This is one of the earliest Open Book prototypes, and my vision back then was to use the silkscreen to narrate what each component on the board does. This has some benefits, I think. On the plus side, this could demystify the tech for someone who sat down and actually read the silkscreen. On the downside, I have to say, space is limited, and I'm honestly left wondering if this is the most useful information to give the user.
Like, I want to demystify the tech, I want them to feel like this is something they can understand, but is understanding how a MOSFET works the best way to do that? The best answer I can come up with is: maybe. Still, there were a couple of bigger issues with this version of the Open Book. The parts are kind of small. These are 0805 passives, which are pretty small for the average folk. They're fine-pitched parts. There are also parts like this microSD slot, which has its solderable connections hidden underneath a shield. I borrowed that footprint from Adafruit, which is open hardware and great, but they design for manufacturing, not hand-building. There are also, honestly, just way too many parts on this board. It's trying to do too many things, and it's overwhelming to someone trying to build and understand it. So, yeah, this realization led to a new design that I called the abridged edition. This version cut the part count down considerably and tried to make it as simple as possible for people to build themselves. I used bigger passive components, 1206, and I picked parts with pins that are easily accessible, like this new microSD slot. Instead of making folks solder down a fine-pitched microcontroller, I used the Raspberry Pi Pico module, which has a super-friendly 0.1-inch pitch. Yeah, some parts like this flex connector I could not buy on a module, but then I realized I can make my own module and have it preassembled. This little castellated module, the green part, includes the e-paper connector as well as the whole supporting boost circuit. I ordered dozens of these for a few bucks apiece, and I offered them alongside my main PCB. This meant that DIY makers only had to plop down one module to get the display working, instead of a dozen densely packed fine-pitched parts. I also decided that rather than using the silkscreen to explain things, I could use it to explain how to build the thing.
Adding step-by-step instructions alongside each of the parts on the board is a different trail of breadcrumbs, but the upshot was, you could follow the instructions literally counter-clockwise around the board, and if you followed them all correctly, you would end up with a working device. Okay, so, things I like about this set of breadcrumbs: well, it is super effective. Since releasing this design, dozens of people around the world have assembled their own Open Book boards without any of my involvement. These photos are community builds that I never touched; I didn't even send them a board. These are people going in on group buys and part swaps, and having enough success with the build that they've moved on to hacking on the firmware, which is exactly what I wanted to see. I'll also say, we did a workshop at Hackaday Supercon in 2022, very ad hoc, not a formal workshop. We just sat on the floor of Supplyframe HQ, and I guided a dozen people through building their own Open Books, and every single one walked away with a working device. Like, the plan worked. Still, after the abridged edition and doing these workshops hands-on, I realized that to make the project scale to more people, I couldn't rely on everyone soldering it together themselves. I would have to have most of the thing done for them. This means I'm no longer using the silkscreen to tell folks how to solder the thing together. Still, I did want to use it to do something useful; I still wanted to encourage that comprehensibility that we were talking about earlier. In this case, I kept something from the original Open Book: arranging the components in functional blocks, even if I can't fit room for narrative text to describe what they do. These blocks match what's in my schematic and how the components are grouped over there. This still gives people an overview of how the device works. You can see this is not a pile of parts arranged haphazardly.
These parts work together in ways the user can understand. This is the battery charger. This is the power supply. Still, there is the question of what to do with the rest of the board space, and I can't leave it blank, so... The trail I'm finding most useful these days is the trail that leads to making use of the device. For this latest version of the Open Book, I'm including pin assignments, as well as notes on how to develop firmware for the device, right there on the circuit board itself. So, I'm going to be honest with you: I use this a ton when I'm writing my own firmware. Like, I am lazy, really. Sometimes I don't want to search my own documentation. Sometimes I didn't even write the documentation. If I don't want to open my schematic to try to decode what I was thinking when I designed this thing six months ago, what are the odds a user is going to go to all that effort themselves? Having the docs right there on the board is an affordance for people making use of my device, and as I found out, I am one of them. This also works on boards of many shapes and sizes. This is the circuit board for Sensor Watch, the Casio wristwatch mod that I'm wearing on my wrist. Also, a shout-out to Lukas, who was just up here: I learned everything about making a Casio board swap from your open hardware Pluto watch, so thank you for that. Sensor Watch owes its existence to your project. Anyway, you can see here we're on the backside. This board is less than one inch in diameter, but we're still able to include notes about which pins are which, their capabilities, and even which on-chip peripherals I expect you to call on to make use of those pins. Self-documenting circuit boards like these attach relevant information to the hardware you already have physically in hand. This board doesn't just have pin labels; it has a narrative of how you wire it up. It doesn't just have component designators; it tells you what they mean for the device configuration. Oops.
It creates a self-contained artifact. This is a prototype of a new version of Sensor Watch. I'm still working on the pin assignments, and they may change before it's final, but even if I put this down and pick it up in six months, I don't have to cross-check a revision number with a schematic and a datasheet to get hacking. All the relevant information is literally in hand. Moreover, that information becomes available to the end user as well, unlike closed-source objects, which you have to painstakingly reverse engineer. Putting this information on the board itself makes the object hackable by default. We're throwing the doors open to the end user without forcing them to do so much as a web search, much less a deep dive into my repo. Also, just as a side note, this technique pairs very nicely with code that makes use of the same names. If your silkscreen says you named a pin "button alarm", and the headers for your board support package also name that pin "button alarm", you've made everyone's life easier, including, actually, maybe even especially, your own. Once again, I am not someone who invents open hardware tools; I am just a humble user of them. And I don't have all the answers when it comes to making, or helping folks to grok, the stuff that we make. Still, these are some of the ways that I have tried to make some of my stuff more transparent. And I just want to close with some questions that I can ask myself, and we can ask ourselves, as we finalize our designs and send them out into the world. Questions like: how would I imagine someone using this device? Am I offering affordances that make it likely they'll achieve what I hope they'll achieve? What kind of information would I want to give a user of the device, both at a basic level and at a more advanced level? And also, I didn't put it on the slide, but: what would I want to know if I'm picking this up after six months and I've forgotten most of my design choices?
Most of all, can I tell the story that I'm trying to tell, the story of the device, in a way that makes sense? Because if I can figure that out and print it right there on the board, both the artifact and its backstory will live together forever. Anyway, that is what I wanted to share today. So I'm going to put up my info, and I would love to take questions if we have any. Thank you. So first of all, I love the product and I love your philosophy on open source. Does the Open Book support EPUB files right now? It does not. So the Open Book uses an ESP32 microcontroller; it's a very resource-constrained platform. At this time, I'm supporting plain UTF-8 text. That is my file format of choice. That might also be a bit of an ideological choice: I like the idea that a plain text file can represent a literary work. Plain text feels powerful. I think if space aliens come and see the ruins of our civilization in a millennium, they'd probably be able to figure out UTF-8. I'm not sure if they'd be able to figure out the plethora of things that go into... EPUB is just a zip. Yeah, having said that, folks ask this question a lot, and now that people are hacking on the firmware, I think it's entirely possible. I think the ESP32 is a capable microcontroller, and I'd be curious to see what folks come up with in this space. So while it's not something I'm working on myself, this is the ethos of open source, right? Throwing the doors open to folks. Awesome, thank you. Thank you. Yes, but for me the problem is my vision, in the sense that there is a lot of SMD, surface-mount devices, and for me it's not practical. You need a whole setup with magnification for this precision soldering, so it's not for everybody, this kind of thing. So I know hardware, but hardware that in the past was easy is now very difficult. And if you do a book format, it's small as well.
I prefer a large format, with the chance to make annotations and so on. So what do you say about that? I think you're absolutely right, and I think this is the reason that I'm starting to move toward getting it PCBA assembled. Maybe the experience of building your own book is like taking a circuit board that's assembled and putting it into a case of your choosing, or 3D printing a case, and maybe that's the larger thing that you're putting together. But I totally understand not everyone is going to be able to solder these fine-pitched parts, and yeah, I think maybe my appetite for DIY got ahead of my understanding of everyone's capability or desire to DIY. So I think you're totally right, and yeah, I'm probably moving toward more PCBA in the future. Can you pass the microphone back to Andy? Hi, thanks. It's really great. A couple of questions. So the silkscreen can mess up your board now if you make a mistake in the documentation. That is correct. It makes me very diligent about triple-checking things before I send it off, but no, I will not lie, that has happened to me before. No way of automating that? I'm very curious, and maybe some of the folks in this room have ideas, but I do like the idea that if I know I want to annotate, for example, a line on my microcontroller symbol, I want that to be on my silkscreen. It would be very interesting to see if there are ways to link those things together. I haven't yet run across ways to do that, but if anyone in this room knows tools that can help me with that, I would love to do more of that. Can you do field substitution in KiCad in the silkscreen? Okay, cool. I've been told I can do field substitution in KiCad in the silkscreen, so that is awesome. As a user of KiCad, I will check that out. Any plans for having a camera on the book, so that you can scan the board and show the schematics and the documentation on the ebook itself?
Interestingly, not only on the ebook itself, but I have a colleague who's working on using kind of QR-style codes to get a better sense of the assembly of various devices. I think there's a lot of possibilities there. I also had a slide about this idea: I like the idea of putting things like QR codes that contain text, not URLs. If I could put a basic readme in a QR code, and you scan it and you get the full text of a pinout or a description, that would be very interesting. But yeah, possibilities. I see one more in the back. There's a question online about the ETA. So, yeah, the question is, am I planning to offer the Open Book online, or is there an ETA? And I hope to do a Crowd Supply campaign at some point this year. It's hard to find the time to do all the things I want to do, but hopefully by the end of the year, hopefully in the next few months, I'll have a pre-launch page up and we'll be able to put something out there. Okay, so thank you, Joey. Thank you. Thank you all.
LibrePCB Status Update
Hello everyone, my name is Urban Bruhin, I'm the founder and main developer of LibrePCB, and today I will give you a short update about the LibrePCB project. So for those who do not know LibrePCB yet, it's an open source EDA software to draw schematics and design PCBs. The main goal is the same as KiCad's, but there are some differences. Of course it is cross-platform; it runs on almost every computer: Windows, Linux, macOS and more. Its main goal is actually to make creating hardware easy, efficient and more or less foolproof, with an intuitive user interface and powerful architectural concepts. So while the intuitive UI is especially helpful for beginners to get started easily with PCB design, it's also intended for professional users, for example those who care about things like a sane file format or a command line interface to automate some tasks. So let's take a look at what happened in the past one or two years, because there is some great news. Especially the end of 2022 was an exciting moment, because I started to work full-time on LibrePCB; I've now been doing it for a bit more than a year, and of course this leads to a lot more progress than in the many, many years before. In addition, the LibrePCB project has been approved by the NLnet Foundation to receive funding through the Next Generation Internet program, which helps a lot to keep the full-time development ongoing. Then our fabrication service got PCBWay as a new manufacturing partner, so if you order PCBs through LibrePCB Fab, you can now choose between AISLER and PCBWay. Also, I'm very proud to have several new sponsors on board from last year: Bittele Electronics, NextPCB, PartStack, PCBGOGO and Win Source. Last but not least, there are many individuals supporting the LibrePCB project with donations or other kinds of contributions, for example translations or creating libraries and so on. So with these sponsorships and the donations, the LibrePCB project raised around $8,000 in 2023.
In my opinion, that's already quite amazing for this still relatively early state of the project. So at this point, I want to thank all the supporters and contributors for your trust in the LibrePCB project. This really makes me happy, so thank you very much for this support. I take it as a sign that LibrePCB is on the right track, so I hope it's okay to continue this way. Nevertheless, it's still a very long way to go until we have stable funding for the full-time development, so I hope this support continues for many more years. Other things which happened beside the application development are a completely new website with much more content, a new documentation system with more documentation, and, for a few months now, official video tutorials on YouTube. Not complete yet, but at least a few of them now. But now let's take a look at the application. In September last year, version 1.0 was released, which was a very exciting moment. And besides many new capabilities in the board editor, like thermal relief pads and so on, this release also added a 3D board viewer with STEP model import and export, which is not only fancy, but also a great way to review the design before ordering the PCBs. But actually, I mean, the 3D viewer is known from many other EDA tools. Probably every EDA tool is able to show such a preview. I'm actually especially proud of two features which make generating production data really a pleasure. First of all, we have introduced comprehensive support for assembly variants and manufacturer part number management. So MPNs can now be stored in libraries, so you don't need to add them again to every new schematic where you need them. In the schematic editor, you can even assign multiple MPNs to one component, to export them as second source parts to the BOM. I mean, who didn't experience any supply chain issues in the last few years? So it's nice to actually be able to specify second source parts.
And you can even specify different parts for different assembly variants. For example, assemble a 10K resistor in one assembly variant and a zero-ohm resistor in another assembly variant. And to actually make generating these BOMs and any other output data a matter of seconds, we introduced output jobs as a new unified way to export any data. These output jobs can be configured very flexibly and are stored within the project, so the exact same output files can be reproduced on a different computer. You don't need to configure anything again. And since LibrePCB provides a command line interface, it's also very easy to fully automate the production data generation, for example if you like to use a continuous integration system. So now, a short demo is worth more than a thousand words, so I would like to quickly show you a few of the features. I hope this works. Okay. On my screen it looks completely different, but okay, I think you understand what should be there, right? So the first is the 3D viewer. Let's see if it actually... yeah. More or less. Okay. Strange. So I just want to show you that the 3D feature is very, very easy. Actually, you don't need to care about it. You just add a resistor or whatever to the schematic, and our libraries have the 3D models built in, so you don't need to care about them. You add the part to the board editor, let's say a THT variant, and it immediately appears in the 3D view with a 3D model. And it's even possible to switch between different footprints, for example a different pitch, and the 3D model is automatically updated to the new footprint. Or, for example, a vertical mounting variant. So you actually cannot even do anything which isn't compatible; the model is always assigned to the footprint you choose. So, yeah. Now let's take a quick look at MPN management. I mean, in the most simple use case, you just want to add some component.
And you now have the option to actually choose a concrete MPN, because they are now stored in the library. So you add a component by its MPN, and let's quickly also add it to the board to actually make it appear in the BOM. And when you export the BOM then (I think it was LED3), it immediately appears with the MPN you just assigned. So it's very easy to generate high-quality production data. Another use case, for example, as I mentioned before: if you want to add a second source part, you can just choose a different part, let's say from a different manufacturer, and add it to the same component. It is listed as an alternative part now. And if you export the BOM now, you have a new column with the second source MPN. So there is no need anymore to manually adjust the BOM after generating it, before sending it to the assembly house. You can generate it completely finished; no manual rework needed anymore. Then, to actually generate the BOM, you can use the output jobs feature I just mentioned. Every job means one or more files which are generated. For example, there is one job to generate the Gerber files. And if you, for example, like to send the Gerber files in a zip file to the manufacturer, you can just add a zip output job, choose to include the Gerber files in the zip file, and maybe also the assembly PDF. The output jobs are stored in the project, so you have to do this only once. And now you can generate production data, for example for single jobs: just double-click the job, and the files are generated and opened. Or you can generate all data at once, and you get, for example, the zip file you just configured, containing the Gerber files and the assembly PDF, just like you want to have it to send it to the manufacturer. So no manual file editing or archiving is needed anymore.
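The second-source idea above is easy to picture as data: a BOM row simply gains an extra column for the alternative MPN. Here is a minimal Python sketch; the column names and part numbers are illustrative, not LibrePCB's actual export format:

```python
import csv
import io

# Illustrative parts: LED3 carries a second-source MPN, R1 does not.
parts = [
    {"ref": "LED3", "value": "LED red", "mpn": "LTST-C190KRKT",
     "alt_mpn": "KT-0603R"},
    {"ref": "R1", "value": "10k", "mpn": "RC0603FR-0710KL",
     "alt_mpn": ""},
]

# Write the BOM as CSV, with a dedicated column for the alternative part.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["ref", "value", "mpn", "alt_mpn"])
writer.writeheader()
writer.writerows(parts)
bom_csv = buf.getvalue()
```

The assembly house then sees both options on the same row and can substitute freely when one part is out of stock.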
So, if you make any change to the project, one click and you have all files updated. But of course, not everyone likes to manually generate output files, even if it's that easy, because there is an even easier option available. If you don't like to care about all these things, just start ordering your PCB right within the application. It's uploaded to our fabrication service website. You even get ERC warnings if you didn't resolve them in your project yet. You can choose your manufacturer and you are just forwarded to the manufacturer you like. And without handling any files manually, you have your project... okay, I was too fast. You have your project ready to be ordered. Just enter your shipping address, payment information and so on. That's it. So, let's switch to the slides. Okay, so now, what's the overall state of the project? Generally, LibrePCB is fully functional and can be used productively for projects which are not too complex. Not too complex, because hierarchical schematics and buses are not supported yet, and also the trace routing tool, and actually the board editor in general, is still rather rudimentary, so from time to time it might be a little bit inefficient. And of course, the part libraries are always a problem. They are not very comprehensive yet, but at least with LibrePCB it's very, very easy to create the missing parts by yourself. So, a quick outlook now. The upcoming release will contain an EAGLE project importer, so it can import complete EAGLE projects. And there's also some work ongoing currently to integrate live part information into the application. When you add a component to the schematic, you should then immediately see the part lifecycle status, stock availability and the price. So this will be very useful.
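The zip output job described earlier can be pictured with a few lines of Python: bundle already-generated production files into one archive, reproducibly. The file names and contents here are dummies, not LibrePCB's real output layout:

```python
import io
import zipfile

# Dummy stand-ins for generated production files.
outputs = {
    "gerber/top_copper.gbr": b"dummy gerber data",
    "gerber/bottom_copper.gbr": b"dummy gerber data",
    "assembly.pdf": b"%PDF-1.4 dummy",
}

# Bundle everything into one zip, ready to send to the manufacturer.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in outputs.items():
        zf.writestr(name, data)

archive_bytes = buf.getvalue()
```

Because the job definition (which files, which archive layout) lives with the project, regenerating the archive on another machine yields the same structure with no manual archiving.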
So I hope we can make it happen. And yeah, clearly from time to time some technology updates are needed, for example switching to Qt 6. And for the long term, as I mentioned, the trace routing tool needs some improvements, and also hierarchical schematics and buses; I think these are a must-have. Yeah. So, if you like to support my effort in creating an easy and powerful EDA software for everyone, I would be very, very thankful for a donation, to keep the full-time development ongoing as long as possible. And there are also many other ways to contribute; just check out the link here. And if there is any Wikipedia editor right here, please let me know. We are looking for some help to publish a Wikipedia article. And please let us know your feedback in the feedback survey. So, yeah. The slides are online. Here are some links to get easily started with LibrePCB. That's it. Thank you very much. Thank you. Thank you for the presentation. I'm using Altium Designer and KiCad, and I work at a shop where Mentor was used. How is the state of the import of Altium, KiCad and Mentor? It doesn't exist yet. Do you have plans to implement either of those? Any plans to implement these imports? I think KiCad import would be quite obvious. For the other ones, I don't know yet how much effort is needed, or how well these file formats are known, how to read them. So, yeah. I think some day we will look at these imports, but it's of course not a high priority. So, did you encounter any problems with patents or something during your development? Because I'm developing a clone of a commercial software where I'm dealing a little bit with some patents that I might violate. Sorry, I didn't understand. Patents. Did you have any issues with those, like registered patents of companies? So far, I didn't have any problems with patents, but yeah, I'm not an expert in this area.
So, I just try to take care of the licenses of the things I use, to hopefully not do anything against the license terms. Any other questions? Okay, thank you, Urban.
ngspice circuit simulator - stand-alone and embedded into KiCad
I translate. No, no, I directly plugged into the laptop. Okay, so we are going to continue on. The stream going out that is being recorded looks nice, so the rest of us, we're going to suck it up and just listen to what we are here to learn from Holger Vogt. So please give a round of welcome to Holger. Yeah, okay, so many thanks. ngspice, circuit simulator: a talk about stand-alone and embedded into KiCad. Well, I'll give a short introduction to circuit simulation, then talk about what's new in ngspice, talk about the KiCad-ngspice interface and give some simulation examples, and conclude with what is next. Yeah, why circuit simulation? You emulate electronic circuits in software. It should be cost-efficient and time-saving. That's it. Some details: of course, you can check functionality without making hardware. That's very important if you do IC design, because fabricating an IC with a defective circuit is very expensive. You can check for parasitic elements. You can make variants very easily; you can change some device parameters and see what is happening. You can evaluate new concepts without too large an effort. You can cross-check against automatic circuit generation as a final simulation test. You can anticipate reliability, make degradation simulations. And it's a good learning experience, because you can look into a circuit without using hardware to do so. You can see the voltages and currents in the different branches. Very interesting. Yeah, ngspice, what is it? It's a circuit simulator that numerically solves equations describing electronic circuits. It can also be other types of circuits, for example thermal, or could also be mechanical. And you are mostly interested in time-varying signals; in electronics, that's currents and voltages. It's the open-source successor of the venerable SPICE3 from Berkeley. Okay, we have a circuit. This is a very simple circuit, an inverter with two transistors. And this is the entry to ngspice.
So ngspice is a command line input tool. Many people said, ooh, command line. But I've just learned command line is very nice; KiCad has got a command line, and other software also. So we are not too bad with that. Okay, you have the netlist, the SPICE netlist, which contains the circuit description, power supplies, transistors, some simulation commands to run the thing, and some model data. The output is graphical indeed: a time axis and a voltage axis, the ideal green input (yeah, it's still green) and the simulated output, where you see the inverted signal. Yeah. This is the ngspice user interface. On the input side, you put in the circuit netlist, the circuit description. You put in models or model parameters for the devices you're using in your circuit, and you put in simulation commands. The output could be data tables, or tables to file, of course, or graphical plots. We use the venerable X11 interface or the native Windows plotting capability, or you can plot to PostScript or SVG, or use gnuplot or other tools for the output. Yeah, what's new in ngspice? The current release is ngspice-42, released on December 27th last year. I will talk about these things a little bit more in detail in the following. We have a new matrix solver in addition to the venerable Sparse 1.3. We support Verilog-A coded compact device models. We allow co-simulation for mixed-signal simulation, with Verilog digital circuit blocks and mixed-signal digital-analog parts within ngspice. We also allow co-simulation, again mixed-signal, with C-coded digital: there is a way to translate C code into ngspice-readable shared libraries. And we are benefiting from the vastly improved graphical user interface that KiCad, especially the upcoming KiCad 8, is offering for using ngspice. Well, the matrix solver. What is the circuit simulator doing?
The circuit simulator, if you look inside: ngspice gets the circuit, makes a setup, parsing the netlist, reading the model files. And then, if you do a transient simulation, a simulation versus time, you have this inner loop between model equation evaluation, where the data go into the matrix, and solving the matrix. Then you go to the next time step, and you repeat this until the time is over and you look at the output. The model evaluation is already running in parallel in ngspice. We use OpenMP, so if you have a multi-core processor, as you typically have today, you benefit from that. The matrix solving is not parallelized; these sparse matrix solvers are difficult to parallelize. So we had been looking for a long time for an additional matrix solver. We used Sparse 1.3, developed in 1986. And now we have an additional, optionally selectable KLU matrix solver, which is under ongoing development by T. A. Davis and his co-workers. With KLU you get a simulation speed-up by a factor of 1.5 to 3 if you have large circuits, especially if you do circuits for IC simulation. And this is, of course, an advancement. We allow Verilog-A compact device models in ngspice. Compact device models, these are the model equations describing modern transistors, for example. These complex, tiny things like FinFETs may have 500 parameters and lots of differential equations to describe, and people do the development in Verilog-A. So we had a real need for an interface to this Verilog-A, because it provides access to modern devices like BSIM-BULK, which is for ultra-short channels, or BSIM-CMG, which is for FinFETs, or models for gallium nitride devices, power devices, high-speed bipolar transistors, and so on and so on. Yeah, and we got this set up in cooperation with the company SemiMod, who did this open-source development. We have the Verilog-A model description.
We compile this model with the open-source compiler OpenVAF, compile it directly into a shared library, and this shared library can be read by ngspice, which has got the OSDI interface. So we are reading the compiled Verilog-A model directly from a shared library or DLL. Yeah, and we make use of this. For example, as has maybe been mentioned already, open-source PDKs for IC design are upcoming, and one of these is the IHP open-source PDK. This is a 130-nanometer CMOS process with integrated ultra-fast bipolar transistors; ultra-fast means 500 GHz or so. The model used for the bipolar is the so-called VBIC model, which has been integrated into ngspice for some years now, and the MOS model is the PSP model, currently, I think, developed by CEA-Leti in France. This is Verilog-A, and we translate it and put it into ngspice, and so we can support this open-source PDK with simulation. This is just a simple example, a 19-stage NAND gate ring oscillator: we have 19 NAND gates in series, feed them back, and it starts to oscillate. And we have your frequency here, this is an FFT of the signal, a frequency of 600 MHz, and you divide it by 19 and by 2, and then you get an inverter delay of 280 picoseconds. Okay, yeah, we also allow digital Verilog circuit blocks in ngspice. It looks a little bit more complex, but it isn't that complex. We have a Verilog digital circuit block. We compile this with the open-source compiler Verilator into some intermediate C code, and then we compile this intermediate C code, with some C templates in addition, which are constant, always the same. We compile it with GCC or MSVC into a C-coded shared library. And this C-coded shared library is read by ngspice. ngspice has a so-called code model interface, and we have written a code model, d_cosim, which directly interfaces this shared library. So we can now run simulations with a standard ngspice netlist, which may contain lots of analog, plus digital blocks.
This is an example; it's just a demo, not a productive simulation. This is a successive approximation register analog-to-digital converter, six-bit. And this uses the digital SAR block written in Verilog, with the analog part, which is a capacitor array with some switches. Okay, and even if things look complex, using this is not very complex. You need two commands. You have this command, ngspice, and ngspice calls a script written in the ngspice control language, and you enter the ADC Verilog description. It compiles the Verilog, it runs GCC, and then you call the SPICE netlist with the standard command, ngspice adc.cir, which contains the analog part and the simulation control, and then you get this kind of thing. Okay, I just enlarged it a little bit. You see that it's a successive approximation. This is the ramped-in voltage, and the x-axis is time. And this is a new start: we try to get the value of this point here. It starts with the starting value and then successively approximates the input. Well, with a certain delay, 8.5 microseconds here, which is the time you need for the conversion, then you are here in the stable phase, and this, the red line just shifted by 8.5 microseconds, is the output signal. Yeah, so digital plus analog. Okay, you can also do this with C-coded digital types of models. You have C-coded independent processes. You compile them with GCC, for example, or with any C compiler. And these communicate with ngspice via another code model; this digital interface is called d_process. Well, this has been developed by Uros Platise from Isotel some time ago, but for the recent version we have adapted it a little bit, modified it so it will also run under MS Windows. And now we can simulate some circuit which has some circuit blocks from C code. This is again just a simple example. The C code you see here is a Gray code generator.
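The successive approximation just described is a simple loop: try each bit from the MSB down, and keep it if the trial DAC value does not overshoot the input. Here is a minimal behavioural sketch in Python (idealized, not the actual Verilog SAR block from the demo):

```python
def sar_adc(v_in: float, v_ref: float = 1.0, bits: int = 6) -> int:
    """Idealized 6-bit successive-approximation conversion."""
    code = 0
    for b in range(bits - 1, -1, -1):
        trial = code | (1 << b)              # tentatively set this bit
        if trial * v_ref / (1 << bits) <= v_in:
            code = trial                     # keep the bit: DAC <= input
    return code
```

Six trials, one per bit, which is why the conversion in the demo needs a fixed number of clock cycles (the 8.5 microseconds of delay) before the result is valid.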
This Gray code generator is compiled and loaded into ngspice, and this is the output. The plotting here is by GTKWave, because this gives a nice digital plot. Yeah, and you can use these kinds of blocks. So you define these compute functions, with data out and data in and some others, and with the time or the clock signal going in, you can run C-coded digital circuits. Okay, so now I want to talk about schematic entry for ngspice, because this is under continuous development, and it's a nice, usable thing. Why do we want to have such a graphical user interface? Well, a netlist as input quickly becomes confusing. You need schematic entry. You need to see circuits, circuit schematics, and then have an interface to the simulator. You get better documentation, of course, if you group inputs and outputs. This is not an ngspice development; we don't develop these graphical user interfaces ourselves. We make use of existing ones or support their development. And of course you need one, because most of the other simulators have one, so you have to offer one. There are three of these interfaces currently under development with which we cooperate. There is a thing called Xschem, whose main focus is on IC design. There is another one, Qucs-S. This is a very universal interface, which specializes a little bit in RF simulation. And then, okay, we have KiCad. I wouldn't say that KiCad is developed to be a graphical user interface for ngspice. No, the other way around, yeah? You have heard about this PCB design and layout tool, and it offers simulation, and the simulation engine is ngspice, to support the circuit designer. So, of course, I can then make use of this beautiful interface. Okay, the slides just show these interfaces in strange colors. I won't talk about these; I want to talk about this one, again in strange colors. Okay, but you could imagine that it could look nice.
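The Gray code generator mentioned above is a one-liner in most languages: the binary-reflected Gray code of n is n XOR (n >> 1), so consecutive outputs differ in exactly one bit. A Python sketch of the same idea (not the C code from the slide):

```python
def gray(n: int) -> int:
    """Binary-reflected Gray code: consecutive codes differ in one bit."""
    return n ^ (n >> 1)

# The first eight codes, as a 3-bit generator would emit them.
codes = [gray(n) for n in range(8)]
```

The one-bit-per-step property is exactly what makes Gray codes attractive for encoders and counters crossing clock domains: no intermediate glitch states can appear.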
This is the Eeschema window with some circuit, a simple circuit, a simple phase shift oscillator, oscillating with a 4.2 kHz frequency. And down here you see the FFT. Of course, you see it's not a super clean sinusoidal signal, but okay, this is the 4 kHz thing here. Yeah, so what does the interface look like? Eeschema does the schematic entry. Eeschema generates the SPICE netlist, and Eeschema also does the graphical presentation of the results. So it sends the circuit netlist to ngspice, it sets model parameters in ngspice and the simulation commands, and it gets back the simulation results. ngspice is used here as a shared library within the KiCad process. Yeah, I would like to make a live demo. I don't like these colors, but let's see if we can survive somehow. Okay, this is my starting template. I do not start from zero because it takes too much time. So, this should become an operational amplifier circuit, a simple thing: an amplifier with a gain of 10. Okay, what is missing is the operational amplifier. I try to grab it from the library. So we just load the library; it takes a little bit of time, but only the first time, then it gets faster. I know that it is in the library Simulation_SPICE, and here is the op-amp. I grab it, and I move it, and hopefully it fits because, yeah, it did last time. Yeah, it does. Okay, so this is how you place additional elements. Very simple. But now, let's stop. We don't need any more, I hope. Yeah, and now we do the simulation. This is a real-time simulation. I look in Inspect for the Simulator, and I get this simulator interface. Well, black is green, and pink is white. Okay, I'm sorry for that. What do we want to do? We want to do the transient simulation. Transient simulation is output and input versus time. Okay, and so, what is our input? Let's go back and have a look. The input is a sinusoidal signal with an amplitude of 0.1 volts and a frequency of 1 kilohertz.
Okay, back to the simulator window, and I just click to start the simulation, and here is our simulation. The input is the small one, and the output is the red one, which stays red. That's great. Okay, so this is transient simulation versus time. We could have another simulation. To be honest, I have prepared this one. This is the so-called AC simulation, small-signal simulation versus frequency, so you see the frequency behavior of this kind of circuit. Yeah, we again run the analysis, and you see that the amplification is 20 dB, so a factor of 10, and it is constant, but the operational amplifier has one single internal pole, and so it goes down. Okay, so this is very quick; you just see what's going on. I think I have time to make an additional change. I put an additional capacitor in here. I select my capacitor, I transform it because I have to rotate it. I put it just in here. Let's do it in here. And I have to give it a value. I guess I take 1 µF. Yeah, and then we go back and do the AC simulation again. Oops, something changed. We still have this low-pass behavior, it stayed, and now we have some high-pass behavior for the low frequencies, due to this input capacitor. Yeah, so very quickly you do a small change, and with a simple click, we are there. Okay, so this is what I wanted to show live. Let's go back to the slides, and I'll give some more examples. Yeah, the first example. This is, again, why do you want to simulate? This is a 2.5 kilowatt class D audio amplifier. And you would say, this is strange. No, you go to Amazon and look for these kinds of amplifiers: 300 bucks. You can get a kilowatt amplifier today, because it's a digital amplifier. And, okay, so what did I do to get this simulation? I made a symbol myself for this audio driver circuit, just drawing the symbol. And this audio driver circuit is also something I created myself, because it has the analog input.
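The numbers in this demo are easy to check by hand. A gain of 10 is 20·log10(10) = 20 dB, and the added input capacitor forms a high-pass with the input resistor at f_c = 1/(2·pi·R·C). A quick Python check; note the resistor value here is my assumption, the talk doesn't give it:

```python
import math

gain = 10.0
gain_db = 20 * math.log10(gain)    # the flat 20 dB seen in the AC plot

C = 1e-6                           # the 1 uF capacitor from the demo
R = 1_000.0                        # ohms, an assumed input resistor value
f_c = 1 / (2 * math.pi * R * C)    # high-pass corner frequency in Hz
```

With these example values the corner lands around 159 Hz; a larger input resistor would push it proportionally lower.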
It has a pulse width modulator; this is the translation from the analog signal to a pulse-width digital signal. It needs something more: it needs a complementary push-pull output, because we have two transistors here. And it has a dead time generator to avoid shoot-through, because what will happen? You have minus 100 volts here, plus 100 volts here. And if you manage to open both of these transistors at the same time, you will see the result in the form of smoke. And so you have to avoid this. And, okay, there are some simulation commands in here. The input is 2 volts, again 1 kilohertz. You see the power supply. The output load is a 2-ohm resistor. Well, and this is the output. This is the input signal, and this one is the output signal. Okay, and with the double frequency you have the power signal, the blue one here. And if you do an RMS over this output power signal, you see here it's kilowatts; up to 4.3, for example, and you get an output power of 2.6 kilowatts. The simulation has a great advantage: nothing explodes. You can just do it, and you can investigate the output filters and check loudspeaker models and everything, just by simulation. Of course, you can also do real amplifiers. Tiberio Vecol has made this Q17 amplifier, derived from the famous Quad 405 audio amplifier. You see lots of transistors in this thing, the output stage; the input is an operational amplifier, which is the modern contribution to the whole thing, and some voltage generators here. Well, yeah, and you can of course simulate this, and, similar to our 2.6 kilowatts, here it's 100 watts. What you see here is that at 300 milliseconds we switch the output load from 8 to 7 ohms automatically, to check what the output load would mean, and you see a little bit of an increase in output power. So you can model all these things, and model the influences, and so on and so on. Okay, ngspice allows you to do mixed-signal simulation.
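The dead time idea is simple to model: after every PWM edge, hold both gate drives off for a few samples so the high side and low side can never conduct together. A behavioural Python sketch (sampled and idealized; not the actual ngspice code model):

```python
def gate_signals(pwm, dead=2):
    """Split a PWM sample stream into high-side/low-side gate drives,
    forcing both off for `dead` samples after every edge."""
    hi, lo = [], []
    off_count = 0
    prev = pwm[0]
    for s in pwm:
        if s != prev:                # edge detected: start dead time
            off_count = dead
            prev = s
        if off_count > 0:            # dead time: both switches off
            hi.append(0)
            lo.append(0)
            off_count -= 1
        else:                        # normal complementary drive
            hi.append(1 if s else 0)
            lo.append(0 if s else 1)
    return hi, lo
```

The invariant worth checking is exactly the one the talk motivates: at no sample are both gates on at once, so the plus-100-volt and minus-100-volt rails are never shorted through the two transistors.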
Mixed-signal simulation means you have analog and digital circuits in the same simulator. You could also simulate the digital part like the analog part, but this takes a lot of time, and if you have more than a few gates, it would be much too slow. So ngspice includes an event-based simulation, which is very fast, and this is a mixture. Well, this is the venerable 7400 series of devices. You have flip-flops here, you have some output decoders and some NAND gates, you have some XOR, and these are NOR gates. Yeah, and you can simulate this whole thing together. You see that this is mixed-signal because we're using the digital output here for a delay line. So we have an RC delay and another RC delay, and we have the original signal, and this gives an output pulse of a specific width. This is the clock signal generated in this circuit. And this circuit here, which is shown, is a rotary encoder, an encoder which gives optical signals when it's turned one way or the other, and this is the digital output, again plotted with GTKWave. And you see here that the Q1 signal is coming before the Q2 signal, because in the rotary encoder these two detectors are shifted a little bit, so you know that this is turning left, for example. And here the turning is changed to the other direction, and you see that Q1 is coming later than Q2, and this is detected by this circuit. You have here the pulses, let's say, for turning left, and then the turning is switched from left to right, and you see the output pulses here for turning right. So mixed-signal simulation is effective: the whole simulation takes 25 milliseconds, so it's ultra-fast, one click and it's there. You can even run this on this computer here, which is not the fastest machine. And we can have pure digital. I made a symbol for this up and down counter.
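The direction detection just described, Q1 leading or lagging Q2, is the standard quadrature trick. A small Python sketch of that logic, independent of ngspice and with made-up sample sequences:

```python
def decode_quadrature(samples):
    """Decode rotation direction from successive (Q1, Q2) encoder samples.

    In quadrature encoding the two channels are 90 degrees out of phase:
    if Q1 leads Q2 the shaft turns one way, if Q2 leads Q1 the other.
    Returns one +1 (say, right) or -1 (say, left) per state change.
    """
    order = [(0, 0), (1, 0), (1, 1), (0, 1)]  # Gray-code cycle, one direction
    steps = []
    for prev, cur in zip(samples, samples[1:]):
        if prev == cur:
            continue  # no state change between these two samples
        delta = (order.index(cur) - order.index(prev)) % 4
        # delta 1 = one step forward; delta 3 = one step back.
        # delta 2 would mean a missed sample and is treated as backward here.
        steps.append(+1 if delta == 1 else -1)
    return steps

# Q1 rising before Q2: forward; the reversed sequence reads as backward.
fwd = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(decode_quadrature(fwd))        # [1, 1, 1, 1]
print(decode_quadrature(fwd[::-1]))  # [-1, -1, -1, -1]
```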
You have the input clock, you have the input up/down signal, and here it's a 3-bit, 8-state counter. Inside of this is a state machine, and it's a very, very simple state machine. You have here the states from 0 to 7, so the 8 states. Here are the signals you see, from 0, 0, 0 up to 1, 1, 1, and here is how the states are switching. If we are at state 0 and the input is 0 (input 0 means backward counting), then the next state is this one here. Or if the input is 1 and we are at state 0, then we go to state 1. If we are at state 1 and we count down, we go back to state 0. If we are at state 1 and we count forward, we go to state 2. So you can do very simple programming inside one of these code models used by the digital event simulator of ngspice. Well, and here's just the signal, the clock signal. This is the up/down signal: we count up and count up, then we switch to down and we count down, and then we switch up again. So a very simple simulation, and the simulation time of this whole thing is a mere 37 milliseconds, so it's very fast. Okay, so much for the examples. What's next in ngspice? Here are listed some ideas, some more or less fixed plans, and some actual activities. We will do more tests with the open source PDKs, supporting the SkyWater PDK, and especially the upcoming IHP PDK, to support analog, mixed-signal and RF simulation for these kinds of designs. We will improve the RF capability by adding harmonic balance with a specially effective method, for example to simulate intermodulation of signals and so on. We will support reliability and degradation simulation. Well, nothing lasts forever, chips don't last forever, and people sometimes want to know how long they will live, and so you can try to model that, and this will be done here, hopefully with a funded project; this is very interesting. There has been the request for transient noise simulation.
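The state table the speaker walks through (state 0 with input 0 wraps to 7, with input 1 goes to 1, and so on) is compact enough to mirror in a few lines of Python. This is a sketch of the same logic, not the ngspice code model itself:

```python
def updown_counter(inputs, n_states=8):
    """Mirror of the 3-bit, 8-state up/down counter state machine.

    Each entry in 'inputs' is one clock edge: 1 counts up, 0 counts down.
    From state 0 with input 0 we wrap to state 7; from 7 with input 1 to 0.
    Returns the sequence of visited states.
    """
    state = 0
    trace = [state]
    for up in inputs:
        state = (state + 1) % n_states if up else (state - 1) % n_states
        trace.append(state)
    return trace

# Count up three times, then down twice, like the plotted up/down signal.
print(updown_counter([1, 1, 1, 0, 0]))  # [0, 1, 2, 3, 2, 1]
print(updown_counter([0]))              # [0, 7]  (wrap-around going down)
```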
This is a difficult task, because we don't want to rewrite the complete simulator; we have to figure out ways, and again, it would be very difficult to do that. If somebody is interested in integrating this into ngspice, please let me know. We will improve the usability of the KiCad-ngspice graphical interface. Continuously, people are requesting things, and we are detecting things, and we can try to simplify things; we can try to support more of what ngspice is offering internally right now. For example, the digital simulation should be supported by having digital basic blocks as input, and digital plotting, for example, as output. And we have to enhance compatibility, because somehow we are competing against commercial simulators like LTspice or QSPICE, or PSpice, or HSPICE, and others. We cannot do this in full, but the basic things should be compatible. All four I have mentioned have slightly different input languages, slightly different models, and so you have to take care of this somehow. Yeah, that's it. That's the information I wanted to provide you with; here are some support websites, if you need more details, here they are. Thank you. APPLAUSE So, while we are taking questions, the video team is going to try to repair the video locally, so your questions will not be able to refer to the slides. Hi Holger, you said something about the degradation of semiconductor devices. Would it be possible to simulate degradation caused by radioactivity? Yes, this is included in the development plans. Thank you for the presentation. A quick question: how do we input the state machine in the component? Is there a special window where we come and type it, or must the state machine be written in a .c file or something that we give to the component? Yeah, the question is how we can code the state machine into ngspice. The simple state machine I have shown is just a text file.
This text file is loaded: you put into your SPICE netlist a single line with a specific model, and this model loads the state machine. That's it for the simple things. For complex ones, you could of course write state machines in C code if you want to; then you have to do this translation. My question is maybe a bit naive, but would it at some point be feasible to include the tracks or geometry inputs from KiCad, in order to mimic the links that you place between your SPICE components? Please, it's a little bit... The track width, and we also have the PCB stack-up. Would it be somewhat feasible, from these geometry inputs, to associate a kind of approximation of the S-parameters of each line between the components? Yes, there is some work ongoing, not that intensive, to use an EM solver called Sparselizard to extract these data from your lines in KiCad. I think it's a lack of manpower to make this a real tool. KiCad has added IBIS simulation, so you have IC output and IC input, only the output and input signals, and many semiconductor vendors offer these models. Then you could basically have a transmission line or an RC line in between to simulate the signal integrity. The problem is, as you said, to get these data from your PCB. Slowly, slowly moving on. Basically, yes, but this is KiCad or Eeschema work, it's not ngspice. ngspice takes the transmission line parameters, or takes the parasitic capacitances and resistances, and then does the simulation. So the EM data would have to come from KiCad? Yes, exactly. The EM data has to come from KiCad. I wanted to ask if anybody has used the C interface to, for example, make simulations of existing microcontrollers or things like that that you could have in your design. There has been some activity on this, very scarce. I think it's two people. One is Uroš Platiše, from Isotel. Just look up his website, Isotel, and you can find some information on that.
There has been another guy; I think he has used an Arduino interfacing to ngspice, but I don't know much about this work. Are there any dynamic languages that can be used for a model, or is it just compiled languages that have to be loaded? If you don't care about simulation time, for example, would it be possible to use any scripting language? Yes, there are various ways of making models. You have the very old XSPICE code-model route, but this is compiled and static; it's compiled, it's there. You can do models with ngspice's internal nonlinear voltage sources, for example, and these are very dynamic. And many power semiconductor device makers make so-called subcircuit models, which are comprised of SPICE commands. These can be very complex and difficult to debug, but then you can do whatever you can imagine. Is it possible to perform simulations over PVT, so over process, voltage and temperature variations? Yes. And would it be possible to do this without changing any of the models itself? Yes, this is the typical content of modern semiconductor PDKs, when you think about IC simulation. The worst-case simulation, or corner simulation, is typically integrated. It's different model parameters: the model stays the same, but certain parameters are changed. So we have a question from online. Just a heads-up, we're still working on the video, so lucky for us, Holger is able to continue answering questions for the foreseeable future. Online they are asking: is any post-processing of waveforms, such as THD, FFT, et cetera, possible? FFT is standard. FFT is standard in ngspice and is standard in the KiCad-ngspice interface right now. It's more or less two clicks and then you have it. You can set things up; ngspice has a very powerful scripting language. It's not Python, it's another language which originated in the 1990s. We keep it up, and more than 100 commands are available. And you can do a lot of data processing with this scripting.
So for example, classification into bins: if you do a Monte Carlo simulation, you can run it and classify the resulting data into bins. You can do a lot of post-processing internally in ngspice. Well, of course, if this is not enough, or you want to use standard interfaces, there are Python-ngspice interfaces available, so you can use all the Python libraries that are there for data processing. So there's a lot you can do, but the doing has to be done by you. Okay, we have time for one more question. No? Okay, it seems we're actually out of time. So let's give Holger a round of applause. Thank you very much. Okay, so we're going to check.
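The bin classification mentioned in that last answer can be illustrated in plain Python; the component value, tolerance, and bin count here are made-up examples of the kind of post-processing one might script after a Monte Carlo run:

```python
import random

def monte_carlo_bins(nominal, tolerance, n_runs, n_bins, seed=1):
    """Draw n_runs component values with a uniform tolerance spread and
    classify them into n_bins equal-width bins."""
    rng = random.Random(seed)  # fixed seed so the histogram is repeatable
    lo, hi = nominal * (1 - tolerance), nominal * (1 + tolerance)
    counts = [0] * n_bins
    for _ in range(n_runs):
        value = rng.uniform(lo, hi)
        index = min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[index] += 1
    return counts

# A 1 kOhm resistor with 5 % tolerance, 1000 runs, 10 bins.
counts = monte_carlo_bins(1e3, 0.05, 1000, 10)
print(counts, sum(counts))  # histogram of 10 bins, totalling 1000
```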
Sharing parametric models as web apps with replicad
Hi, I'm Steve. I'm a Swiss software engineer. I like to tinker. I like to make things. I like to share the things that I like and the things that I make. This talk is about these kinds of things. And the story starts, as many stories started today I think, in 2020 or '21. For some reason lots of people started to pick up new hobbies, and I was no exception. So I started to do 3D printing. And it was a lot of fun. I bought a cheap Chinese printer and tinkered quite a bit with it. And I must admit the hardware part was not really my thing. I was more into the modeling part. But yeah, lots of fun. The thing is, it was not as easy to share with friends. A lawyer friend is not going to tinker as much as I do, so it's more difficult to share. But the machines are getting better nowadays. They're getting closer to being appliances. And so I can share them with those friends, and I can try to share the hobby, generally speaking. And the good thing is, for the modeling part, people are not going to model. I assume that these people who are potentially interested in 3D printing are going to go on one of these websites (if you don't know them, they're repositories of 3D models) and apply a very simple workflow. They're going to download the model. They're going to slice it: you use software where you tell it what the model is, what filament you use, what your printer is, and it magically spits out a file. And then they print. That's just it. It's very simple. And if you're not technical, that's perfect. And I've been using this workflow for different things. If you look on the left, you have this thing that you might be using: it's a way to make snowballs that are perfectly shaped. And remember, I'm a Swiss engineer, so I like my snowballs to be perfectly shaped. The other ones are not beer crates, because I don't have a printer that big. They're for batteries.
Anyway, they're great models. They're simple models. They're fun to print, fun to share with people and to give away, and all things like that. And this is not what I'm going to talk about. These are very well modeled and shared as just a single file. What I want to talk about are things that are more like this. This is a very good project that you can find on Printables called the Honeycomb Storage Wall. It's a way to do pegboards with 3D printing. You have this base plate, which is a honeycomb that you put on your wall, and then the community has gone wild and done a bunch of different attachments. And you can attach anything. Here it's probably in someone's office, but I've seen people using it in their bathroom, in their kitchen. People model attachments for everything, and it's just great. There's a big community around it and all these kinds of things. So I'm not going to talk about the attachments and modeling the attachments here. I'm going to talk about the plates. Because what happens is, these things are made for 3D printers that tend to have different sizes and different beds. And this is what you can see in this file: you have different sizes of base plates that correspond to popular printers. They're not going to cover all of them, but you can get quite far. You then have people who want to have nice borders because, perhaps, it's for their kitchen and they want the kitchen to look really good. And so the community has provided. But then you get into these explosions of combinations, and your case is not necessarily covered by the community. And I can see people in the back just screaming: parametric models! And yeah, yeah, I know. And this is what we usually think about it, I mean, parametric modeling software. Anyway, I don't think these are the best answers.
This is one of the best ways that we have now, but I think we can improve on it. And I'm going to show the limits first. The people making the Honeycomb Storage Wall project are really good, and they have shared the files that they used to build it. So they built the thing with Fusion 360, and some people in the community have also re-implemented the model in OpenSCAD. And I'm just going to walk through the simple workflow I showed at the beginning, download, slice, print, and what it looks like if you were new to 3D printing and wanted to change the size of your build plate. So you download the model; this part is the same. Then you have to find the hobby version of Fusion 360, or whatever it's called now. And if you try to do that, you know what I mean: it's not easy, they kind of hide it, and they try to get some money out of you. If you figure that out, then you download it, sometimes a big file. Then you have a professional tool in front of you. Personally, I'm comfortable with Fusion 360; this is what I used to actually learn CAD, so it's quite nice. But you just have a huge program in front of you and you don't know what to use. Perhaps you're like me and say, oh, it's a challenge, I'm going to be interested and watch a lot of videos and learn how to CAD. But perhaps by the time you're done, you've forgotten what you were trying to do. Or, what is more likely to happen, you're going to abandon. You're not going to customize it, or perhaps you're going to ask your friend who's more technical to do it for you. With OpenSCAD, you have something similar. I'm going to go faster. So, download the model, then you download OpenSCAD. And the thing is, it's an open source thing; you don't know exactly what it is. It doesn't look as professional as the other tools.
And the computer tells you, oh, this software is not signed, are you sure? Perhaps you're just going to abandon right there because you don't trust it. Then, it's code CAD. I love code CAD. But if you're a lawyer, code CAD is not your thing. And perhaps you're going to try anyway, but you're going to change the wrong line or add the wrong type of code and all these kinds of things, because not everyone is a programmer. And so you're going to fail, you're going to abandon, and you're not going to have the thing that you want. So what do we want? We want to lower the bar for the end user, to make these parametric models accessible to everyone. And the solution I propose is to have something that works fairly similarly to what we had before. You just add a configure-model step, and then download, slice and print. And how can you do that? You have these web generators and configurators. I don't know if you know them, but typically, if I want to create a QR code, I just Google "QR code generator", I skip the first five because they're probably just full of ads, and I know the good one (I don't remember it right now). You have these kinds of things; each is single-serving, for one particular purpose. And it's just great. And there is no reason, or there might be a small one, why we shouldn't share our parametric models that way. It's just software that does one thing, you know, the UNIX philosophy. And so I have a QR code for you. You don't need to go there, because you can see it here. It's a configurator that creates the Honeycomb Storage Wall thing. In the middle, you have the model. At the top, you have something to configure it: the number of rows, the number of columns. And here you can just download. You can see how it goes: configure, download.
I don't have the slicing and the printing because that's another tool; I don't want to implement a slicer and a printer in my software. It's just something very simple. And you have a couple of extra things: you can edit, you can see the code and all these kinds of things, because we share stuff. The code is open. Just go. Now what you're thinking is: I don't want to build my own configurator, because you have to maintain a server, you have to pay for it. You have many reasons you might not want to do it, right? Perhaps you're more of a back-end person and you don't really care about building the UI. Perhaps you don't want to touch C++, or you don't want to compile stuff on servers. Many reasons you might not want to do that. And so, the thing that you've already seen, because there was a bit of a spoiler: we want to lower the bar for the maker as well. And the way I've been trying to do that is with this project that I've been building, Replicad. That's not its first purpose; the first purpose is going to come later, just a bit of suspense. But with Replicad you can, as someone who is interested in code, make a web configurator very easily. So what is Replicad in that context? First, it's an online workbench for code CAD. So if you want to do code CAD, that is, drawing with code, you can just go to the workbench; you code on the left and you have your model on the right. This is something that was probably originally done by OpenSCAD, and you have many different examples now. You have CadQuery, which is something similar, and in terms of something purely online, there's something called Cascade Studio that exists. So it's nothing completely new. But it's there, it's something that exists.
You can do your model. Then, something that is a bit different: it's done in JavaScript/TypeScript. And perhaps some people are asking, why would you choose JavaScript? Many reasons. It's a great language now; you should try it again. And the second one is, if you are actually new to code CAD, you might want to learn to code, and there are lots of resources for JavaScript online. It's a bit everywhere, so there are lots of resources to learn it. Also, npm exists. If you want to do some Voronoi stuff, there's a library for that. So it's also quite nice to just use a general language and not have a specific language for what you do. It uses the OpenCascade kernel, which means that you can make fillets as well as you can make them in FreeCAD, which, you know, means what it means. I mean, it's a powerful kernel. It's not perfect, but it's very powerful, so you can do lots of things with it. And the last thing, which comes back to why I was introducing Replicad in the first place: we have a built-in web configurator. So you draw your thing, you can download the model, or you can click on something to share it and generate a link. And if I perhaps didn't leave you the time to open the configurator from before, this is what you get directly: just a bunch of parameters that you expose, and a way to download the model. It's very simple. But that's not everything. The first thing is, as software developers, we build things for ourselves. So I'm saying that I'm lowering the bar for the end user, but no, no, I'm doing it for myself. It also means that perhaps you're a bit like me, a web dev, and the bar is also lower for you to build things with this. So perhaps, in the list of things I had before, some of you would say, well, it's not that bad to build your own UI. Perhaps I want to.
And actually, Replicad is a library, and so it means that you can: you just import it, build your own front-end project, and use it as a library. And as an example, going back to my conference-driven development project from before: here, you can look in the corner, and it's just parameters, right? It's not that great, because what I want is distances. I don't want rows and columns. Otherwise what you'd have to do is a bunch of math beforehand, which is not very good either. So I did another project that you can look at, which is what you'd expect: an online configurator that generates the same model as the other one did, but with my own UI. It shares some resemblance with the other one because I made both, and so I have a bit of a style, or I try to. And the thing is, now you have distances. You don't need to have rows and columns and do the math, et cetera. Yeah. You have an undo button, because it was already there in the thing that I copied, but you might want undo in your particular project. It's something that I built just for that. And there is a viewer, and actually it's more responsive than the other one, because I spent five minutes on the responsiveness and all these kinds of things. So this is what you can do with this kind of thing. And my aim was to use CAD as a web API. Nowadays, the browser is just an amazing piece of software. You can do audio in it. You can do 3D rendering, which I actually use. But drawing stuff with CAD is not something that is there by default, which is probably a good thing. But you might want to do CAD in the browser anyway, and perhaps use Replicad for that. That's kind of my point. Actually, going back to why I did it: this is something that I made. Another thing that I'm into is making boxes for board games. Don't ask; people have some niche hobbies.
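The "bunch of math" that turns the distances a user actually knows into the rows and columns a parametric model expects is simple. A quick sketch in Python (Replicad itself is JavaScript, and the cell pitch values below are hypothetical, not the real project's dimensions):

```python
import math

# Hypothetical cell pitch of a honeycomb base plate, in millimetres.
CELL_W = 23.0  # horizontal pitch of one hexagon column
CELL_H = 20.0  # vertical pitch of one row

def grid_from_distances(width_mm, height_mm):
    """Turn the wall space a user knows into the rows/columns a
    parametric model expects, keeping at least one cell each way."""
    cols = max(1, math.floor(width_mm / CELL_W))
    rows = max(1, math.floor(height_mm / CELL_H))
    return rows, cols

# A 250 x 180 mm build plate:
print(grid_from_distances(250, 180))  # (9, 10)
```

Doing this conversion inside the configurator is exactly what spares the end user from doing it by hand.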
And so I made a specific UI for making boxes for board games, and it generates the box. But someone might want to generate documentation, or to take a first step without having to install FreeCAD for that. So it might be a tool for that. Or if you have some specific needs, perhaps not just for hobbies but for work, it's something that you could use. That's kind of my point. And so, as we get towards the end, let's think a bit about what we have learned. The first thing, which I didn't really mention but want to stress because it's quite important: I've said, oh, perhaps it's not great to share parametric models this way. Actually, there's no wrong way to share stuff, right? Let's just try to be better about it. What we can do is lower the bar, make it easier to share parametric models, especially as configurators. And then, if you are a bit code-curious and want to play with these kinds of things, perhaps have a look at the workbench. There's a bit of a community on the GitHub repos; you can ask questions in the discussions; people are interested. So have a look. And if you are a web dev and want to do something more involved, you can think of Replicad as a library to work with. And so this is all I had for you. I hope you had fun. Do you have any questions? You make this sound so easy, which is wonderful. But where were the dragons? I mean, part of the thing was learning. I rely a lot on a project called OpenCascade.js, and to me, the dragons are there. Compiling C++ is not something that I want to do, and so I could avoid them. Then, it was lots of fun building the thing, trying to figure out the different technologies and things like that. And then it was about learning OpenCascade.
And one of the things that Replicad does is try to handle memory as well as possible, because OpenCascade is C++, and sometimes you have to manage the memory. There are definitely memory leaks in my project, but you're welcome to find them and fix them. But yeah, it was about trying to find ways to do that. At some point, I designed the API to handle it, and then I found a better way to just have it magically disappear. So when will you buy a laser cutter? So you also make laser-cut boxes? Actually, I don't have a laser cutter, and I've been resisting buying one for a while. Before getting into 3D printing, I bought a Silhouette machine, so, you know, to cut paper. And if you go to the website, the deck-in-a-box website for making boxes: before making the 3D printed boxes, I made boxes in paper. And so it generates die lines; you have the same interface to generate die lines to cut things from paper. But yeah, I'm resisting as much as I can. And perhaps there is one more dragon, a rabbit hole that I partially fell into: the 2D part. OpenCascade is not that great for 2D Boolean operations, and so I started to implement them myself, and I'm now starting to build a 2D CAD kernel, and it's getting a bit out of hand. Great talk. Thank you. Have you ever thought about FreeCAD import, or some kind of connection to FreeCAD, as modeling in FreeCAD might be easier than coding it? I'm not sure exactly how that would work. I mean, you can import STEP files, but then you have the whole model the way it is; I'm not sure it would be easy to do.
And part of what makes code CAD easier is that the kind of topological naming problem they're currently solving in FreeCAD, you don't really have. Instead of selecting by number (you know which one you have in the array because you clicked on it), you can say, oh, I want to take the edge that is at this distance from that, because you know this when you model. Yeah, you have to do a little bit of math to figure out what the distance is and things like that, but normally it's basic geometry, and you're going to get it wrong and do it again, but it's going to be okay. And so, no is the answer, the short answer, sorry. Okay, thank you, Steve. Thank you.
Streaming live using RIST On Demand to thousands, how you can have your cake and eat it too
All right, good morning again. I'm Sergio Ammirata. I'm a board member of the RIST Forum, and an active member of the RIST activity group that writes the specs for RIST. And I'm also the maintainer of the librist open source project. So today we'll be talking about how RIST can support end-to-end live streaming with packet recovery. But in particular, I will explain how we can support this in a broadcast scenario, meaning streaming to as many users as your bandwidth can support. We'll cover the topic in two sections. First, we'll provide a roadmap, or an update, on the RIST specification and the librist project itself. And then we'll go to a practical application and show you how you can do live streaming at large scale with the open source tools provided. So, on to part one, the development roadmap. The last time I gave an update at FOSDEM regarding RIST was February 2020, a few days before the pandemic shutdown. Now, four years later, we will explore what has happened since. I guess if I had waited one more year, I could have blamed the Thanos snap for the delay. So let's do a brief recap of the beginning of the protocol. In 2017, the VSF (Video Services Forum) created the RIST activity group for the purpose of creating a unified, interoperable protocol for transmission of IP data over lossy networks. The requirements were that it needed to be based on the UDP protocol, and it needed to include negative-acknowledgment retransmission requests. One year later, after a successful multi-vendor interop event, the Simple Profile specification was published; you can see that at the bottom. The RIST activity group then proceeded to add multiplexing and encryption capabilities and published the Main Profile specification in early 2020. It was at that time that the librist open source project was first published. You can refer to the talk I gave back then, where I go into a detailed explanation of what the Simple and Main Profiles do, the differences, et cetera.
So as you can see on this slide, the RIST activity group has been quite busy adding features to the protocol to accommodate all possible use cases during the last four years. What started with the Simple Profile, the first release, as the desire to add packet recovery to an RTP stream with an MPEG-TS payload, has now turned into a rich protocol that will work with any payload and which includes multiplexing, encryption, and authentication. librist, the open source project, currently supports the Simple Profile and the Main Profile, and we're working on adding support for the Advanced Profile. In addition to the core specifications for the protocol, the RIST activity group has also published a series of recommendations, or best practice documents. These are documents that extend the protocol into specific applications, specific niches, and you need to consider them in a library that wants to be compatible with the specification. So librist, when applicable, has been made compatible with these recommendations: the clock synchronization, the relay, et cetera. So, enough history about the protocol and the specification documents; those are all publicly available. They're not behind any paywall. The VSF documents, the PDFs, can be downloaded, and you can look at the specs and all the recommendations. Let's talk about the librist open source project itself. In case you are not familiar with what RIST is, we can define it with just one simple sentence, like you see up there: it's a new protocol for transmission of IP data across lossy networks using UDP with NACK-based retransmissions. Before getting into anything else, I'd like to clarify the three most common misconceptions people have about the RIST protocol; these have come up in talks and conversations. People tell me: oh, RIST is only for MPEG-TS. False. The Advanced Profile includes support for any payload, with clearly identified payload types in the header now.
There's even a registry now that supports binary payloads, et cetera. Misconception number two: you need a large buffer, and therefore the latency is large, a second or more, in order to use RIST. False. You really need two to six times the round-trip time, the RTT, between the two endpoints you're trying to send the data through. So the shorter the RTT, the shorter the total buffer required; you can talk about 10 or 20 milliseconds of total latency. It just depends on what network you're deploying in. In addition, and this is a major misconception on that second point, RIST also supports real-time data channels with no added latency: lossy channels where you can have data going back and forth in real time, data that cannot wait for these buffers. Misconception number three: you can only use RIST for transmitting in one direction; you send data over there with packet recovery and you're done. That's false. The protocol allows for bidirectional transmission, both with and without packet recovery, in both directions. The limitations are usually introduced by the implementer of the protocol; the specifications are broad enough that each implementer has the freedom to add or remove features at will. So let's talk about the libRIST development roadmap. How do we determine where to go next? We divide it into three categories. The first one is that we want to improve the reach of the library, and by improving the reach, we mean improving the adoption of the library by client applications so that users can have it available on every device. libRIST, of course, adds support for all the different specs I showed before and all the recommendations, so that all these rich features that give the protocol more use cases are available immediately in the libRIST library. The second is distribution.
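The "two to six times the RTT" rule of thumb above is easy to turn into numbers. Here is a small illustrative sketch (the function name and the constants are just the talk's rule of thumb, not a libRIST API):

```python
def rist_buffer_range(rtt_ms, low_factor=2, high_factor=6):
    """Rule of thumb from the talk: the retransmission buffer should be
    roughly 2x to 6x the round-trip time between the two endpoints."""
    return rtt_ms * low_factor, rtt_ms * high_factor

# A 5 ms RTT on a local network allows a total buffer in the tens of
# milliseconds, while a 100 ms intercontinental RTT needs 200-600 ms.
lan = rist_buffer_range(5)     # (10, 30)
wan = rist_buffer_range(100)   # (200, 600)
```

This is why the speaker can claim 10-20 ms total latency on short links while still recovering lost packets.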
We make sure that our library compiles on every platform so that it can easily be adopted by anybody, and that it makes it, when possible, into open source applications like FFmpeg, VLC, OBS, etc. As a matter of fact, when running within the VideoLAN CI servers, it compiles on all 21 different architectures that are predefined in their CI, so we're pretty confident that if somebody wants to use it, they can. On the distribution side, we also have it in the major distros now, available in Debian, OpenBSD, etc. And the third aspect of how we determine the roadmap is that we do timely enhancements and timely bug fixes, very quickly, when they come about. On the feature set, I think the most important recent addition, the one that allows the library to be used in this broadcast market, the one-to-many media server scenario, is the EAP-SRP-6a authentication protocol. It was introduced in 2022, and what it allows you to do is this: instead of the normal model, where you have a pre-shared key that you have to share among two endpoints, which is very insecure because if the communication of that encryption key gets compromised your entire network is now compromised, this protocol lets you use a username and password, a unique username and password for each of the connected clients. And part of the protocol, while doing that username and password exchange, which is different for each of them, includes the negotiation and exchange of the pre-shared key. So there's no longer any risk of that pre-shared key ever being compromised. Other features that allow broader adoption: we're working on a one-way satellite application, we're working on multicast discovery, and a few other things. So, the third aspect, distribution. Many FOSS projects already have libRIST compiled in by default or have it as an option.
If you know of additional projects, please drop me a line; I'd like to keep a database of which projects already include it, if possible. RIST is also part of my own day-to-day operations, which gives us the advantage of finding bugs before they are found in the wild, and we fix them very quickly. OK. So, performance enhancements: over the last few years, we have added the ability to automatically configure based on network conditions. The libRIST library measures the RTT with a new packet that was released, the echo packet, ten times per second. What that does is let us measure, with a UDP ping, not a regular ping, the network conditions between the two endpoints. We know the inter-packet spacing, the variance, min and max; we know the latency; we know all these things. With those values, those parameters, if you use the default configuration, the library will auto-adjust its buffer to the right buffer for that link, without you having to guess or know anything about the network. It will also adjust the initial reordering buffer based on the jitter on your network: your inter-packet spacing jitter, gaps, and maximum jitter, making sure your reorder buffer is at least that much. Because we've made these very large improvements, we realized we need better metrics, so we've added support for Prometheus and other things straight out of the library. You can grab that, plug it into third-party tools, and immediately create a dashboard that gives you proper visibility into the connections. And the last release was just a couple of months ago. The top priorities for 2024 on the development roadmap: we want to add support for DTLS encryption and authentication, and we want to fully add support for the new Advanced Profile, which adds the new header ID with the special payloads.
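The jitter-driven auto-tuning described above can be sketched in a few lines. This is a toy illustration of the idea (measure inter-packet spacing, size the reorder buffer to cover the worst-case jitter), not the actual libRIST algorithm:

```python
def reorder_buffer_ms(arrival_times_ms):
    """Toy auto-tuning: compute inter-packet gaps, take the typical
    (median) gap and the worst-case gap, and make sure the reorder
    buffer covers at least the excess jitter beyond the typical gap."""
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    gaps.sort()
    median_gap = gaps[len(gaps) // 2]
    worst_jitter = max(gaps) - median_gap
    return max(worst_jitter, median_gap)

# Packets paced every 10 ms, but one arrives 15 ms late:
reorder_buffer_ms([0, 10, 20, 45, 55])  # 15
```

The real library feeds the echo-packet measurements (ten per second) into this kind of calculation continuously, so the buffer tracks the link as conditions change.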
And we want to try to see if we can backport support for the library into VLC 3.0. The goal of the original project, as we mentioned before, was an interoperable standard for this type of transmission. There were half a dozen or a dozen different methods, and there still are, of doing UDP with packet recovery, each vendor-specific, et cetera. Our goal was to create an interoperable standard with multiple implementations, and I think we've achieved that at this point, at least at the higher broadcast level, with tier-one and tier-two companies and a lot of the open source projects that support RIST. They all talk to each other, even if it's not the same implementation. So now to part two. Let's look at RIST as a live streaming platform, and in particular at a one-to-many model. How do you use RIST, and libRIST in particular, to build an end-to-end streaming chain like the one we're doing here, for example, or any one-to-many scenario with lots of viewers? Let's diagram a simple scenario here. We have three components: a source, the sender, which is a RIST device, and many receivers at the bottom; the box at the bottom symbolizes a single one of those receivers. You see the logos up there for FFmpeg, VLC, and OBS Studio. It could also be GStreamer, any source, any encoder; it doesn't matter. It's anything that has the ability to generate a compressed or uncompressed video stream; really, we just need a binary stream of some kind pushed to the library. libRIST in particular doesn't care what the payload is. You can push anything in the payload and we'll deliver it to the other side; even though the specs for Simple Profile and Main Profile say you're transmitting MPEG-TS, the library doesn't look at the payload or restrict it in any way. Okay. So the source is sending a UDP or RTP media stream into the input.
We buffer it so that we have it available for retransmission, and the minute the buffer is full, we put the sender in what we call listening mode. It opens a UDP port and starts listening for receivers that want that stream. The minute a receiver wants to connect to us, the handshake happens. I'm obviously oversimplifying the handshake process here; the SRP-6a protocol is quite complex, and it would take a whole talk just to go through the details of that handshake and everything that happens, so this is only symbolic. The handshake happens, the username is sent to us, and we check for that username in our database of usernames and passwords. It's not really a database of passwords, but of password hashes, to keep everything safe. If the authentication succeeds, then as part of the SRP-6a protocol we send the pre-shared key so that the receiver can now decrypt the data. Once the data is decrypted, that's it: we have an end-to-end transmission from source to hundreds of destinations with just the RIST protocol in between. So with proper planning and setting everything up correctly, you can have 300 milliseconds glass-to-glass, one to hundreds of listeners. You need a good network; like I said, the latency is more dependent on the RTT between the endpoints than anything else. I mention 300 milliseconds because in our large-scale deployments we've done this anywhere within the U.S. with 300 milliseconds glass-to-glass. When you have to expand it and have users across the ocean, or on poor networks or Wi-Fi, the latency will auto-adjust; the protocol will auto-adjust. Those players suddenly get 500 milliseconds. We've noticed as a rule of thumb that somebody on Wi-Fi automatically gets a penalty of another 200 milliseconds. So how do you do this from a practical point of view? The libRIST package includes command-line utilities that allow you to send, receive, and relay.
If you want to do a one-to-many relay application, rist2rist is the ideal tool. You can also do it with a RIST sender, to be honest, but rist2rist is effective because it acts as a relay: it doesn't encrypt or decrypt, it doesn't do anything except receive data and send data, both in the RIST format. You can put this in a CDN, your data center, anywhere. You configure in rist2rist a listener with authentication, and then you push your stream from anywhere, your source, like from here, to that endpoint. Then you configure the other end, the one that's going to send to all the viewers, with a database of usernames and passwords, and now you have full authentication. It adds no additional latency in that process; the only latency is whatever you decide to put in as buffering. As far as quality and quantity, the sweet spot seems to be between 3 and 5 megabits per second at 720p or 1080p resolution; whatever codec you're using gives you better or worse quality. That seems to traverse all the different VPNs, corporate networks, et cetera, without any issues. On quantity: one rist2rist instance can handle 100 simultaneous connections. The number seems low, but because of the threading model and the fact that it has to do retransmissions, beyond that the retransmissions get compromised. The way you scale is to instantiate multiple instances of the same rist2rist application on the same machine, and in our case we have 1,500 simultaneous viewers running off this type of transmission 24/7. The RIST SRP password utility is another command-line utility available in the project; it allows you to create the username and password combination hashes, much like the htpasswd file in Apache, which has a similar format; that's why we created it this way.
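The scaling model described above is simple capacity arithmetic. A quick sketch (the 100-connection figure is the talk's own rule of thumb, not a hard limit of the library):

```python
import math

def relay_instances_needed(viewers, per_instance=100):
    """Each relay instance comfortably serves ~100 simultaneous
    connections before retransmissions suffer (per the talk), so you
    scale horizontally by running several instances on one machine."""
    return math.ceil(viewers / per_instance)

relay_instances_needed(1500)  # 15 instances for the 1,500-viewer deployment
```

Because the relay neither encrypts nor decrypts, the instances are cheap, which is what makes this horizontal scaling practical on a single machine.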
You run the utility, give it a username and password, and it outputs the username with a hash; you append that to a file, and then the sender can grab that file and use it as an authentication database. In case you want to scale that to a much higher level, you integrate directly with the library and use the library callbacks to do the authentication yourself against your own databases, and you can scale that to thousands of users. The command-line sender is the typical scenario I was using in the diagram: as input you put any type of UDP stream, you encrypt it, and the output URL uses the rist:// scheme, colon, slash, slash, then the address and port; you add the @ when you want to listen instead of send, just like you typically do for FFmpeg or VLC, and that creates a listener on that port. That's all you would need to do to create a sender, and you can use the sender as a relay as well, just for one stream. On the receiver side, you want a player, for example, where you can put the username and password. You put the rist:// URL into FFmpeg as the input, rist, colon, slash, slash, et cetera, or VLC, or any player of your choice. In our case, we built a custom VLC application inside a Raspberry Pi, where we were doing this 1,500 at a time: Raspberry Pis running VLC 3.0 with a libRIST implementation inside. The transmission of the secret in this case, which is the password of the username-and-password pair, should be handled the same way you share passwords for any account today, outside the scope of the protocol, and that's it. Then it becomes very simple to create a large-scale network with this. So, in summary, the key feature for this is this new type of authentication, which makes a secure implementation possible at large scale, and it gives you better, lower latency than the equivalent HLS or DASH, with a security model that's built into the protocol.
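The "username plus hash, one entry per line" authentication file works like Apache's htpasswd. The real utility produces SRP-6a verifiers; the exact file format is not shown in the talk, so the sketch below is purely illustrative, using a salted SHA-256 just to demonstrate the append-and-look-up workflow:

```python
# Illustrative only: real libRIST stores SRP-6a verifiers, and the file
# format here (user:salt:digest) is an assumption for demonstration.
import hashlib
import secrets

def add_user(db_lines, username, password):
    """Append a 'username:salt:digest' entry, never the plain password."""
    salt = secrets.token_hex(8)
    digest = hashlib.sha256((salt + password).encode()).hexdigest()
    db_lines.append(f"{username}:{salt}:{digest}")

def check_user(db_lines, username, password):
    """Re-derive the digest from the stored salt and compare."""
    for line in db_lines:
        user, salt, digest = line.split(":")
        if user == username:
            candidate = hashlib.sha256((salt + password).encode()).hexdigest()
            return candidate == digest
    return False

db = []
add_user(db, "alice", "s3cret")
check_user(db, "alice", "s3cret")  # True
check_user(db, "alice", "wrong")   # False
```

The key property, shared with the real scheme, is that compromising the file does not reveal passwords, and each client can be revoked individually by deleting its line.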
It's no longer the browser, or the DRM inside the browser, handling everything; the protocol handles the entire DRM. So we have a really solid roadmap for the future. We're looking for additional contributors, people who want to help add the next set of features, and we're looking for open source projects that want to implement the library; we'll help you put it in. And that's it. Thank you very much. Thank you. Okay, the question is: what if you're pushing your stream to Africa with a really bad connection, what is the acceptable packet loss? I'm not sure what you mean by acceptable packet loss. To me, zero is the acceptable packet loss, and the protocol is capable of achieving zero if you give it enough buffer. Give it a one-second buffer with a 200 millisecond round trip, and you will get zero packet loss. We've done tests and transmissions from Australia; just two weeks ago I was doing a demo, a transmission from Australia to Madrid. Sixteen cameras at 10 megabits per second each were being transmitted in real time using RIST, and they were being used in Madrid for a production of the event. The transmission didn't lose a single packet, and it was all done across the open internet. We used a one-second buffer there because the connections were relatively good, but if your transmission is really bad, just increase your latency and the protocol will recover. As part of our CI integration process we have tests that add 50% and even 75% packet loss. You see spikes in bandwidth, but we recover every single packet if you give it enough buffer. Does it support simultaneous bit rates? Yes, we support multiplexing. In all these examples I've used just one UDP input; you can configure the library and the command-line tools to ingest multiple UDP inputs, give each a different ID, and then demultiplex them on the other side.
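The "one-second buffer, 200 ms round trip, zero loss" claim follows from how many retransmission chances the buffer allows. A back-of-the-envelope sketch (my own arithmetic illustrating the talk's claim, not a libRIST formula):

```python
def retransmit_attempts(buffer_ms, rtt_ms):
    """Each NACK/retransmission cycle costs roughly one RTT, so the
    buffer divided by the RTT gives the number of recovery chances
    before a packet's play-out deadline."""
    return buffer_ms // rtt_ms

def residual_loss(loss_rate, attempts):
    """Probability a packet is lost on the original send AND on every
    retry, assuming independent losses."""
    return loss_rate ** (attempts + 1)

retransmit_attempts(1000, 200)  # 5 recovery chances
residual_loss(0.5, 5)           # 0.015625 -- even 50% loss mostly recovers
```

This is why the CI tests with 50% or 75% injected loss still recover every packet: the bandwidth spikes, but the deadline is generous enough for repeated retries.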
I assume that's what you mean: having different bit rates within the same stream. The camera adjusts it on the fly according to the network conditions? Correct, yes. One of the specifications you saw among the recommendations is called source adaptation, and it was written precisely to accommodate that scenario: the best-practice recommendation on how to do source adaptation, that is, adjusting the bit rate based on network conditions. It's all documented as part of a spec as well. Question: for non-MPEG-TS payloads, as you mentioned, is there already a mechanism, like a registry, to define the mapping of different payloads? Absolutely. For Advanced Profile, there's a GitHub repository that has the mappings already; we have a dozen or two dozen of them. I'm one of the administrators of the repository; all you need to do is go in and put in an MR for whatever binary payload you want to define. All right, thank you. I have another question here. Is it also possible to multiplex and demultiplex subtitles? Yes. The protocol itself doesn't care what you put in; we consider each input a binary payload of some sort, and you're the one who determines the format of that payload. You have this pipe, and you put in multiple UDP streams; one of them can be your VTT payload, or closed captions, or whatever you want to put in, in whatever format you want. We don't define or control the format of what you put in; we do the multiplexing and demultiplexing, and we give you the ability to assign IDs so that on the other side you can map those IDs to different outputs when they come out. Thank you. But that means you don't do any timing between the different streams, right? That's all user-side. Well, no. The question is whether that means we don't do any timing or synchronization.
On the contrary: because we're taking care of the multiplexing, when we ingest all the different UDP streams, the timing is guaranteed. The minute we receive a UDP stream, in our implementation of the library, we grab the timestamp at the network card: this stream came in at this time. Then we reproduce that exact timing on the other end. We reproduce the spacing, the pacing, and the latency; we make it fixed, so it is not variable. That means that when you multiplex many things into the same tunnel, you're guaranteed they're in sync on the other side, or at least as in sync as they were when they came in. Question: what use cases is the protocol moving toward, given the current adoption on endpoint devices, mobile devices, browsers? This is the last question we have time for. The original idea was to just do point-to-point transmissions; that was the original scope when we created the first version of the spec. That has changed. We achieved that, and now we've gone beyond it. Now we want to tackle distribution, the one-to-many, the media servers. We actually have a project going on with MistServer to add a lot of this functionality and scalability as part of that project itself, so that we have at least one media server that supports this in a very scalable way, where it becomes very simple for an application like VLC, or ffplay, or GStreamer to hook up to the media server and start playback immediately using the protocol. Thank you very much.
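The pacing guarantee described above amounts to timestamping on ingest and re-emitting with the same relative offsets. A minimal sketch of the idea (illustrative names; the real implementation timestamps at the network card):

```python
def replay_schedule(capture_times_ms, start_ms):
    """Sketch of the pacing idea: packets are timestamped on ingest and
    re-emitted on the far side preserving the same relative spacing, so
    multiplexed streams stay in sync with each other."""
    base = capture_times_ms[0]
    return [start_ms + (t - base) for t in capture_times_ms]

# Packets captured at 100, 120, 150 ms are replayed with identical
# 20 ms and 30 ms gaps, just shifted by the fixed tunnel latency:
replay_schedule([100, 120, 150], start_ms=5000)  # [5000, 5020, 5050]
```

Because the added latency is fixed rather than variable, two streams sharing the tunnel keep their relative timing, which is the synchronization guarantee the answer refers to.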
PipeWire State of the Union
Alright, okay. My name is Wim Taymans, I work for Red Hat, and I started writing PipeWire some seven years ago; I don't know anymore, way too long. I gave a talk about PipeWire last year, so this is basically a follow-up on that, a bit of what happened in the last year. For those who don't know what PipeWire is, it's basically a multimedia sharing and processing engine. PipeWire was originally built to send video frames from Wayland compositors to applications, because screen sharing in Wayland was completely unimplemented in anything, so there needed to be some way of funneling those frames around. It went through a whole bunch of iterations to make that happen. It started with GStreamer, then a custom implementation, then version 0.2, which was something that sort of worked, and then it sort of evolved into an audio framework; people think PipeWire is for audio, but it was actually first meant for video. So it became an audio framework, and here we are now. The core of PipeWire is to link applications and hardware into a graph. It's very similar to what GStreamer does: you make a graph of processing elements. In PipeWire's case, this is distributed, so it's an IPC mechanism to funnel multimedia around between apps, devices, and so on. There's a whole bunch of multimedia you can funnel around: cameras, screen sharing, but also audio. PipeWire tries to implement all of the APIs to make that possible. So there is support for Video4Linux, there is support for Bluetooth, there is a compatibility server for PulseAudio apps, and a compatibility library for JACK applications. You get all of these things, all sides covered, and you can also run JACK next to it; but in essence, it funnels data around. It's built on the same principle as GStreamer, so it doesn't exactly know what the data is, it just funnels it around, and it does so very efficiently, or tries to. That's basically where it is now.
We managed to build a whole bunch of stuff on it and replace PulseAudio and the JACK daemon on most desktops now with PipeWire. 1.0 was released last year, so that was a major milestone; very happy about that. For that to happen, I wanted to have at least as good latency as the JACK server, so that we could actually replace pro-audio use cases with PipeWire without having to sacrifice latency or performance. That took a while, but it eventually worked, and now we are on par with JACK regarding latency. PipeWire uses quite a bit less CPU for large buffers and is getting close to JACK even for very small buffers. JACK is still more efficient at the lowest buffer sizes, but PipeWire is more optimized in its conversion and in funneling samples around; that's the compromise, I guess. Compared to last year as well, we now have support for netjack with Opus. I think that was a question last year: why don't you have that? Well, now it is there, so you can actually do netjack between JACK and PipeWire; they're compatible. One thing that doesn't work very well yet is FireWire devices. The problem is that I don't have a FireWire device. You can't really buy them anymore, so somebody needs to send me one; they're like €1,000, maybe you can buy one, I don't know. It's also professional audio, so you need special cables to connect; it's just not plug-and-play. So that's still a bit of a gap. What else are we working on right now? AES67. It's basically RTP; it's used by various professional hardware that does audio over IP networks, so you can interface with Dante devices and so on. It requires a shared clock with PTP and all of that, so we've worked that into PipeWire: you can run the graphs with PTP clocks, it syncs, and all of that. People are testing that; it's very specialized hardware, and I don't really have any of these things.
On the other end, we are now past the audio stage and going back to video, because last year some things fell into place to make that possible. For example, DMA-BUF modifier support was added. It requires a multi-step negotiation: I have these modifiers, do you support them? Yes, I do, I support all of them; but then which video formats, and which resolutions? You need to go back and forth, between the compositor in this case and GStreamer, for example, or any other application like OBS, to arrive at the video format and get the most efficient video frames negotiated. We also added support for compressed audio formats. For Bluetooth, we are still tracking LE Audio; it's a draft. There is development in BlueZ, which is the D-Bus service that runs and handles all the connections with devices; it exposes a D-Bus API that an audio server such as PipeWire can use to talk to the Bluetooth devices. So there is development there, and we are trying to track it and match it to make that work. There are also some small things that were added that we don't actually know what they'll be used for yet. The interesting thing that is happening is the video support, and I hope this will continue going forward this year. We added PipeWire camera support in Firefox, which means that instead of Firefox going directly to the Video4Linux device with ioctls, which is not so nice in sandboxes, and which also doesn't work with newer cameras, because newer cameras need much more setup: they need media-controller setup and all of that. There is a library called libcamera that handles these new kinds of cameras that you're supposed to use. But instead of porting Firefox over to libcamera, it's better to port it over to PipeWire, because then you get all these new cameras, but you can also do other things, like sending video frames between applications into Firefox.
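The back-and-forth negotiation described above is essentially an intersection of capability sets, narrowed step by step. Here is a toy sketch of that idea; the names and data structures are illustrative, not the actual PipeWire API:

```python
def negotiate(producer_caps, consumer_caps):
    """Toy multi-step negotiation like the DMA-BUF one described in the
    talk: first intersect modifiers, then formats, then resolutions,
    and pick the first mutually supported combination."""
    mods = [m for m in producer_caps["modifiers"] if m in consumer_caps["modifiers"]]
    fmts = [f for f in producer_caps["formats"] if f in consumer_caps["formats"]]
    sizes = [s for s in producer_caps["sizes"] if s in consumer_caps["sizes"]]
    if not (mods and fmts and sizes):
        return None  # no common configuration; fall back or fail
    return {"modifier": mods[0], "format": fmts[0], "size": sizes[0]}

compositor = {"modifiers": ["linear", "tiled"],
              "formats": ["NV12", "RGBA"],
              "sizes": [(1920, 1080), (1280, 720)]}
obs = {"modifiers": ["tiled"], "formats": ["NV12"], "sizes": [(1280, 720)]}
negotiate(compositor, obs)  # {'modifier': 'tiled', 'format': 'NV12', 'size': (1280, 720)}
```

In the real protocol the preference order matters (both sides rank their lists so the most efficient option, typically a tiled GPU layout, wins), which is why the negotiation needs multiple round trips rather than a single exchange.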
I was going to try to demonstrate that, but the camera support in OBS is still a pending patch; maybe next year. So there is also camera support there. OBS is an application for making screencasts and YouTube videos and things like that, so you can compose some things, and I'll try to demonstrate that. There is also a thing called the virtual camera: OBS can export its scene so that it looks like a camera that PipeWire provides, and then you can consume that feed in Firefox, and you can start chaining, just like you would chain audio processing elements, but with video. That's hopefully something we will try to make work this year; there's some more work needed to get that going. On audio we're bug fixing and doing small improvements, because there is nothing really left to be done; as far as we know, it should work. All the remaining problems are, in my opinion, driver issues: timers that don't work so well, unpredictable delays in drivers. I think that work needs to be done somewhere else; no immediate plans to fix things there. So all the work goes into the video side of things: video routing. We're working on video converters so that we can convert between formats; for example, if you want to implement certain shaders that work on one format and not on others, that should be made possible. Also processing filters with Vulkan shaders or processors. Now that Firefox and OBS use PipeWire for cameras, we need to start thinking: OK, this is now going to work in Flatpaks without having to open up the whole socket. But then we can also start adding security, like the pop-ups: do you want to allow this camera, yes or no; or take away access to the camera if you don't want it anymore. There are some talks about making that better; this is currently being planned with the portal. But there are other use cases too; for example, we don't have any access control at all for audio in browsers.
But that is something that we'll hopefully flesh out this year. Another thing: explicit sync support. Again, if you do video processing, it's better to queue up as much work in the GPU as you can and then have the GPU itself synchronize all the buffers, waiting for rendering and things like that. So explicit sync would transfer buffers along with a file descriptor that you can use to wait for completion of the actual buffer data. That's also something we want to try to do. And then tooling and docs, the things we keep working on. So I was going to show you a little bit of what the video side looks like; everybody knows the audio side. Also a little bit of the tools here; I don't know if you know any of these things. There's a top-like tool. This is interesting; it doesn't show anything because there's nothing going on. You can also get things like a dump view; I'm showing that now because then you can see the cameras as devices as well. So if you, I don't know, let's see if this is going to do anything. It probably does, but it's going to the HDMI. Anyway, you can see, maybe it comes up on the feed, I don't know. So you can have a little look at what's going on here. This is a tree view of the graph, basically: you have the audio driver iterating and pulling in samples from another tool, paplay. You can also see this as a graph view. And all of these devices and nodes are in a graph; you can visualize the graph, change the links between these things, and do all of that. So for example, for OBS, this is kind of what it is. Well, I made a very stupid scene, but you can make some interesting things; you can put some backgrounds there and place yourself there. This is using screen sharing from one of my windows.
I think the terminal, but it could be anything using PipeWire, and also the camera capture, which is a new thing using PipeWire. And these things here, the microphones, are still a bit PulseAudio. You can look at the graph here; that's becoming a bit more complicated, but you can see these yellow boxes light up, so you'll see that hopefully a bit more. So GNOME Shell, that's the screencast stream that sends video to this one; that's the camera going into OBS. Yeah. I was going to show some Firefox things, but there is no export button here. Normally in OBS, you can start streaming and send all of that to one of the hundreds of destinations that are supported, but you can also start a new camera, a virtual camera, and then you could consume that camera, or this composition, in other PipeWire apps. So if we enable all these PipeWire apps and we make them as efficient as possible, with all of the video modifiers and all of the tools that we get from Vulkan, we should be a step closer to the ultimate goal. Yep. Another interesting thing I haven't shown yet is what's called a filter chain. You can make a small little file; wait, let me see where I put that again. Yeah, this one. It's a config file. It's not very easy, and I can imagine GUIs that generate these things, but nobody has written any of them yet. You can basically make a little graph of LADSPA plugins and LV2 plugins, link them together, and then tell PipeWire to make a new sink out of that. That's the input for applications to use, and then that is the output of this filter. So this is something that does gain. Let me use some debug here. Okay. You can run this graph, and if all goes well, you should see a new sink here. So this is this new thing that appears. You can just stop this program again and take away the sink. So this is interesting. And did I quit?
I can do it again. And here is this new volume sink. So you can create and remove devices on the fly if you want. It's a bit like PulseAudio with loading modules, but in PipeWire's case you don't need to load them all into one daemon; you can have separate programs starting and stopping them as they go. This filter chain, for example, is used for implementing things like sound correction for speakers and all of that. We haven't done any of these things on the desktop yet; maybe that's something we can do. For example, on Apple laptops the sound is so great because they do a lot of filtering to match the frequency response of the speakers and all of that. If you don't have that, it sounds very thin, and a lot of laptops need some extra processing to make them sound great; sometimes that's why they sound a lot better on Windows. We don't do any of these things yet, so that's also something we can do with these filters. All right. Something else that I don't mention here because it's actually another project: the session manager. One big component in all of this is the session manager; we normally use WirePlumber. It orchestrates all of the things that happen in the graph, the devices that appear: if a player comes, where is it going to be linked, how is it going to be linked, is it going to do mixing or downmixing, or does it need some filters before it does that. All of these rules are external to PipeWire, in the session manager, so a lot of work is happening there too. It's a separate project, but there's, for example, a 0.5 version coming out where all of the config files are rewritten in a different way. So that's also a change, and interesting things are going to happen there. For the PipeWire daemon itself, I think it kind of is what it is; no new plans. Okay. Yep. So the usual. We also worked a lot on our documentation.
There's a lot more stuff there. Also the wiki has a whole lot of stuff. It's a bit difficult to organize all of these things, of course. This is weird, I didn't start the browser. Well, I could do that, I guess. We've got tons of information on the wiki; all of the stuff should normally be documented somewhere or another. The problem is that there is so much configuration and so many options that people get lost. I tried to do some simple guides: how do I enable multiple sample rates, and you literally have: make this file, put that in it. That's it. All right. And GitHub, that's where we are. So yeah. Questions. Yes. Speaking of docs, I was looking at them just the other week. I assume you have the ability to use your own event loop manager rather than the one in the basic tutorial, which says create this PipeWire one? Yeah. So the question is, can you use your own event loop manager, or do you have to use the PipeWire one? You can use your own one. The PipeWire one, you can make it and then you can get the file descriptor from it and add that to your own loop. For example, GNOME Shell does that; it uses the GMainLoop. KDE as well. Is there something, or do you know some project, which hooks speech recognition into the audio part and creates subtitles on the fly, there in the stream? The question is, is there an application that hooks into the audio stream and generates subtitles on the fly? No, but it's a great project, I think. Yeah. There's also the case, for example, of keywords, so listening for keywords like "hey Google", "okay Google" or something like that, or, I don't know, "hello GNOME". Yes. When you talked about consuming the virtual camera, would you be able to send those sources to multiple destinations? Yes, so the question is, can a camera be sent to multiple destinations? Yes. There can be multiple consumers from one camera in PipeWire. I can actually show that. Just to show what's going on.
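The multiple-sample-rates guide mentioned really is that short: a drop-in file along these lines does it. The file name is arbitrary; `default.clock.allowed-rates` is the relevant PipeWire property, and the specific rates listed here are just an example:

```
# e.g. ~/.config/pipewire/pipewire.conf.d/10-rates.conf
context.properties = {
    # let the graph switch between these rates instead of resampling everything
    default.clock.allowed-rates = [ 44100 48000 88200 96000 ]
}
```

After restarting PipeWire, streams whose native rate is in the list can drive the graph at that rate directly.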
How am I going to do that? I can, for example, start OBS. So that's one consumer using the camera. And there's also, let's say, I think there's an example here. It's in build... examples... I think it's called video-play. Other way around. No. The thing is, of course, the second one has to have the same resolution as the first one; there's no conversion going on immediately. There's a way to reorganize the negotiation and all of that, but that is again a policy for WirePlumber, I think, so it's not immediately implemented. Yeah. I was curious about the network capabilities. I know that there is an AES67 plan, and I was also wondering if there is the same thing for video, maybe SMPTE 2110 or NDI or things like that. So the question is, RTP or network support for video? Completely unimplemented. At all. So only done for audio. Yep. What is the current stage? Have you been involved with people who are using the AES67 communication? Yeah, I know people are testing it. There's an issues page about the state of it, so I'll have to look up what it is exactly, but you can find it if you look for AES in the issues. You find there all of the hardware that people test with, the things they have, the tweaks they have to do, and then we try it all, so that's ongoing. I have to go over here. Yeah. Thank you for making it, because I switched to PipeWire like two years ago and it was just a very pleasant experience because it just worked. Yeah. And I've also been using it in music-related stuff, and it replaces JACK for me as well. Yeah, it's great. Cool. That was the plan. If I have to repeat the question: it wasn't a question, it was just praise. Yeah, we have more questions. I have two questions. The first one is about WirePlumber: does it have a GUI, or is it just command line? Command line. So the question is, does WirePlumber have a GUI? No. No GUI.
So you can, for example, have several applications and all the sources you can.
dublang, a multi-language live coding system
Hello. Thank you, first, to the organization for accepting my talk, and thank you for coming. In this talk I will present software that I have been developing for, I think, two years now, named dublang. It's a multi-language live coding system. I will do a very short presentation here with a small video demonstration. First, a bit about my profile: I work as a research software engineer inside a project named the Cortex Platform, in France, at the University Gustave Eiffel. I am also a collaborator of the Software Heritage project as an ambassador, and a Debian contributor, and, as a hobby, I am a live coder and visual artist. I am very interested in live coding to produce sounds and to produce video as well, and that's why I created this tool, to support my interest in this subject. First, the name of the project; I think it's important to mention where the inspiration comes from. The dublang name is inspired by the musical style dub: dub consists of remixes of existing music, and dublang consists of remixes of existing software. One of the goals of the dublang tool is to have a single textual live coding interface to manipulate and use multiple different tools in the same source code, in the same session. Then, how is it designed? The dublang system is designed in a client-server architecture. On the client side I am using the Neovim text editor, because I found it very easy to extend using the Lua language, which is a really nice scripting language that fits the purpose of this tool very well: as the purpose is to mix different tools in the same environment, a scripting language like Lua works very well. On the other side I have the servers, which are managed as systemd services. Then here is one example of what dublang source code looks like.
Here is an example where I have two different languages: a marker with a hashtag and an exclamation mark defines a region for a specific language, and I can have, in the same source code, different regions with different programming languages. For each language I have to implement an extension inside the dublang system through a plugin; the dublang architecture is pluggable, and I can create new plugins to integrate new languages or new tools. Let's see if I can play this video here as an example. I hope the sound works. The sound doesn't work? No, it doesn't. But I don't have sound yet at this moment. Oh, that's still something. You can try. You should be here. Plugging this into the audio object. If you're lucky, you might get sound. What do I do with this? Should I click left or right? Sorry. Oops. What happened? I don't know what happened here. Oh, man, what's happening? Okay, there we are. I clicked the wrong button. Sorry. It won't full-screen, apparently, but there you are. Let me go back one or two seconds. Here I have more or less the same code I showed in the previous slide. Where is the sound? I lost the sound. Ah, look here. Oh. Yes. Then here, when I evaluate this, it's being executed by the SuperCollider server. And then in the same source code, I'm going to add some... I think I finished my time. Just to finish: this is Bambam, being executed by the TidalCycles language. So two different servers, and the client sends each region to the proper server. Sorry for extending my time. Thank you for your attention. No time for questions, I suppose. Thank you.
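The region mechanism described, a hashtag-plus-exclamation marker opening a per-language block that the client routes to the matching server, can be sketched like this. The exact `#!name` marker syntax is inferred from the talk, and the real dublang client is Lua inside Neovim; this Python toy only illustrates the splitting-and-routing idea:

```python
import re

def split_regions(source: str):
    """Split a dublang-style buffer into (language, code) regions.
    A line beginning with '#!<name>' opens a region for that language."""
    regions = []
    lang, lines = None, []
    for line in source.splitlines():
        m = re.match(r"^#!(\w+)\s*$", line)
        if m:
            if lang is not None:
                regions.append((lang, "\n".join(lines).strip()))
            lang, lines = m.group(1), []
        elif lang is not None:
            lines.append(line)
    if lang is not None:
        regions.append((lang, "\n".join(lines).strip()))
    return regions

buffer = """#!supercollider
{ SinOsc.ar(440) }.play
#!tidal
d1 $ sound "bd sn"
"""
for lang, code in split_regions(buffer):
    # in dublang, each region would be sent to the server for its language
    print(lang, "->", repr(code))
```

In the real system each `(language, code)` pair would go over the wire to the systemd-managed server for that language (SuperCollider, TidalCycles, ...).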
From the lab to Jupyter : a brief history of computational notebooks from a STS perspective
Hi guys. So, no demo for me; I'm just here with some food for thought. I will talk, as a social scientist, about a specific case. I have very little time, so I will move very fast, and I want to do two things: first, a very, very short history of Jupyter notebooks, and then a sort of plea for better knowledge of the way scientific software is made, and of its history, because I think it is lacking in our area. The question, and my starting point, is: where are the stories of our scientific software right now? I mean outside specific events like this one, and globally in the mainstream scientific arena. Because we don't say a lot about software anywhere, and it ranges from bespoke hand-made code to international stars. So software is everywhere around research, but there are very few stories of how it has been made and how it evolved. And social sciences have rarely looked at this software; when they do look at it, they show there are very specific dynamics going on. Research software is open-ended, aiming at uncertain goals; researchers are usually not trained as developers; and there are very specific funding constraints on how the software is developed. And there are specific consequences to the way these kinds of software evolve: the code can have some brittleness; there is a lot of intertwinement with scientific activity; it led some researchers to become specialized in software engineering and developing software; and it led to a lot of specific trajectories. We have seen one with J.F.E. Light just at the beginning of this day in the room. So I want to take a step back, because there are a lot of open questions about that. First, how can we tell the stories of our scientific software, and how can social sciences tell those stories? Because there are different journeys, especially in open source, and there are different steps in the history of each scientific software project.
Sometimes it stops, sometimes it continues for years and years. And on a broader level, there is much intertwinement between open source and academia. Especially: what are the links between open source and science, and how is the connection made between academics and software engineers? Just to quote Christopher Kelty in Two Bits, about UNIX: in fact, UNIX spread first to university computer science departments, and not to businesses, government or non-government organizations; and it also became part of the pedagogical practices of a generation of programmers and computer scientists. So there is something connecting open source and open science. In my very little time, I want to work with a specific case, which is the case of Jupyter notebooks. To say it in one sentence, it is an innovation going from research to becoming a worldwide infrastructure of data science. Notebooks were released in 2011-2012 and spread everywhere, and the project won the ACM Software System Award in 2017. It is the perfect viewpoint to see how a scientific software emerged, how it progressively became more and more abstracted from its starting point in the laboratory, and diffused within and outside academia. If you want the long version, in French, there is a paper on HAL, but I will keep it very short. I'm not here to advocate for Jupyter notebooks. I use them, I love them, but I won't try to convince you, and I'm quite sure there are a lot of people against them around here. If you are not against them but want to see why people are, just have a look at Joel Grus's talk. But I'm assuming that you know approximately what Jupyter notebooks are, because I have no time to discuss them now. What I just want to tell is a very quick story.
It starts with a PhD student, then a specific script, which became IPython; then notebooks appeared; and finally we got Jupyter as we know it currently, which is basically an infrastructure for interactive data science with different kinds of languages. You can see this evolution in the IPython releases, with the progressive emergence of notebooks around 2010 and the appearance of Jupyter. Let me go back over those different steps. So let's dive into this history. The important part is to have the context of the early 2000s, the turn of the millennium. We are at a moment with a lot of achievements of the free software movement and open source development. Around the laboratories there is the paradigm of literate programming, from Knuth. For people coming from computational science or mathematics, there is a lot of proprietary software specialized in interactivity with programming, like Maple, Mathematica or MATLAB. At this moment there are also the beginnings of the scientific Python community, which is just starting to develop, with the first SciPy workshop organized in 2002 in Austin, Texas. In this context, Fernando Pérez, who was at the origins of IPython and then Jupyter, was a PhD student in his fourth year, trying to finish his dissertation; he wanted to move from proprietary software to open source and Python, and needed something more interactive to do his work. The script which would become IPython was a simple personal fix for a problem in his own workflow, and was really grounded in his common sense as a researcher in physics and computational science. He wanted something to make sense of programming with interactivity. And this was the idea, the value present at this moment, that would unfold in the project.
In this nascent phase, the SciPy community, so the scientific Python community, was quite an amplifier, and there was a very quick reception by this community; Enthought, the company which backed SciPy, hosted the IPython pages on their website. They got a lot of support from this community, feedback and contributors, and quickly after this start other contributors joined the project, especially Brian Granger, who jumped in in 2004. They managed to secure the financial possibility to continue: it was sustained by the postdoctoral grants Fernando Pérez got at Colorado Boulder, and then thanks to the support of a team in Berkeley, which he joined in 2008. So the fact is, IPython is something really well grounded in academia and the SciPy community. If you look at the main contributors of IPython, almost everyone had a PhD; some of them got a position even later, after the emergence of the software. And notebooks, in this context, were just a feature of IPython which appeared later. Between 2004 and 2011 the project developed, a lot of support was given by the Python community, a lot of features were added, and they tried multiple times to add a notebook feature, because it was something already present in other software. There were five failed attempts before they were able to make a first viable version of notebooks, because some technology, especially on the browser side, was not available. So in 2011-2012, a new release of IPython included IPython notebooks. It was the beginning of the history of Jupyter, and it worked pretty well: it was really quickly adopted by the SciPy community, and then beyond the first specialty frontiers of the developers of IPython. And in 2021, Nature could say that Jupyter notebooks are one of ten computer codes transforming science; sort of a huge thing inside the SciPy community.
But progressively the notebooks became something more important, and they led to an abstraction of what a notebook is and of the way researchers use programming in their work. There were two dynamics: the first was a movement of abstraction out of the Python community, and the other was a strengthening of software engineering practices in the project. This allowed the project to make a split and to move from a very specific IPython tool to something more general, more abstract, which became the Jupyter project, backed with six million dollars of grants from foundations that support open science. It was a huge move, because it led to refactoring the code, changing the philosophy, restructuring relations within the whole project, and there was a lot of money involved, because it needed a lot of hiring of software engineers to do so. At this point, Jupyter became something which escaped the academic world and saw worldwide adoption. Notebooks became a standard of data science, and they were integrated into a lot of services, like Google Colab, or used in already existing third-party tools like Visual Studio Code. So it was a turning point in the way this initially scientific project became something way bigger than the scientific community. And somehow I will stop here, because it opens a lot of questions. Of course, for the research community, the questions are: who are the current users of computational notebooks, what kind of work are they doing, how does it change the way we program? But at this point, the question I want to carry here is: is the Jupyter project, or its software, still scientific software? How does something which was created within the scientific community start to take on another dimension, to become something bigger, no longer just a research tool?
So, just to wrap up, because I am coming to the end of this presentation: I want to stand for more historical documentation, not only documentation of code, but historical documentation of how those specific software projects are associated with scientific specialties, institutional backgrounds, funding possibilities. We need to take these specific dynamics seriously: of course for computational notebooks, as we are trying to do with other colleagues in different projects, and there is a GitHub repo if you want to add some archives to the story, but also for all the other tools that are inside our laboratories, inside our daily routine as scientists, because they are a huge part of the way we craft knowledge, and they don't have the same history as other, more material artifacts and scientific instruments such as telescopes or particle accelerators. So that's my point; I finish here, thank you. Sorry for the speed. How can we define scientific software? Very neat question. Can I, and how can we, define what is scientific software? I think the only way I can answer is: software crafted within the context of scientific research at some point, built not for making a complete tool but for answering a specific research question at some point in the advancement of knowledge. And there is a whole literature about the way scientific software is really different: at least at the beginning, it doesn't take versioning or unit tests really seriously; it is quite skirting the good practices of software engineering. At least at the beginning; then, if the software is still around a few years later and gains more users, it starts to integrate those good practices. So somehow there are two universes, different in organizational and social terms, and I would say scientific software is defined by the
Beyond Ratings: Empowering Communities through Wikirate for Transparent Corporate Impact Research and Analysis.
Thanks. Hello. So my name is Vasily Gikazyaki, and I'm a data engineer with Wikirate International. I'm going to talk about how Wikirate empowers communities for transparent corporate impact research and analysis. But before we get into details about what Wikirate is, I would like to talk a little bit about the problem with environmental, social and governance data of companies. Usually, when it comes to ESG data, we can say that it is expensive, exclusive and inconsistent. A lot of datasets are hidden behind paywalls, so individuals need to pay thousands of euros per year to get access. Additionally, there are a lot of organizations producing ratings about companies, but the problem is that they don't provide access to the low-level datasets, so it's difficult to really understand what they are rating; and they don't make the methodologies or the sources transparent either. Finally, in the last few years companies have started reporting more ESG data in text format, in sustainability reports, but a lot of company reporting is not standardized, and that hinders large-scale analysis and comparisons between companies. So what makes open research so important in the context of corporate accountability? It fosters transparency in corporate practices and empowers different stakeholders, especially people who don't traditionally have access to those data, people who aren't, let's say, investors and don't have the money to pay for access to this ESG data. It encourages collaboration on a global scale, promotes data-driven decision and policy making, and drives positive change. So Wikirate is an open source, open data platform that brings corporate ESG data together in one place, making it accessible, comparable and free for all.
It's a wiki, which means that anyone who has a passion for sustainability and ESG data can come to the platform, contribute to the research and the available data, and organize their own research as well. Our community is mainly comprised of civil society organizations, academics, university students, and data and sustainability enthusiasts. We strongly believe that in research, and in research on companies, everything starts with a question and ends with an answer. So I would like to give you a short overview of the structure of the data on Wikirate. These research questions we call metrics, and for each metric we can have several answers; each answer is linked to a specific company, a year of reference and a specific source. Here we have an example: did Airbnb UK Limited produce a statement in relation to any modern slavery legislation or act in 2022? The answer in this case was yes: it produced a modern slavery statement under the UK Modern Slavery Act, and there is a source and citation linked to this answer that leads to the actual modern slavery statement of the specific company. In addition to researched metrics, Wikirate also provides calculated metrics as tools for calculations and analysis. We can say that the researched metrics are the building blocks for analysis, and the calculated metrics are built on top of them and allow users to run calculations. Namely, we have score metrics and formula metrics; formula metrics allow users to run their own calculations in CoffeeScript, so they can be quite complex, or not complex at all, depending on what the users want to do with the data. These calculated metrics help to bring transparency into ratings. Here we have an example with the Fashion Transparency Index, a rating that scores fashion companies based on how transparent they are on different sustainability topics.
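The data model described, a metric question answered per company and per year with a backing source, plus a calculated metric built on top of the researched ones, can be sketched as follows. Wikirate's real formula metrics run CoffeeScript on the platform; this Python stand-in, with made-up data and a made-up scoring rule, only illustrates the structure:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    metric: str      # the research question
    company: str
    year: int
    value: str
    source: str      # citation backing the answer

# hypothetical sample data, shaped like the Airbnb example from the talk
answers = [
    Answer("Modern Slavery Statement?", "Airbnb UK Limited", 2022, "Yes",
           "UK Modern Slavery Act statement (hypothetical citation)"),
    Answer("Discloses GHG emissions?", "Airbnb UK Limited", 2022, "No",
           "2022 sustainability report (hypothetical citation)"),
]

def formula_metric(company: str, year: int) -> int:
    """A toy 'calculated metric': one point per 'Yes' researched answer."""
    return sum(1 for a in answers
               if a.company == company and a.year == year and a.value == "Yes")

print(formula_metric("Airbnb UK Limited", 2022))  # -> 1
```

The point of the structure is that every calculated score can be traced back through the researched answers to a concrete source document.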
We have a partnership with Fashion Revolution, which is also an NGO; they do this analysis and research, and we are helping them to make the research, the data, the ratings, the analysis, everything, transparent and available to the public. One source of data on Wikirate is of course data coming from the ground, from civil society organizations, but there are also a lot of data in the public domain. It's easier to bring in structured and semi-structured data by building data integration pipelines into Wikirate, but of course we have the challenge of unstructured data and how we are going to bring those kinds of answers into the platform. For that reason we run research projects and call for volunteers to come and research those reports, finding answers to questions on specific topics like modern slavery, greenhouse gas emissions, etc. So how is the data used? One use case of Wikirate data is building data dashboards that are used for advocating for change. One example is Fashion Checker, which was developed in partnership with the Clean Clothes Campaign and advocates for worker rights, especially on the supply chains of fast fashion companies. We have the Beyond Compliance dashboard, a partnership with the Walk Free Foundation: a living data dashboard that assesses modern slavery reporting, tries to highlight gaps in it, and pushes for new legislation and new policies. The data is also used for writing news articles, for helping CSOs, civil society organizations, produce reports and make research findings and analysis transparent, and for writing research papers. Wikirate data are free, under a Creative Commons license, so anyone is welcome to use and explore the data. They can do it through the API and through the user interface.
We have an available RESTful API and several wrappers that allow users to pull data from the platform, and also to contribute data to the platform if they want to. We also have a GraphQL endpoint that allows users to form more dynamic queries based on their needs. So, where to start with Wikirate? If you're interested in contributing data, please start with the guides, as most questions are answered there; but of course, if you have more questions, you can contact us directly. We have several projects in need of contributors: you can help us improve the data, and we have verification tasks, where we ask the community to help us with the verification process. If you are interested in volunteering, the links are available in the slides. You can contact us if you want to share ideas with us, form partnerships or get support. As I said in the beginning, Wikirate is an open source project written in Ruby; you can check out our GitHub repository, and if you want to get started with Wikirate and Decko, you can do so. You can also create your own data dashboards if you're interested in ESG data. So, I think that's all from my side; maybe it was too fast. Thank you. Thank you. We have maybe four minutes for questions. Hello. Hi. I have a question about whether AI has helped you in any of these processes, for example while manipulating or getting data from the public domain. Yeah, so the question was whether AI helped us in any way in obtaining data from the public domain. The answer is that we are considering using AI and LLMs for extracting more structured answers from text reports, but we are still in the testing phase. I hope that answers your question. Yeah. How many companies are covered by the dataset? Are there specific industries you are targeting?
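Pulling answers over the RESTful API boils down to fetching JSON per record. A hedged Python sketch: the URL pattern and the response fields shown here are illustrative assumptions, not the documented Wikirate API (check the official guides for the real endpoints), and the network request is commented out so the snippet runs offline against a hypothetical payload:

```python
import json

BASE = "https://wikirate.org"

def answer_url(metric_designer: str, metric: str, fmt: str = "json") -> str:
    # Assumed pattern: designer and metric joined with '+', spaces as underscores.
    card = f"{metric_designer}+{metric}+Answer".replace(" ", "_")
    return f"{BASE}/{card}.{fmt}"

url = answer_url("Walk Free", "Modern Slavery Statement")
print(url)

# In real use you would fetch `url` (e.g. with urllib or requests).
# Here we parse a hypothetical response offline instead:
sample_payload = '{"items": [{"company": "Airbnb UK Limited", "year": 2022, "value": "Yes"}]}'
data = json.loads(sample_payload)
for item in data["items"]:
    print(item["company"], item["year"], item["value"])
```

The GraphQL endpoint serves the same data but lets you shape the query, asking for exactly the companies, years and metrics you need in one request.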
Yeah, sorry. The question was how many companies we cover at the moment on the platform. We cover about 140,000 companies. The biggest focus of research is on the biggest companies, so more data can be found for the very popular, let's say, companies; and because we have had a lot of projects on the fashion industry, we have a lot of data about fashion companies. In total, at the moment, we have five million answers. Any other questions? Yeah, sorry. So it's open to contributions, as far as I recall, and you mentioned some verifiers. I want to ask how you make sure the data is consistent, and how you go through the checks to see if the data is reliable. Yeah, so the question is how we check that the data coming into the platform is reliable. It's a fair question, because we are doing crowd research and sometimes people do not have the expertise on ESG topics. What we do is have different verification levels: we consider an answer verified when more than two people have come to the same conclusion, and you can see on the platform that we have steward-verified and community-verified answers. Stewards are usually members of the community who have more expertise on the specific topics the research is about. I'm always the person who says let's squeeze in as many questions as possible, so we'll do really rapid questions right now. Very quickly: I was just wondering if this could be expanded to cover other types of data rather than ESG. The question is whether this could be expanded to cover other types of data. Yes, it can. One use case that comes to the top of my mind: now we have companies, but you could have something similar for countries, highlighting for instance the electricity or water usage or CO2 emissions per country, instead of focusing specifically on companies. Thank you so much.
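The verification rule described, community-verified once enough independent researchers reach the same conclusion and steward-verified when a topic expert signs off, might be sketched like this. The threshold follows the talk's "more than two people have come to the same conclusion"; the exact platform rule and the label strings are assumptions:

```python
from collections import Counter

# "More than two people have come to the same conclusion" per the talk,
# so require at least three agreeing submissions (exact rule assumed).
MIN_AGREEING = 3

def verification_status(submissions, steward_checked=False):
    """submissions: values entered independently by different researchers."""
    if steward_checked:
        return "steward-verified"
    if submissions:
        value, n = Counter(submissions).most_common(1)[0]
        if n >= MIN_AGREEING:
            return "community-verified"
    return "unverified"

print(verification_status(["Yes", "Yes", "Yes", "No"]))    # community-verified
print(verification_status(["Yes"], steward_checked=True))  # steward-verified
print(verification_status(["Yes", "No"]))                  # unverified
```

The idea is that crowd agreement and expert review are two separate, visible trust levels rather than a single opaque "verified" flag.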
Wikimedia projects and OpenStreetMap as an Open Research Infrastructure
The aim of this presentation is to show how Wikipedia, Wikidata, Wikimedia Commons, the Wikimedia projects, and also OpenStreetMap and other resources can be used as an open infrastructure for research. We're talking about websites that are based on an open infrastructure, so they're based on open and free software, and of course all their content is openly available. What is also interesting about this ecosystem is that it's incredibly multilingual: you have a really wide community of contributors in over 300 languages. Even more, it's one of the biggest existing online communities, and this is obviously a feature if you want to collaborate with citizens, which is one of the aims of open science: working and collaborating among people and institutions. Also valuable is the fact that we're talking about resources that can host different kinds of content. It can be data, but it can also be images, audio, or documents, with a community that can contribute in different ways to improving this content: it can be restoration or improvement of images, adding captions, or transcribing documents. There are many advantages to those projects, and some of them are very well known. Visibility is probably one of the biggest: we're talking, for Wikipedia, about 28 billion views per month. Consider the visibility that Wikipedia and Wikimedia Commons have provided to a collection like that of the Met, the Metropolitan Museum in New York: it moved from a collection that was viewed two million times to ten million times. So the visibility of those projects is very impressive. But we're also talking about an international community, a community that has chapters around the world and the desire of enlarging itself, with policies and funding that have been created for this.
Also, we're talking about reusable resources, resources that really provide content, information and data that are available also to people who don't have particularly technical skills. And there are other features, like the FAIR data principles, which are applied on all those resources, but also an attention to new ethical principles, the CARE principles, or the synergy with open government and with GLAMs, the cultural institutions. So those resources are already used in research. Wikidata is probably one of the major examples, and the beautiful project Scholia, which provides information about researchers and topics, is one example you might access. There has been a lot of work on how to use those resources as a research infrastructure, and I'm just quoting some of the papers related to this, focusing on Wikidata. Daniel Mietchen has done an incredible job in this; he was also a Wikimedian in residence for the Open Knowledge Foundation, from which we just heard a presentation. He was the first Wikimedian in residence there, and he worked extensively on open access, improving content on Wikipedia related to it, and also improving the communication of the projects within the open science system. In 2015, there was a project proposal on using Wikidata as a virtual research environment, which was very promising. It was not financed, but it gives an idea of how the infrastructure can be used, and is already used, in this direction. Furthermore, there are studies highlighting how... sorry, I need to breathe; you have all noticed that this is something I sometimes forget. So, going back: in 2019 a study came out about Wikidata showing that it is already extensively used, but also that arts, humanities and social sciences are not very present in the field.
And research on how the arts and humanities use Wikidata shows that there are projects that use the data, but few projects that collaborate through the data, that is, that create a community which actually uploads data from research and reuses it. So I'm going to present to you three positive elements and three challenges that I encountered in my work related to the arts, humanities and social sciences that I think are interesting to highlight. For the advantages: first, the combined use of all those resources together, so not only using Wikidata, but really taking advantage of the different formats those resources allow you to upload. A second element is the broad interest in heritage and museums, so the existing and real attention that is on those projects. And the last one is the possibility of visualizing and monitoring content. I breathe a moment and then get back to you. For the challenges, a major one is copyright and the restrictions on the public domain; then the difficulties, of course, of collaborating with a community; and also the challenge of scaling up and working with different skills. So, the first element, the possibility of using the whole infrastructure, is particularly interesting for the humanities, arts and social sciences because it allows you to really bring in research resources and data. In the humanities and social sciences you also have a lot of qualitative data: you have interviews, photos, site explorations, artworks, content that comes from archives, for example. And on those resources you find the possibility to upload it. Also, working with OpenStreetMap allows you, for example, to enter data that Wikidata would not allow, so the combination of the two really allows a broader work on those infrastructures. This is an example that comes from the upload of data from the Ticino region, a region in Switzerland.
The upload was done on Wikidata but also on OpenStreetMap, with the upload of images on Wikimedia Commons and the creation of articles on Wikipedia. The second element is related to heritage. At the moment, 97 nations have participated in the contest called Wiki Loves Monuments, and they have uploaded an incredible amount of data, but they have also worked at creating one of the most incredible databases of heritage sites around the world. This content enriches the existing resources, but it can also be used to evaluate the existence of images and the presence of heritage in different countries. This is a visualization we've been working on that also allowed us to create research based on the analysis of those data. Another focus of the community is working on content coming from GLAMs. GLAM stands for Galleries, Libraries, Archives and Museums, so we're talking about the broad network of cultural institutions. Consider also that universities have libraries, collections, archives, so it's very strange how research institutions sometimes perceive the GLAMs as separate, and there is sometimes a great difficulty in bridging the two. A lot of research in the humanities and social sciences also comes from those sources: you work on documents, on images, on collections, and this is really a centre of interest for researchers in those fields. And the Wikimedia projects, particularly after 2006-2008, have really invested a lot of energy in encouraging institutions to become open access and to upload content on Wikimedia Commons, with synergies with Wikidata. In Italy we did a project in which we contacted all museums. We created the most complete existing database of museums in Italy; it was done in collaboration with ICOM Italy. We uploaded national statistics about museums.
So on Wikidata you can really access all the available data about museums, and museums in Italy are quite numerous, as you can imagine. They also started collaborating and opening up their content, to make sure that museums were engaged in checking their data and contributing with authorizations; this is a topic that I will touch on shortly. We created a form that allowed a museum to upload an authorization for its content, because in Italy there are restrictions even on the public domain. This form was developed with Daniela Scasciatratte, who might be here, one of the developers, to facilitate institutional contribution to the projects, which is one of the problems with Wikimedia Commons: you need to be an individual to contribute to the Wikimedia projects. So you need an external interface or a system that associates to a user an authorization giving that user the authority to upload content for an institution. That is a step that is still missing. These data allow you to produce research: you can monitor museums in a country, you can see whether they have a person in charge of communication, what their collection looks like, whether it is digitized or not. And here we come to the third positive aspect of the Wikimedia projects and OpenStreetMap: you can really visualize content in amazing ways. And visualizing content doesn't simply mean "I have a statistic, I see what is there." It also means visualizing knowledge, because what is on Wikipedia and what is in Wikimedia Commons very often gives you information about what is available as knowledge, for example images of heritage. In Italy we had a discussion with the Ministry of Culture because images of certain areas of Calabria were missing. So the community actually negotiated the data with the government, and then produced content that is now accessible also to the government.
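As an illustration of what "accessing the available data about museums" can look like in practice (this sketch is my own, not part of the talk), Wikidata's public SPARQL endpoint can be queried from a few lines of Python. Q33506 ("museum"), Q38 ("Italy"), P31 ("instance of"), P279 ("subclass of") and P17 ("country") are standard Wikidata identifiers:

```python
import json
import urllib.parse
import urllib.request

# Items that are an instance of (P31) a museum (Q33506), possibly via a
# subclass (P279), with country (P17) Italy (Q38).
QUERY = """
SELECT ?museum ?museumLabel WHERE {
  ?museum wdt:P31/wdt:P279* wd:Q33506 ;
          wdt:P17 wd:Q38 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,it". }
}
LIMIT 10
"""

ENDPOINT = "https://query.wikidata.org/sparql"

def build_request(query: str) -> urllib.request.Request:
    """Build a GET request asking the Wikidata Query Service for JSON."""
    url = ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    # The service asks clients to identify themselves with a User-Agent.
    return urllib.request.Request(
        url, headers={"User-Agent": "museum-monitoring-sketch/0.1"})

def museum_labels(query: str = QUERY) -> list:
    """Run the query and return museum labels (network access required)."""
    with urllib.request.urlopen(build_request(query)) as resp:
        data = json.load(resp)
    return [row["museumLabel"]["value"] for row in data["results"]["bindings"]]

# Example (requires network):
#   print(museum_labels())
```

The same query, pasted into query.wikidata.org, also produces the map and chart visualizations the talk alludes to.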
So monitoring what is available there somehow provides an image of what is actually available on the internet, and to anyone. Monitoring knowledge is also interesting if you are contributing to it and modifying it. If a museum or a researcher is improving content related to, say, the architect Paraviccini, you can really see how you made an impact on that knowledge. And it's quite incredible to visualize this impact, because normally the impact of research is measured with completely different criteria. Improving knowledge is, of course, the mission of a museum; if their mission were simply to have a lot of visitors, they could offer beer, that would make it a bit easier. It's really a way of changing the perspective on how you create research that is genuinely available and visible. Now I breathe a couple of moments and then I move to the challenges. Okay, the first one is copyright. This is an issue that is present in all humanities and social science research, and it is obviously a very well-known challenge. You would expect that, for example, you take a photo of a monument and you upload it. Actually, things are a little more complicated than that. You need to identify what is heritage and what is not. You need the rights of the photographer, obviously, but there are other issues related to property and to the rights of the author of the building. If you live in a country that has freedom of panorama, you can take photos of everything you can see outside, so you're fine. But many, many countries do not have freedom of panorama, and unfortunately it's a right that the European copyright law did not make available to everyone. So in those countries you need to ask the authorization of the architect, if they have not been dead for more than 70 years, or of the artist who produced an artwork. This is a layer of complexity.
Furthermore, there are layers of complexity added to public domain content. This is tricky; maybe it will change, because in theory with the European copyright law we may be moving in the right direction. But in Italy you need to pay a fee for every commercial use of content in the public domain, and this is obviously a very complex block. So those restrictions create layers of complexity and make it more complicated, of course, to upload content to the Wikimedia projects, in particular because those projects really want content that is clearly open and accessible. I still have a lot of time, so I should relax and make sure I tell you everything I might know. We did a project to explore the impact of culture on safety in Africa. We did it in three countries, and in Cameroon in particular we worked a lot with authorizations. Douala, in Cameroon, has had a great production of public art since 1991. There are artworks disseminated across 13 neighbourhoods, and it's quite an incredible project, because they've been commissioning artworks from international artists and local artists, and you can see the transformation of the city through those artworks. So what we did: we uploaded images on Wikimedia Commons; we created data on Wikidata, connected of course to the categories; we created a list of artworks on Wikipedia in English and in French; and we uploaded text, because all the production of the research project was under CC BY-SA and with authorization. For every single institution and every single author we created a permission that was then sent to the permission-recording system of Wikimedia Commons, and this really created the possibility of uploading content. Since it was done in Africa it was a bit more complicated, so we printed forms that the artists would sign; we scanned them and sent them to the permission system, which recorded them and registered a ticket.
Of course I took an example that is particularly complicated, because public art in Africa, with a living artist and no freedom of panorama, is probably the worst you can get. But it's feasible: it's complicated, but it's something that is possible to do. Of course it requires a lot of changes in procedure, and the need to create processes that allow the upload of those authorizations and facilitate this connection between institutions and rights management. The second challenge is related to collaborating. Now, I don't know how many of you have contributed to Wikipedia. How many had their content deleted on Wikipedia? This is an experience that I think everyone has had... So contributing to the Wikimedia projects is not easy. It's a little easier sometimes on Wikisource and Wikiquote, which I would recommend as a first step; if you want to go on holiday, those projects are quite fun. Also, you know, Wikivoyage can be challenging too. So those projects have a lot of rules and policies, and collaboration is never easy; everybody who collaborates knows that it is challenging to involve other people and to create processes that are transparent. There are some specific rules of the projects that researchers need to pay particular attention to, in particular, of course, "no original research", which is also an advantage for a researcher, because you cite the work of everyone and you source everything. Also conflict of interest, the fact of declaring why and how we are contributing; and the neutral point of view for the encyclopedia, because Wikipedia is an encyclopedia, so it doesn't provide space for everything. But it's true that for museums, cultural institutions, heritage, and also for improving articles related to territories, which are very connected to topics of architecture and art, Wikipedia is perfectly suitable. You also need to consider the relation between research in the humanities and dissemination.
Sometimes the boundary between storing information and disseminating information is blurred. Sometimes scholars would also like the way they store information to be beautiful and accessible, because it's something that might interest a broader public. So it's sometimes not sufficient to store a folder on Zenodo; you would like to have an interactive map that allows you to see a building and have access to all the documents on it. And the Wikimedia projects can somehow provide this infrastructure. The last issue is how to make this scalable. Of course, working on licensing, working on CC0 for data, is an issue. But the upload of content to the Wikimedia projects requires a certain expertise, and what I saw in the past is that projects very often worked when the community was involved, so people who were already experts in those projects. This joint work, and maybe also the model of the Wikipedian in Residence, could be an interesting approach for the Wikimedia projects and OpenStreetMap. Finally, I wanted to mention that I'm working on a landscape analysis of research infrastructures for the social sciences and humanities. I started on Meta-Wiki, which is where we always start; as we say, if you don't find it, make it on Meta-Wiki. So here you find a list of research infrastructures, to make sure that Wikidata covers those resources. But the truth is that at the moment there are two problems. The first one is that many local infrastructures and collection databases are not connected, and they're not perceived as research infrastructures because they are too small and don't have national relevance. So having the possibility of bridging those resources, and maybe Wikidata can really provide a landscape analysis of this, would be very valuable. Also, making sure that we know about them is very useful, because those are resources that can also nourish those websites.
And finally there is the problem that government investment in research infrastructure normally focuses on implementing and maintaining the infrastructure, while populating the infrastructure is another issue. There's also going to be a presentation about OpenRefine, which is very important and relevant for this, because obviously you need tools that really allow you to nourish and connect those infrastructures. I'm done. Thank you. So now you can stay there and take some questions. I told everybody that I know not to ask questions. Yes? So the question was whether we have an idea of how much data from research feeds Wikidata and is accessible on Wikidata. I think the 2019 study might give some insight into it, although I think that study was more focused on models rather than actual data. The data is sourced, so I presume it's information that is possible to view on Wikidata; that would be feasible. It's true that sometimes the taxonomy of properties, so the possibility of actually getting full access to the information, is not obvious. Also, for research infrastructures, one of the challenges is that one thing is called a virtual library, another one a digital library, another one a repository. So combining all those broader terms makes it a bit complicated to get a full picture. Thank you. Another question. Thank you, I enjoyed the talk. One thing I was wondering, there is a link to an earlier talk here, because I enjoyed the talk about finding open infrastructure, and now you're talking about open infrastructure too; I wonder whether there is a dialogue there at all. Yes. So the question was, if I understood correctly, how to connect the possibility of finding open repositories with this work.
It is important to note that a lot of libraries and existing repositories are already collaborating with Wikidata, so there is a desire. Europeana, which is one of the biggest repositories, has a very strong collaboration, for example, just to mention one of the most well known; it is a repository of GLAMs for open research. There are lots of connections, rather, to repositories that provide information about researchers or papers; this is something that is implemented on Wikidata quite nicely. But it's true that in general the investments do not go to something like Wikidata: investments go either to repositories by topic or to repositories at a national level. I never saw an investment in Wikidata itself, rather at most in creating some interconnection. So, and of course I'm also here to stress this: I think we should collaborate more with Wikidata; that would be valuable, useful and efficient. So that's all, thanks. If you have any more questions, we can welcome them. Thank you.
Unlocking Research Data Management with InvenioRDM
Thank you. Hello, everyone. My name is Karolina, and together with Havi today we will tell you how InvenioRDM is unlocking research data management. But before we start, I would like to ask you if you see any connection between these three images; is anyone able to answer the quiz? And Luisa says that you have three seconds to do so. Those are cats. Those are cats. No, no, sorry. So what about now? Sorry? Yes, you're close. So the connection between the images is CERN, actually, where the World Wide Web was invented. It's located in Switzerland, hence the fondue and chocolate. And it's thanks to the invention of the World Wide Web that you can see the funny cat pictures on the Internet. But that's not the only thing we do at CERN. We house the biggest machine in the world, the Large Hadron Collider, and many more machines which the experiments are using. We also share our knowledge and welcome visitors, so whenever you are in Geneva, Switzerland, please pay us a visit. We do much more than only physics at CERN: we also do open source projects, like, for example, the World Wide Web, which was given back to the public. But that's not the only one, and this is what we are talking about today: Zenodo, which I have been told some of you know already, though you probably don't know what InvenioRDM is. But I will start with Zenodo. Zenodo is an all-purpose research repository where any researcher around the world can just go and store their research results for free, and it is hosted there at CERN for as long as CERN exists. So the question is, why do we need such a place? And this is the answer: crucial scientific data, many years of research work, can simply be lost. Well, we don't want to allow this to happen ever again, so we provide a safe space for researchers to store data. But not only researchers: we also have an integration with GitHub.
So you can cite your software stored in GitHub. And the advantage of storing it also in Zenodo is that GitHub allows you to delete your software, but it will be preserved in Zenodo. We have received many questions about the platform, whether it's possible to take it and install it as it is in another institution. Up to a point it was not possible, but we received so many questions that in the end we developed another platform, InvenioRDM, which is now the engine of Zenodo. Now it is possible to easily upgrade the software and install a new version, and we are basically supporting the underlying engine. So, Havi, if you were to characterize InvenioRDM with one word, what would you say? That's a good question. If I have to use one word, I would say that InvenioRDM is FAIR. And when we talk about the concept of fairness, I'd like to quote our former Director General, who once said: why do I like Zenodo? Because Zenodo is fair, fair in the sense of lower case and FAIR in the sense of upper case. The most conventional use of fairness, like equitable or just, was already covered by the first part of the presentation. Now let's see how InvenioRDM embraces and promotes the FAIR principles, an acronym that stands for Findability, Accessibility, Interoperability, and Reusability. Starting with the first one, Findability: when we upload our research, one of the key things is that we want a link that we can be sure will resolve over time, that is not going to break. For that purpose we have DOIs; a DOI is a digital object identifier, a globally unique and persistent identifier. We encourage people to use their own DOIs if they have one; otherwise one will be automatically generated and registered with DataCite. It's just as important to have good metadata.
That's why we adopted the DataCite metadata schema, a simple yet powerful format to describe nearly any research output: data sets, software, as she mentioned, journal papers, anything you can think of. And of course, to find all this data we need a good search engine, with capabilities such as filtering options and a powerful query syntax that will allow you to find the data even without the identifier. These are key aspects not only for humans to find data, but also for machines. Continuing with Accessibility: a very common use case is that we have our data and we want to keep it restricted, but we still want people to find it. For that purpose you make your metadata public, and if people want to access the data, they have to request access via a simple form, and then you can choose whether to grant access or not. In the same way, you can also share different links with different permission levels, which allow people to view the record and its unpublished versions, or even edit it, to make collaboration easier. Now, if we talk about Interoperability, one key thing is to follow standards. That's why we follow the DataCite metadata schema I mentioned, which includes things like common vocabularies, which allow us to describe data with the same concepts that other people and other machines use, so that we make sure everyone understands it in the same way. Another important thing is that when we upload our work, we have to link it properly with other data that is also uploaded, and you can do that very easily as well. And if we talk about how machines exchange data, we provide a strong REST API that allows you to build your own integrations on top of InvenioRDM, and we also have an integrated OAI-PMH server, which is a standard for how systems exchange data.
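As a minimal sketch of the kind of integration this REST API enables (my own illustration, not from the talk; it uses the public search endpoint of Zenodo, the InvenioRDM instance discussed here, with its documented `q` and `size` query parameters):

```python
import json
import urllib.parse
import urllib.request

BASE = "https://zenodo.org/api/records"  # Zenodo's public, InvenioRDM-based API

def search_url(query: str, size: int = 5) -> str:
    """Compose a search URL for the records endpoint.

    `q` takes a query-syntax string (e.g. 'title:"open science"'),
    `size` limits the number of hits returned.
    """
    return BASE + "?" + urllib.parse.urlencode({"q": query, "size": size})

def search(query: str, size: int = 5) -> list:
    """Run the search and return the list of hits (network access required)."""
    with urllib.request.urlopen(search_url(query, size)) as resp:
        return json.load(resp)["hits"]["hits"]

# Example (requires network):
#   for hit in search('title:"open science"'):
#       print(hit["metadata"]["title"], hit.get("doi"))
```

The same pattern works against any InvenioRDM instance by swapping the base URL, which is part of what makes the platform reusable across institutions.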
If we talk about Reusability, I think one of the key aspects is that when people use our work, we want them to cite it correctly. So here you have different styles of citation, which will always include a DOI. The DOI is very important also to track the impact of your work; if you remember, she talked about software citation, and we know that 85% of all software citations are on Zenodo. And of course, having clear licensing information is also important, so that people know how they can use your data and under what conditions. And I want to stress the metadata again: having rich and comprehensive metadata is very important, not only for people to reuse your data, but maybe also to reproduce it in the future. And since we are talking about reusability, do you think there is something else that we can reuse? Yes, we can reuse the whole software entirely. These are examples of how InvenioRDM was reused by other institutions, by our partners, and as you can see, it's very customizable: those interfaces are very different from each other. So it's quite flexible, if you would like to join this sizable community, which is still growing; we have many partners around the world. And if you would like to install an institutional repository in your institution too, you can get to know more about InvenioRDM under the QR code on the right side. Also, you can pass by our booth, in building K, 2nd floor, and if you are a developer who would like to contribute to an open source project, you can check out our community on Discord as well; we answer questions, and you can see a growing community there. So thank you very much. Are there any questions? Thank you very much for the talk; I already know it, and I like it, by the way. But I have one specific question: you also said you have plans to support records with mixed licenses.
Software is usually not under just one license; there are SPDX expressions and things like that. Okay, I will just repeat the question for the stream. We were also told that it's like our repository, I think that's worth mentioning. So the question was whether we plan to provide more licenses. I think we went very fast over this slide: there are already many standard licenses that you can find, and they're available, but it can also be customized, so whatever license you need, you can add to the software. But if there are multiple licenses, say you have a data file under CC BY 4.0 and code under MIT, then you cannot simply say from the outside that this is only MIT or only CC BY; you need a list, or "CC BY and MIT", or something like that. Okay, you mean if there are multiple values for the licenses attached to one record, do I understand correctly? If I remember correctly, you can have multiple licenses. Yes, so you can have multiple licenses, but you cannot map them one to one: you cannot say this license is for the file and this license is for the metadata, that is not there. Okay, thank you. I think the question is, if I archive software in Zenodo, for how long is it preserved? So the question was: for how long is the software preserved in Zenodo? The answer is: as long as we have a data center at CERN, as long as CERN exists. Okay, but what is the commitment of CERN, in terms of how long it will last? Well, in terms of contract... for now we say forever, but let's see what the future holds. We'll see if the sun goes out. Sorry? We'll see if the sun goes out. Yes. Hello. Sorry, compared to other data repositories at CERN, is it more specialized for particular scientists' research? So I think the question is whether it's targeted at one area of research, is that what you meant? Yes. So it is not targeted, because, like we said, it's very reusable.
We have, for example, universities also installing the software and keeping it as their institutional repository, and these universities might differ in domain. It might be, for example, Northwestern University, which hosts many domains; they do a lot of research. We also have other installations at CERN, one for open data and one internal institutional repository, which we are in the process of migrating right now to upgrade the version of the software. But there are many more usages, so it's not targeted at a single domain. Okay. Time for the next talk, there's another theme. Thank you.
Making OpenRefine more reproducible
Okay. So we welcome Antonin Delpeuch, if I'm correct. And yeah, the floor is yours. Thank you. So I'm Antonin Delpeuch, I'm a developer on the OpenRefine project, and I'm very happy to be back in this devroom to give you a few news about OpenRefine. In particular, I'm going to be focusing on what I'm working on right now, together with Zoe Cooper, who's a designer on the project, to make OpenRefine more reproducible. So I will first explain to you what OpenRefine is, because I'm not assuming everyone was here four years ago. And if you were, don't worry, there are some differences that you might be able to spot; I'm very keen to know if those differences look good to you. And also, what do I mean by reproducible in this context? So, what is OpenRefine? It's a data cleaning tool. You can import tabular data, mostly, into it, and then it lets you do all sorts of cleaning operations on it. Let me give you an example. This is a database of filming locations in Paris: every time you film something in Paris, you need to register it with the city, and then they publish this data set. One thing I can do here is say, let's match all of those films against an external database; we call that reconciliation. In this example, I'm going to reconcile it with Wikidata, which we've already heard about earlier today. And because reconciliation is a bit of a tricky process, we have various options to let you configure how we're going to match your data to Wikidata, so that we don't rely only on the names, but also on other attributes that we have in this data set. We then have various tools to help make that a little bit more efficient and to let you review the results of the reconciliation manually. So for instance, here I can hover over this and get a link to the Wikidata item that it could link to. That's a sample of one type of operation that people do a lot with OpenRefine.
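For the curious, the mechanism behind this is an open protocol (the Reconciliation Service API), so it can also be called outside OpenRefine. Below is a rough sketch of my own, assuming the public Wikidata reconciliation endpoint at wikidata.reconci.link; Q11424 is Wikidata's "film" type:

```python
import json
import urllib.parse
import urllib.request

# A public Wikidata reconciliation endpoint (assumed here; any service
# implementing the Reconciliation Service API works the same way).
ENDPOINT = "https://wikidata.reconci.link/en/api"

def build_queries(names, type_id="Q11424"):
    """Batch of reconciliation queries: match each name against a type
    (Q11424 is "film" on Wikidata), much as OpenRefine does per column."""
    return {f"q{i}": {"query": name, "type": type_id}
            for i, name in enumerate(names)}

def reconcile(names, type_id="Q11424"):
    """POST the batch as a form-encoded `queries` parameter and return
    the candidate matches with their scores (network access required)."""
    payload = urllib.parse.urlencode(
        {"queries": json.dumps(build_queries(names, type_id))}).encode()
    with urllib.request.urlopen(ENDPOINT, data=payload) as resp:
        return json.load(resp)

# Example (requires network):
#   results = reconcile(["Amélie", "La Haine"])
#   for key, res in results.items():
#       print(key, [(c["id"], c["score"]) for c in res["result"]])
```

Each candidate in the response carries a score and a match flag, which is what OpenRefine surfaces in its manual-review interface.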
You can then manually match things if you want to go through the entire data set yourself. Let me show you something else. Well, first, once you've done this reconciliation, you can pull some data from the target database. In this example, I could do something quite simple: let's just add a new column with the URLs of those entities in the database. That's something I can do quite quickly, and you get your new column. You could also pull more information from Wikidata: identifiers in other databases, things like that. Let me show you another sort of operation you can do in OpenRefine. This is the column with the directors of those films, and I can try to cluster them. What does that mean? We are going to look through all the values in this data set and try to detect whether some of them might refer to the same entity. When that's the case, you often want to normalize those to one consistent spelling; that's very useful, typically, as a first step before reconciliation. So those are samples of the canonical values you could use. Let's say I want to accept all of those suggestions as valid clusters. Okay. So those are the sorts of things you can do in OpenRefine. Now, what do I mean by making this tool more reproducible? Imagine you're a researcher working on some data that you've collected. You're cleaning it with OpenRefine as part of your research process, and at the end you want to publish a paper about what you did and make your research process transparent. You want your fellow researchers to be able to inspect what you've done in OpenRefine, and ideally even reproduce it on a similar version of the data set. So, what can we do for now? The best thing we have for this so far is our undo/redo tab. As you can imagine, it's primarily designed for undoing things that you've done, but it also happens to list all of the operations you've done so far with OpenRefine.
So you could try to copy and paste this into your research article as a way of saying: this is what I did. Now, this is not exactly ideal, so we are working on improving this part of the tool. And before we even get to reproducibility per se, there are already a lot of usability issues with this interface. That's where it's been very interesting to work with a designer on this project who was not familiar with the tool before she came on board: she was really able to come in with a fresh eye and identify things that I couldn't see anymore, because I've been looking at this for so many years already. For instance, here, it might not be clear to everyone that you can actually click on those previous steps to go back to them. We don't have any undo button in OpenRefine; we only have this weird undo/redo tab where you can't really click on "undo" or "redo", things like this. So it's been really eye-opening. What else can you not do? Well, say I realized that this match here was wrong and I want to undo just this operation, but keep all of the following ones. There's no good workflow to do that, but it's very often requested. So let me now show you what we can do with those extract and apply buttons here. I'm going to roll back here, and if I click extract, I get this interface where I can select the operations I'm interested in, and then I get some code for them. This big blob of JSON is something I can copy and share as the representation of those operations, and I can also reapply them later on this project or another one. Now, the problem is that it's very hard to work with this representation. It's very unreadable, and it's also very brittle: for instance, if the column names of your new data set do not exactly match the columns in the original data set, you will get horrible errors and it will be very hard to do anything with those operations.
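To make the "big blob of JSON" concrete: the extract dialog produces a list of operation objects, each identified by an `op` key. Below is a small hand-written illustration (the column names and edits are invented; the field layout follows OpenRefine's serialized operation history):

```python
import json

# Two common operations as they appear in an extracted operation history:
# a column rename, and a mass edit such as the one produced by accepting
# clustering suggestions.
operations = [
    {
        "op": "core/column-rename",
        "oldColumnName": "Realisateur",
        "newColumnName": "Director",
        "description": "Rename column Realisateur to Director",
    },
    {
        "op": "core/mass-edit",
        "engineConfig": {"mode": "row-based", "facets": []},
        "columnName": "Director",
        "expression": "value",
        "edits": [{"from": ["J.-L. Godard", "Godard, J.L."],
                   "to": "Jean-Luc Godard"}],
        "description": "Mass edit cells in column Director",
    },
]

blob = json.dumps(operations, indent=2)  # what you would copy and share
print(blob)
```

The brittleness the talk describes is visible here: both operations hard-code a column name, so they fail on a data set whose columns are named differently.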
So that's the core of what we're trying to solve: providing a better representation for those operations so that you can understand what they are and also reapply them reliably. So as a summary of the main goals of this project: make the basic undo-redo functionality just more usable. Then make this reproducibility also easier and effective, because we want those representations of operations to be reliably applicable. And also add this advanced undo functionality of undoing not just the latest operation, or maybe just modifying the parameters of an earlier operation. So those are the main goals. And what do we have so far? Well, you might have already noticed some differences in this prototype, but let me show you another one. So far I've been working on making OpenRefine operations aware of which parts of the data set they modify. Because the problem is, if you want to let people undo a deep operation, then you need to be able to detect which following ones can be kept or not, or if they need to be recomputed because the data they were working on has been touched. So now that we have this capability of scoping operations a little bit better, you can, for instance, run reconciliation on multiple columns and that will run concurrently, which is something that wasn't possible before. So you see the reconciliation I started earlier, it's only 7% complete. It's a very slow operation, because Wikidata reconciliation is particularly slow. And now I can already start reconciling the other column. And as you can see, we already get some results, although the first one hasn't completed yet. So that's already one win. It's not directly about reproducibility, but I hope this will be welcomed by users because it should save people a lot of time. And on top of that, we've done some research about how other tools represent pipelines or their undo-redo functionality. So this is a screenshot from Talend, another data cleaning tool that we've been looking at.
And in those sorts of data cleaning tools, you design your pipeline explicitly on a canvas. So it's a very different sort of user experience. But we've also been looking at Excel, how they let you track changes, or the basic undo-redo functionality in Google Sheets, things like that. So that's been also very interesting in trying to get some sort of user experience that our users are already familiar with. So as you can see, this is all work in progress. What I have here is a prototype. We don't have full answers to all of those questions yet. But we're working on this, and we are very keen to hear from you. So if you're interested in those topics and would be happy to test out some ideas with us, we're running some user testing sessions. So you're very welcome to sign up for those. And that's basically the state of the project. And I also have some OpenRefine stickers if you happen to organize some training events in various places. So do also get back to me if you want some. Thank you. Thank you. We can maybe take one question. Thank you for the presentation. So it's an interesting piece of software. But what exactly is the target audience? Because I mean at some point, if you have a data-wrangling script, it does the job. Don't get me wrong, it's interesting. But just to know who exactly you are targeting. So the question is, what is the target audience of OpenRefine? So it's a broad range of communities. I would say it's generally suited for tasks where you can't really just write a script upfront which will do your cleaning. And it's not really about whether you like programming or not. It's just that for some tasks you need to be looking at the data while you're doing the cleaning. As you saw with reconciliation, it's a messy thing. You can't really just come up with the parameters and make the matching. You need to be looking at the data. Same for clustering.
So it's a mixture of interactive data cleaning and a little bit more automation than you would have in Excel. So basically the point here is the point-and-click aspect of the operations for the user. Let's thank Antonin again.
Qadence - A library for Digital Analog Quantum Computing
All right, folks, we're going to start. David, it's you. Hello. Hi. I'm David, or Yoric. I work at Pasqal. I'm going to tell you a few more words about that in a minute. And I am here to tell you about an ongoing work at Pasqal called Qadence. And as you can guess from the name and possibly from the logo, it's related to quantum computing. So before I proceed, I would like to stress one thing. None of the things I'm going to tell you about are my own work. For one thing, I joined Pasqal recently, with a background in programming language theory, compilers and things like this. And this project has not reached the stage where we can use programming language theory or compilers just yet, but maybe someday. So a few words about Pasqal. What do we do? We build qubits. More generally, we build quantum computers. We build quantum algorithms. We build quantum tools. We build quantum teaching materials. I forgot to mention we are a private company, but we are a spin-off from several laboratories. So there is a strong research background at Pasqal. And importantly for today, we build open source tools related to quantum computing. And if you're interested in knowing what the inside of a quantum computer looks like, well, that's part of the inside of one of ours. I think this one is called Fresnel, but I'm not sure. You can see lots of lenses, which suggests that lots of lasers are involved. Yes, lots of lasers are involved. We're not generally allowed in this room because of the class 4 lasers. Way too dangerous. Still, cool to have. So if you're like me, you might have a question. What the heck is quantum computing? I mean, we all hear about it. A little bit. Well, I hear about it every day, but I'm paid for that. But we hear about it in mass media and everywhere on LinkedIn, etc. It's still not clear. At least it wasn't clear to me. It might still not be entirely clear yet. What is quantum computing all about?
So the first thing is: quantum computing is about computing with qubits, not with bits. An important part of it is that quantum computing is very much research. You may have seen many announcements, each of them nicely informing us that the last few problems in quantum computing have been solved. I'm sure that we are going to see these announcements for the next 5 to 10 years. Quantum computing is currently a very active research domain, but it's a research domain. And while there are companies that are actually building quantum hardware, we are not there yet. It's not something you can buy at the local shop, or even if you go further down the road. And it's probably going to be a few years before we can do anything really useful, except in a few domains I'm going to mention a bit later, with quantum computers. Still, it's extremely exciting. And when I say it's open research, it's open research for the hardware, and it's open research for algorithms. And these algorithms most of the time are designed based on mathematical models of quantum computing. There are a few algorithms, but not many, that actually run on quantum hardware. And there is lots of research on compilers and tools, but again usually based on mathematical models and simulators. There is lots of hype too on quantum computing. On the upside, it means lots of credit for quantum computing, lots of funding, which is why companies such as Pasqal and a few others can do their work. It's also thanks to this that a number of academic laboratories can do their work. And it's a good time to be working on quantum in general and quantum computing in particular. It makes things a bit complicated when you have to read a press release, though: it's a bit hard to understand whether the new problem that has been solved on a mathematical model has been reproduced in labs or is actually ready to go into production. Why do we care about quantum computing?
Well, we do care about quantum physics in computing anyway, because CPUs need to deal with quantum phenomena on a daily basis. One of the reasons why we cannot make CPUs that are much faster anymore is that we have hit some physical limits. I'm not exactly sure which ones, I'm not a physicist, but they exist. So we want to go for the next generations of hardware, and at some point you can either continue fighting quantum physics or try to embrace it. So that's one of the reasons. Another reason is that there are hopes that quantum computing will be faster. I say hopes because despite some papers, including a famous paper by Google a few years ago, we don't know yet. There are good reasons to be hopeful that for some classes of algorithms we will have something very fast, but we're not sure yet. Similarly, we hope that we can be energy efficient. I'm going to show you some algorithms later during this presentation. And there are good reasons to be hopeful that we could possibly someday replace entire data centers working on very specific algorithms with something much smaller. Again, this needs to be validated in labs and on industrial hardware. We're not quite there yet. And also simply because we don't know how to build new hardware at the moment. If you look at what's needed to train ChatGPT, or at least an old version of ChatGPT, and I assume it's worse now: if I recall correctly, they were using 10,000 boards, each of them carrying I don't know how many GPUs, each of them carrying I don't know how many cores, for the training part. And I don't know how long training lasts. So how we do it at the moment is we expend as many resources as we can, which is not something that can last forever. Again. So I mentioned bits: 0, 1, easy. Qubits: three-dimensional, more complicated.
Plus you have the question of whether the qubits are 0 or 1, which is a complicated phenomenon called measurement, and I'm starting to have a few intuitions about it, which probably means that I'm wrong. So there are two flavors of quantum computing. The first flavor is digital quantum computing. This is a program in digital quantum computing. If you look at it, you'll see something that looks very much like a circuit. Well, that's why it's called a digital circuit. You have quantum data coming from the left, conveniently. All these RX, RY, RZ boxes are gates, which operate on the qubit; all the ones prefixed with R are rotations on the sphere. The X and Z gates (and I could have had Y as well) are symmetries of the sphere. There are other gates, but these are the ones I happened to have an example for. And at the end, you might be able to do some measurement, and in practice, you'll have to run your experiment many times because what you end up with is probabilities. So you need to measure probabilities by taking pictures, essentially, which means you have to take many pictures. So as I mentioned, a program is a circuit. And for almost 10 years, I think, there have been programming languages designed to create those circuits, or at least to give a syntax to the circuits and possibly to do modeling and simulation on those circuits. But the big snag is the hardware isn't there yet. One of the big difficulties that digital has is noise. I know it's not the only difficulty, but that's the one I remember, which is already good for me. Again, I'm coming from a different field; adapting is complicated. On the other side, you have analog programs. This is an analog program. This is actually part, I believe, of the test suite of one of our computers. So the test here is: hey, can we make a program that looks like our logo? Needless to say, it's probably not a very useful program.
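To make the gate vocabulary concrete, here is a tiny, hedged sketch in plain Python, not Qadence code: a single qubit is a 2-component complex vector, an RX gate is the standard 2x2 unitary exp(-i theta/2 X), and the "taking pictures" step corresponds to sampling from the squared amplitudes.

```python
import math

def rx(theta):
    # RX rotation gate: exp(-i * theta/2 * X) as a 2x2 unitary matrix.
    c = math.cos(theta / 2)
    s = -1j * math.sin(theta / 2)
    return [[c, s], [s, c]]

def apply(gate, state):
    # Plain matrix-vector product on the 2-component state.
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]

# Rotate |0> = (1, 0) by pi around the X axis: it ends up in |1>.
state = apply(rx(math.pi), [1, 0])
probs = [abs(amplitude) ** 2 for amplitude in state]
print([round(p, 6) for p in probs])  # → [0.0, 1.0]
```

On real hardware you never see the amplitudes themselves, only samples drawn from `probs`, which is why the experiment has to be repeated many times.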
But we need to manipulate things at a very fine level. So in practice, when you're dealing with analog, a program is not really a circuit, although it's also called a circuit and some parts of it we will model as a circuit. In practice, it's geometry and pulses. It might be different for other kinds of hardware, but I think the ideas are generally the same. When I say pulses, I mean laser pulses, so you have to set up a frequency, a shape, and things like that, which is a bit complicated. I'm not going to claim that I have any understanding of how it works. And why do we care about this? Well, there are two reasons. One of them is that this actually takes advantage of the hardware. It maps extremely naturally to hardware constraints and to some classes of problems. So off the top of my head, there are a number of graph algorithms that map very naturally to this. I showed you a two-dimensional representation, but it could also be three-dimensional. So graph algorithms, and a number of optimization algorithms. I'm going to show you a little example later. And if we have a problem that maps naturally to an analog circuit, the big advantage is that this is something that you can mostly run today on some machines. Not everything can be run, but we're much closer to this than in the digital world. And one thing I should mention, if you are familiar with the history of computing: well, every computer nowadays is digital, but before World War II, there were already many computers and they were pretty much all analog. So if you look at the battleships of the UK, US, French, and German navies, they all had onboard computers that were electromechanical and that were used for aiming precisely. So they were computing ballistic trajectories. It worked before we knew how to do digital, and it worked because this specific problem that they wanted to solve had a very nice physical, electromechanical representation. In the end, they disappeared.
It took a few decades for them to disappear, replaced by digital, because digital was so much more generic, but it took lots of time for digital to catch up with analog. So this justifies why we are interested not just in the digital, which is going to be much easier to program once it works, but also in the analog, which might give much better results in some specific cases and which is much closer to being something that we can actually use. Of course, the problem is: how do you program that? I mean, that logo was not very intuitive. Well, it's easy. Well, no, not really. And I apparently accidentally removed one of my slides, which was a big differential equation, which showed on one side the interactions between atoms and on the other side the interactions with the laser itself. I have no idea how someone can go from this differential equation to actually writing an algorithm, but some people succeed and they have my complete admiration. Anyway, that's why we, and when I say we again, I mean they, have devised Qadence. Qadence is a toolkit. It's designed for experimenting. You can experiment both with digital circuits and with analog circuits. You can mix them. Once you have written your circuit, you can simulate or execute it. When I say simulate, the word is a bit overloaded, but I mean an emulator running on your CPU or GPU that's going to pretend that it's doing quantum physics, usually at a fairly deep level. You can pick a level. Or execute: well, if you end up in the subset that actually runs on the machine, the one that you need big glasses and a lot of care to look at, we have a few of them. They're not really in the basement, but we do have them. So if you end up with this, you can compile your program to essentially a sequence of laser pulses and then send the laser pulses to the computer for execution. We do that because there are many experiments that still remain to be done. We're not quite there yet.
One of the reasons, and I'm putting it first because that's the one I'm most interested in, but it's not necessarily the main reason, is that this is the kind of thing that can help us find out how to design a programming language that is both usable, ideally by human beings, and also executable on the hardware, which is something that doesn't really exist at the moment. Another thing is, even without that, just having some abstractions on top of laser pulses helps; for instance, we have libraries of geometries, and that makes life easier when you don't have to actually solve that differential equation all the time. An interesting aspect of simulating and executing circuits is that we can run optimizations, for at least two different meanings of optimization, one of them being how we deal with noise. Noise is a big problem with quantum computing: if you put your atoms too close to each other, they're going to interact; if you put them too far away from each other, they're not going to interact. How do you send exactly the data you want, and not the data you don't want, from one to the other? So that's the kind of thing we can simulate using Qadence or lower-level tools, or possibly other tools. And the other thing is something I'm going to show you very soon, which again might work. So at some point, I assume that some people will ask questions; don't be surprised if my answer is, I have no clue. Okay, so let's look at a few demos. So this is an example of a graph. Let's re... yeah, okay, this is a random graph. We want to solve the MaxCut problem. It's a well-known problem in graph theory. The detail is not extremely important. We want to find the best place to cut the graph according to some criteria. So this can be reformulated as maximizing this value. And someone, I was sure I had written my sources somewhere, okay, so someone has devised an algorithm to do that. Sorry, I didn't sort my sources.
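For small graphs, the quantity the algorithm maximizes can be checked classically by brute force. This is a hedged toy sketch of my own, not the demo's code: a cut assigns each node to one of two sides, and its value is the number of edges crossing between the sides.

```python
import itertools

def cut_value(edges, assignment):
    # Count edges whose endpoints fall on opposite sides of the partition.
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

def brute_force_maxcut(n_nodes, edges):
    # Try every 0/1 assignment of nodes to sides; feasible only for tiny graphs.
    best = max(itertools.product([0, 1], repeat=n_nodes),
               key=lambda a: cut_value(edges, a))
    return best, cut_value(edges, best)

# A triangle (0-1-2) plus a pendant node 3 attached to node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
assignment, value = brute_force_maxcut(4, edges)
print(value)  # → 3 (a triangle can have at most 2 of its edges cut, plus the pendant edge)
```

The quantum approach replaces this exponential enumeration with a circuit whose measurement statistics concentrate on high-value assignments.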
So this starts by waiting, yes, after the wait. So we derive a circuit from the graph. So there are as many nodes as edges, if I recall correctly. And we do a number of operations whose objective is to eventually make some configurations more likely than others. I couldn't tell you exactly how it works. Many operations, many, many operations. Yeah, and in the end, we can measure stuff. So once we have this, we can represent the quantity we want to maximize as an optimization problem. For one of the many different... what? Okay. Demo effect. Hop. And so this code is basically PyTorch, for people who are familiar with PyTorch. And then we can run what we call training in that case. So we can run the optimization problem. So what we're going to do is iterate. There is a theorem in the paper, which I forgot to cite, that shows that this computation is eventually going to converge. There's no guarantee that it does after 100 iterations. But in practice, for a demo, it seems to work. And if we pick the configuration that was most likely, again, there is this problem with the cat which might or might not get out of the box. If we pick the configuration that is most likely, it happens to map to the solution that we're looking for. And here, so we need to cut in such a way that... something, something. I don't remember exactly how to read this result. But the interesting part is: hey, quantum algorithm, give me the grants. So that was a digital algorithm. I'm going to show you something that has a very similar outline. We want to fit a curve. So we're just going to take the curve x maps to x squared and see if we can teach a quantum circuit to basically represent this curve. For this, we're going to use a quantum circuit learning algorithm with an ansatz, which exists. And basically, we're going to try and optimize a number of parameters, a number of angles here, and see what we can do.
So again, let's define our circuit. What is going on? It was working this morning. Yes. Yes, no more error messages. Okay. Okay, so this is the initial state of our quantum circuit. The dots are the samples that we want to approximate. And the curve is the initial result. As you can see, it's not exactly a perfect match just yet. So we're going to run a few steps of the learning algorithm. So this one is just pure PyTorch, just regular optimization. And usually it works. Normally it works. I'm going to pretend that it has worked and I'm going to start. Yep. What the? Yeah. All right. So after a few steps of learning, this is what we get. We have an orange curve that, while not absolutely perfect, actually matches the blue dots fairly well. So okay, it's not time to call the Nobel committee yet. But this has applications. Of course, this is a very simple example for a very simple curve that we want to fit. But if you look at it with a little tolerance for approximations, this is the kind of thing that neural networks are doing. The learning phase is something kind of like this. In fact, there is an entire subdomain of quantum computing: quantum machine learning. And this is, I believe, one of the simplest algorithms of quantum machine learning. If you look at the API documentation of Qadence, you will actually see a QNN module, for quantum neural networks. And this is a very active subfield of an already very active field. Because if the models we have of energy use and computational power are correct, this means that hopefully we could replace these tens of thousands of cores used by ChatGPT, or whatever its competitors are named, by something that consumes way less energy and hopefully runs at least as fast. So, time to reach conclusions. What do we have?
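The training loop in the demo is regular PyTorch optimizing circuit angles. As a hedged classical analogue (my own toy code, with no quantum circuit and no PyTorch, just gradient descent on two parameters), the same shape of loop looks like this:

```python
# Fit y = x^2 with a tiny parametric model, y_hat = a*x^2 + b,
# via gradient descent on the mean squared error. In the quantum
# version, the trainable parameters are rotation angles instead.
def fit(samples, lr=0.1, steps=3000):
    a, b = 0.0, 0.0  # trainable parameters, starting from a bad guess
    n = len(samples)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in samples:
            err = (a * x * x + b) - y
            grad_a += 2 * err * x * x
            grad_b += 2 * err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# 21 sample points of the target curve on [-1, 1].
samples = [(x / 10, (x / 10) ** 2) for x in range(-10, 11)]
a, b = fit(samples)
print(round(a, 2), round(b, 2))  # converges to a ≈ 1, b ≈ 0
```

The structure (sample the model, compute a loss against the dots, update the parameters, repeat) is exactly the loop the demo runs; only the model being differentiated is different.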
We have a toolkit designed for exploring the design of quantum circuits, both on hardware that already exists, on hardware that we believe is possible and might come into labs or out of labs within the next five years, and on purely hypothetical hardware, because why not? Experiments are interesting. We have this mechanism of circuit optimization, which I showed you. I showed you how it could be used to solve problems or to approximate curves. It also has other applications, such as the problem of noise. I mentioned noise between atoms, for instance. Sometimes you want to optimize based on noise models and make your things work, because you know that your model isn't perfect, or at least your high-level model isn't perfect, and you want to go to a lower-level model. And again, it's not a programming language, but I hope that maybe someday it could serve as the beginning of one. There is ongoing work on enriching everything: writing libraries for domain-specific problems, for known algorithms, for geometries, etc. There are many questions. There is ongoing work on compilation, on the subset that we already know how to compile and on larger subsets. And of course, we're trying to make this easier to program. And when I say we, of course, I mean them. There was a paper recently accepted and presented at PLanQC. If you are interested, it's on the last line here. And all the documentation and the source code are on GitHub. So thank you for listening. APPLAUSE We have like four minutes for questions, my friends. I'm sorry, did you catch it? Was there any attempt to implement the circuits that you mentioned on actual hardware? Let me repeat the question for the mic. Yes, the question is whether these particular circuits have been implemented on hardware. The answer is: I have no idea, I'm sorry. LAUGHTER I believe... No, sorry. I'm not going to say random crap. I don't know. Right now, the main use case is experimenting with this.
But again, for the second algorithm, for instance, if we can manage to make it scale to a large number of curves and more complicated curves, there is a potential application to basically machine learning in general, not just artificial intelligence. And for the former one, I can't think of a specific example, but I know that graph algorithms are very interesting for many things, because, for one thing, there are good reasons to believe that they can be executed on existing or almost-existing hardware. And there are many important problems that can be modeled as graph algorithms. For instance, we are in an energy crisis at the moment, and energy distribution problems are graph algorithms. I've heard of people who want to work on that; I have no idea whether they actually work on it. Also for modeling traffic in cities, things like that. I couldn't tell you more than that. Okay, I think we should wrap up. Thank you very much. Thank you very much.
Welcome to the EU Policy Workshop Devroom
So good morning everybody, welcome to the Open Source in the European Legislative Landscape devroom. I have a confession to make, which is that we applied for this devroom two days before the closing deadline, and we have made it up as we went along after unexpectedly being awarded a devroom. So the whole day is very organic, but it has a very important purpose. We've discovered that the European Union has noticed that devices contain software and that the software needs regulating. And they have started doing an amazingly effective job at writing software into regulation. So one of the people we have with us today, Benjamin Bögel, wherever Benjamin is, he's presumably, I know he's here but he's hiding. He was involved in writing the NIS 2 directive and then he went on to write the CRA, and he is surprisingly expert if you have a low opinion of EU policy officers, or unsurprisingly expert if you know that they're all generally brilliant people. However, we discovered that the EU's model of what open source is was that it is low-quality components full of defects that are created by hobbyists in their basements. And the regulations rather reflected that. And so we found over the last year that it was very valuable to engage with the regulators. Today what we want to do is not talk about the technical details of any regulations but rather gather the feedback of the open source community, so that we can document the reflections and outlooks of the community for the benefit of the Commission as they go forward in regulating within their digital agenda. So we've arranged for there to be four workshops today. The first workshop, starting in six minutes, is a workshop on the consequences of the Cyber Resilience Act and the Product Liability Directive. Then the second workshop, which starts at 11.15, is going to look at how we engage with policy makers as a FOSS community.
The third workshop, which is at 1.20, is going to look at how we can assist in getting more free and open source software in use by public administrations. And the fourth workshop is going to look at how the free and open source community can come alongside the task force that is implementing the DMA and the DSA and promote interoperability, given that the best path to interoperability is not standards but rather the implementation of standards in shared open source packages. So that's our agenda for today. We have some ground rules that you'll see again during the day. First of all, we encourage you, if you are like me and you talk a lot, to maybe talk less and to encourage and leave space for other people to express their opinions. We encourage you to always be holding the microphone when you speak in a session where notes are being taken, and that is all of them, because today we have four rapporteurs for the workshops. The rapporteurs will be listening to what's said, noting down the substance and writing a written report for us to send to the Commission after the workshop. When you do start speaking, please make sure every time that you indicate who you are and, if you have an affiliation, what your affiliation is. Please note that this is a very complex topic, and we know that it's a very complex topic, so please be open to new ideas. When we run into an intractable problem, let's note it and move on to something we can fix rather than obsess about the obstacle. And finally, there are two ways of looking at this: please observe the FOSDEM code of conduct, or, if you prefer, let's have fun and make new friends.
CRA & PLD: [begin workshop] How will the open-source community adapt to the new EU Cyber Resilience Act and Product Liability Directive
I'd like to hand over to the chair of the first panel, which is Maarten Aertsen from NLnet Labs, who is going to lead what we do next. Maarten. Thank you, Simon. So welcome to the first block of the day, which is about the CRA and the PLD. You just heard from Simon how the structure generally works. I will say a couple of words about how this block will work right now. So an important person during this session will be our rapporteur, who will be writing down all the things that the speakers say, but also perhaps the things that you will bring in, because the idea of these sessions is to actually have some interaction. For this session, that will be Marco. Marco will be our rapporteur, and at the end of the block, he will summarize what he learned today. So for the agenda of this particular block, we will have two lightning talks. We will have a panel, a workshop bit where you can actually do something yourself, if you haven't already, by asking questions. We will have a third lightning talk, and then we will close with the rapporteur's summary. So that's our agenda until about 11.15.
CRA: 40 new ways the CRA can accidentally harm open source
Hi, so my name is Tobie Langel. I run a small consulting firm based in Geneva, Switzerland. And I have kind of straddled open source and standards throughout my career. So people thought it was a good idea to bring me in to talk about this. So this lightning talk is called 40 New Ways the CRA Can Accidentally Harm Open Source. And that of course references the 40-plus harmonized standards that are going to be written in the next couple of years to essentially make it possible to implement the CRA. So the first thing I want to say is: the CRA has landed. It could have been really, really bad. A lot of us were really, really concerned. And it turns out that it isn't. The first thing is the open source community rose to the occasion. And I think that's really amazing and it was beautiful to see. A lot of people put a lot of work in, and I think we should all be very thankful for the work they have put into helping us. And then also policy makers actually paid attention, listened, and considered the input from the community. And for this too, I think we ought to be really thankful. So thank you to both sides for making this happen. In the process, we avoided harming open source pretty seriously. And we also avoided harming the EU's ability to leverage open source, which was another one of the potential risks of the original versions of the CRA. So we do now have a lot more clarity. There's an asterisk there because lots of people still have lots of questions, myself included. My key takeaway from the last version of the CRA is that the responsibility falls in the right place, i.e. with the people monetizing open source, the companies monetizing open source. So for me, this is really important, and it's great that this is spelled out really clearly in the last version.
And then the other thing that I thought was really interesting is the open source stewards, this new notion of open source stewards, which really institutionalizes the foundations that have been playing an important role in our space. And it's also, I believe, a really smart instrument for the EU's ambitions around sovereign tech. That said, it's going to have industry- and ecosystem-wide impact. I think companies will be a lot more cautious. I will certainly advise my clients to be more cautious. And a lot of projects will move to foundations, and I think they will do so earlier. And then the conformance requirements are going to climb up the dependency tree. And so essentially, I'm suspecting pretty quickly most of the ecosystem will actually be subject to some parts of the CRA, probably the lighter version that is for open source stewards. And I do have a question, which is this: this is going to create a lot of financial and work overhead, and I'm still kind of wondering who's going to be paying for this. So I think this is a question that will need to be dug into a little more in the future. So to meet the CRA, there are essentially going to be two options. Either you demonstrate conformity by yourself, so the burden of proof is on you, or you essentially follow a set of standards, the harmonized standards, and this is going to provide presumption of conformity. So in fact, the standards are going to be how the CRA impacts open source, because that's what everyone's going to do: follow the standards so that they can be presumed to be conformant. And so 40-plus standards, that's 40-plus ways things can go wrong. If you believe that the standardization process is less opaque, easier, or more open-source-community-friendly than policymaking, I have bad news for you. And so essentially the same kinds of misunderstanding and risk that we caught in the CRA are probably going to crop up in 40 different standards.
Actually sitting in 40 different rooms to make sure that 40 different standards don't harm open source in weird and unexpected ways is a lot of work. So I mentioned the opaque standardization processes. Also, open source has special requirements. Things have to happen in the open. There cannot be patents around the standards. And not every standards organization functions in an open-source-friendly way, to put it mildly, when it comes to how they deliver the standards and how unencumbered by patents those standards are. So that's also something that will be incredibly important: to make sure that the open source community can actually have access to those standards and be able to implement them. The two last points: there's a huge diversity of open source stakeholders, a lot of which were very poorly represented in the CRA process, even though the open source community was there. So the stewards were there, obviously, and they were very much involved. But hobbyists — it's very hard to actually represent hobbyists, right? And small commercial open source startups that are going to be incredibly impacted, including in the EU, because they will be considered manufacturers — rightfully so — probably don't have the resources or the know-how to be involved in the process. And the last point is interop with other jurisdictions. One of the huge strengths of open source is the fact that licensing is essentially standardized worldwide: the MIT license means the same thing here and there, roughly, sufficiently that it's okay. If we start having security standards that are different across different jurisdictions, it's going to be a huge burden on open source maintainers and open source developers. We want to make sure that if you comply with whatever the EU comes up with in terms of standards, it's fairly similar to what NIST is coming up with in the US, etc., etc. And that's it. Thank you very much.
PLD: When software causes harm – who pays and why?
Okay, so I'll introduce the next lightning talk, which is about the second big legislative effort that went on during the past couple of years — smiling at one of the people that worked on it in the European Commission — which is the Product Liability Directive. And with us today is Rob Carolina, who is the General Counsel for ISC, makers of BIND, and who's going to give you an introduction to product liability in five minutes. So, take it away, Rob. Martin's original idea was: do the product liability thing in three minutes, and then you can do some other stuff for two. So what I'm doing here is giving you a reading test, and I'm trying to condense down to two and a half minutes a topic that we spend about 40 to 60 hours on in law school. The reason I'm giving you this reading test is because I want you to be familiar with this fact pattern. I'm going to tell the story in reverse from how I usually do it. This is a story about an automated car that hits a pedestrian in Ireland, Pat Victim. That car has on board a piece of software called Bravo Drive, which includes within it a piece of software called Open Sesame. The car was imported by Exotic Imports. The car was manufactured by Einstein Motors in California. Einstein Motors got the Bravo Drive software from Bravo Bits BV in the Netherlands, and Bravo Bits BV got the open source Open Sesame from Firefly ApS in Denmark. Terry Dastardly hacked into the automobile because of a weakness in the authentication package, provided a few inputs, and the next thing you have is a car that runs over Pat Victim in Ireland. Don't worry about Terry Dastardly. He or she dies in a horrible paragliding accident, or is without money, or is run over by a bus — just take them out of the equation. The question that product liability seeks to answer is: in a situation like this, when we have an injured victim like Pat Victim, who pays for their injuries? Two slides that look like this.
This slide is designed to teach you the difference between two different legal theories on how you sue people who manufacture things. The left-hand side is the law of negligence, at least as it's practiced in common law countries. I would not come to a civil law country and teach people about the Napoleonic Code. However, I will talk to you a little bit about common law and suggest that the two are not worlds apart. As you can see from the chart, when our victim tries to sue all these various parties — Exotic Imports, Einstein, Bravo Bits, or whatever — the victim is in a little bit of difficulty, because the people who manufactured and imported the car did everything reasonably. They selected good components. They selected trustworthy producers of things. They did not act rashly. Whereas the error in the situation came from a software vendor called Firefly, and maybe, just maybe, we could establish that they owed what's called a duty of care to the victim — if someone like Pat Victim was a foreseeable victim when someone wrote this authentication package in Denmark. But as you can see, it's going to be difficult to establish that. Now, in the reading test that I gave you one slide ago, I did put in there that the folks at Firefly had a bad week. The problem with their package was that someone made a coding error and the QA people were kind of asleep that week — we're going to get that in a forensics report from an expert who's going to come to trial. The right-hand side of this slide is designed to teach you a different area of law, adopted in the U.S. in the 1960s and in Europe in 1985, which says: what do we do in situations like this, where everybody acts reasonably but Pat Victim still has injuries? And the answer is, we don't look for people who did things unreasonably. We don't care how careful they were, how cautious they were. We look for people who manufactured and put into circulation a dangerous product.
We tried really hard to make it safe — it doesn't matter. If it's dangerous, you're liable; it's called no-fault liability for this reason. And as you can see, because the automobile manufacturer and the importer — and this is the law as it exists today in Europe under the 1985 directive — were dealing with a product that is dangerous, they will be strictly liable, but the software vendors will not, because software has not been deemed to be a product. Enter the PLD, which changes things on the right-hand side of this chart. And as you can see, what happens here — one of the design characteristics of the PLD, and the origin of these slides, by the way, was a talk I did at ETSI five years ago which said this is coming, so I keep using the same slides for five years and they're still accurate — is that we recharacterize software as a product, and now we can attribute liability to Firefly because they distributed a dangerous product, a piece of authentication software that didn't work properly. We'll just leave it at that for right now. And since we're running a few minutes ahead, I have one last slide that I'll show you, and I'm just going to hold on this for 60 seconds while you read it. If you're looking for a copy of this, I just posted it half an hour ago on X and on LinkedIn. So whatever the answer is depends on what questions we're asking. I know the question I'm asking — I'm the guy on the left. It appears the questions on the right were the questions asked by the European Commission. And that's how we have the answers that we're talking about today. Thank you. Thank you, Rob.
CRA & PLD: panel
Okay, so welcome back to this session on the CRA and PLD block. We are having a panel with some of the people that directly wrote the pieces of legislation we're discussing in this block. To my left we have Benjamin Bögel, who is working for the European Commission as Head of Sector for Standardization and Product Security. I almost did it right. Next to him is Cheuk Ting Ho, who is a Director for the Python Software Foundation; we really wanted to get a community perspective on this panel, which is what Cheuk will provide, and also what Cheuk will challenge you to help us provide, because that's kind of what we are trying to do here. And finally we have Omar Enaji, who is Policy Officer for DG GROW and has worked on the Product Liability Directive for multiple years now. My name is Martin and I will try to ask some questions. You will be asking the really clever ones; I will be asking the other ones. So let's get started. I would like to ask our panelists to do a real quick introduction, specifically to answer the question: what does implementation of these laws mean to you? Because we've been over the proposals, we've had the negotiations, and they're about to be confirmed by Parliament. So what this panel is about, really, is looking forward. We're not doing the negotiations over. We're now looking at when these will actually hit Europe and what's needed to get there. Thanks a lot, Martin. So for the Cyber Resilience Act, I mean, the text isn't final yet, right? So we don't know exactly when it will enter into force. As I said yesterday, sometime around the middle of 2024, maybe a little bit later. And then we have a three-year transition period. So manufacturers — hardware and software manufacturers — will have to start applying the rules roughly around June 2027. That gives us three years during which we can prepare for the implementation. We just had this fascinating presentation on the 40 standards, right?
So that's going to be a huge part of our work, helping the European standardization organizations with the standards. We will also have to produce guidance, of course. And thank you actually very much for inviting us here, because I think these are the venues where you get all the tricky questions that need to be answered in the guidance, right? Because of course the CRA is a high-level piece of legislation. It will not provide an immediate answer to every edge case that you may have. So I think this is where the guidance really comes in. And we want to be inclusive in this process. We want the community there — open source, single vendors, everyone. And we're really looking forward to this process. Thank you, Martin, as well. So for the PLD, it's a bit different from the CRA, because it's a directive and not a regulation, so it requires transposition at national level in each member state before the law will be applicable in each member state. So that will be 2026, in theory around June or July — it will depend on exactly when the Parliament gives its vote. And by then the liability rules will be kicking in. So yeah, that's roughly it. Would you mind spending a little more time on the difference between a regulation and a directive? Because we appreciate that a lot of you may know a lot about software, and we also think some of you may not know a lot about EU lawmaking. So can you? Yeah, so, just a quick legalistic view: at EU level you have three types of acts — a regulation, a directive, and a decision. A regulation and a decision are directly applicable at national level; the law remains the same. A directive requires transposition, and the transposition is basically incorporation into national law. You will have 27 different laws that basically say the same thing.
But because of the particularity of the directive, it would also require changes in some other parts of the national legislation. A directive requires implementation along with the incorporation; a regulation only requires the implementation, and it's directly applicable as a regulation into the national laws, while the directive needs to be incorporated to be applicable. So you will have the central piece of legislation, but for the rest you will have national laws that will tell you, or give you, the answer. And the role of the Commission during the transposition — the two-year transposition, that's why there is a deadline for it — is to check each national law to ensure that there are no mistakes and that nothing goes against what the main legislation says. So that's the big picture. So my next question is about your personal experience trying to express FOSS into law, or to interact in the EU policy space when you may have previously focused on the developer space. So a different question for each of you. For you, what was it like to work on a policy topic as someone who is very knowledgeable about software development? And for each of you, what was it like to work on a topic with the nuances that open source has in your policy? So first of all, again, my background is very similar to a lot of developers. I'm closer to a software developer than a policymaker. So for us, I think we have a lot of concern about whether I will be liable. I mean, maybe I've created some fun stuff. I publish it as open source because I want to share it, but then you have no control over who is taking it and doing what with it. For example, the car example: maybe at the beginning, when I created this project, I was not expecting someone to use it in a car and then the car hits someone. So that is something that I think a lot of developers have in their mind.
There's a bit of worry that now, if this happens, will we stop publishing anything anymore? That would affect the open source ecosystem quite a bit. And also, for example, if you're working for a company, maybe your company would tell you not to do it, because the company doesn't want to get involved in your hobby project that may get into trouble. So there is a lot of concern, I think. And also, software is very different from hardware, right? With hardware, you can't make something in your backyard and then take it into production. But with software — the power of software is that some individual developers can still develop a piece of software that is very applicable in a lot of applications but is maintained with very limited resources. I think that makes hardware and software hugely different in terms of scale. You don't have enough resources to massively produce something in hardware, but with limited resources you can still massively produce things in software that a lot of people use, right? So that's the concern from a developer perspective. Yeah. Thanks. Yeah, I mean, for us, I think it was a huge challenge to adapt the existing European framework for product legislation — the new legislative framework, as we call it, or the CE marking that you're familiar with — to software and to cybersecurity, right? Because software is not a tangible good. It's different, and cybersecurity is also very special. It's not the same as safety. Usually we've always regulated safety; now, for the first time, we are regulating security, and I found that to be a huge challenge. I think we managed to get it right, but it was a challenge. What I really liked about engaging with the open source community is that you meet a lot of passionate people who really care, right?
So when we regulate other areas, you get to meet lobbyists who are simply paid to defend interests. Of course, you're also defending your interests, but on top of that, you meet people who actually really care about the things that they work on, and you see it's more than just a job — it's a mission for them. And I really appreciate that. Well, for me, it's a bit different, because the Product Liability Directive is about any type of product. So what I had is basically — take the CO2 example; you see, maybe it's a defective product. The idea is basically how to deal with the perfume industry, the car industry, with tables, with chairs, with vaccines, with pacemakers, with hats, with whatever you want — all of those industries. With the PLD, we didn't have a specific sector. We had all of them at the same time. And what we actually needed is basically to have people that could represent each of those sectors, to hear the concerns and what could work and what could not work. And I have to say that with the open source software community, it was maybe a bit harder to achieve that, because of the fact that you are all individuals — there is not really someone that represents you. (You need to speak a little bit louder, because this is not a mic; it's only for the recording.) So what was really complicated with the open source software community, for me, is basically that I could not have a single voice that could tell me what the full concerns were; I had different voices. And to be totally honest, the ones that were talking more about your issues were, let's say, the bigger ones, which I'm pretty sure do not represent you. So that was the main difficulty for us from the PLD perspective: to get what the real concerns are and how we reply to them. But at the same time — and we also have to be totally honest — the PLD is a piece of legislation made for victims, which is basically all of you, all of us.
So we needed to find the right balance: not too much pressure on the one that creates the product, but also not too much pressure on the person that actually suffered the damage. That is what we needed to achieve, and this is where we need your inputs — with your inputs, we can actually find the good balance, in a way. So I will be giving the crowd an opportunity to ask questions. If you have one, raise your hand and I'll get to you. I'll ask two questions to Benjamin first, so I can have a look around. So my first question, Benjamin, is about stewards: how can a steward know they're a steward? And my second question is: suppose they find out they're a steward, but they're not in the EU — who is the supervisory authority they are supposed to be talking to? Okay, so, you find out if you're a steward by looking into the law; the law defines the concept of a steward, right? It says if you're a legal person that supports a project on a sustained basis, and this project is ultimately intended for commercial purposes, you are a steward. The regulation also gives a few examples, such as foundations — I mean, not every foundation will be a steward, but if it meets those criteria, it's a steward. So you can look it up in the law. As I said before, there will be cases where it's maybe not as clear-cut, right? We hope that with the guidance, we can also address those cases. So I'm quite confident that at the end of the implementation process, people will usually know if they're stewards or not. Now, if you're outside the EU: the CRA is indeed a regulation, yeah? That means it applies across the entire single market in a uniform manner, and all the market surveillance authorities are responsible for you, essentially, yeah?
If your product — or your software — is published and accessible across the entire internal market, then all the market surveillance authorities will also be responsible for supervising you. So I will be walking into the crowd to get a question. I will be off camera, which is fine. So please state your name and affiliation, and your question if you have it. I'll hold the mic. Okay. My question is about Debian. There is a Debian association in France, and there is Software in the Public Interest, but these foundations only handle financial issues. They have nothing to do with code in any way or form. Are they going to be considered stewards? Yeah. So unfortunately, I cannot give legal advice on individual projects, right? Because if I get it wrong now, then it's a huge problem. So you will have to check for yourselves. What I can tell you is that we put some indications into the law for when you could be considered a steward. So for instance, if you are hosting the collaboration platform, if you are to some extent governing the project, if you take decisions on the project, or if you steer the development process, then you would be considered a steward. Taking another audience question. So please state your name and affiliation and then the question. I'll hold the mic. Thierry Carrez, OpenInfra Foundation and the Open Source Initiative. You mentioned the chilling effects on development and engagement from the open source community, and I think the main fear we have is that whatever legislation is created, it would prevent or discourage people from participating in the open source commons. And I think it's linked to the fact that any uncertainty will be interpreted in the worst way. So how are we going to — with 40 standards on the CRA side, and transposition in every country, 27 countries, on the PLD side — how are we going to have enough certainty for those people, for them not to have this chilling effect on their participation? Thank you.
I'm going to Omar first. Well, I think you can send an email to one of us. That's basically the first thing. I mean, we are open to having a discussion with anyone that has an issue on the ground, because we are not on the ground. This is how it works for every unit in the Commission: basically, everyone has legislation or a policy, and we receive feedback from people. Someone, for example, during transposition, would say: well, I'm in Spain and this is how the law applies in Spain, and I'm pretty sure that was not the main idea, because when I looked into the main piece of legislation, it says something opposite. Well, then it's the work of the Commission to realize that something is going wrong there, and then we enter into contact with the national authorities. That's for the transposition part. But if there are issues during the years of application of the directive, then we have what we call a review clause in each piece of legislation. Every three years or five years, you will have someone from the Commission, usually one of us, that will do the review — with a study, having interviews, taking all the evidence and proof. You collect all of it and then realize: okay, there is an issue that was not foreseen at the beginning. How do we solve it? There was a gap — how do we fill it? That was actually the same thing that happened with the PLD, which dates back to '85. It took 40 years to review it. Before that, we started the collection of the reviews and the proof, and we collected all the opinions, and this is where I say that maybe your community was the one that was not really involved in that, because of how the process is — but everyone has a voice and a seat there. Sorry, I want to ask a follow-up question. So I know that sometimes the White House will have an open call for suggestions and comments. Will you plan to do something like that?
Well, first of all, we need to apply it. But it is for sure that there will be one for the next review, which will happen — so it's two years, four years... it will be in six years. In six years, we will do a state of play of how it's applied, and then obviously we'll have to collect a bit of information, and we will have to check with people from the industry and the communities to see what their experience is and whether there are things that work or don't work. So that's how we will have to do it. I cannot tell you right now exactly what form it will take, but I'm sure that there will be one, because that's how it works for these kinds of things. Yes, so I would like to fork Omar's answer. I would like to add that, I mean, I don't think there will be a chilling effect on open source coming from the CRA, to be honest. I mean, let's be frank: open source is essentially outside the scope. Of course, there will be cases where manufacturers will try to place requirements upstream, right, and talk to upstream developers, but you are for the most part not covered by the Cyber Resilience Act. If you want to make sure that the transition goes smoothly, indeed, please do reach out to us. I think we've proven over the last year that we are a very approachable bunch here. We are taking your concerns seriously. We are going to do our utmost to find solutions — and we are even legally obliged to: in the CRA there is a specific provision that requires the Commission to consult the community. I mean, we would do so anyway, but you even have that reassurance that we have to do it. And yes, please do reach out. Just one thing on the chilling effect, because for the PLD we have experience with that. Forty years ago, I could show you the newspapers that were going all around Europe, from manufacturers saying that if this piece of legislation entered into force, there would be no products anymore in Europe. I'm pretty sure that this is not the case.
What the PLD did is basically give trust to people. When they buy something, they know that if something goes wrong, at least they will have their back covered. That is the idea of the entire piece of legislation. One practical comment I would like to make with respect to the question that was just asked: after the panel, we'll have a workshop, and one of the mechanics is that we want to ask you about your fears, but also your hopes and perhaps your solutions. So if you're listening to this and think, hey, I have these corner cases that I'm really worried about — make sure to remember them for like 20 minutes more and then put them to paper, because we're actually trying to collect these. I saw multiple hands. I'm first going to ask a question myself, and then I'll return to the audience questions. It's related to the PLD. In December, a political deal was reached on the PLD, and one of the things that was publicized by the MEPs looking out for open source specifically was that open source would not be in scope if it was not a commercial activity. And it was delegated to the technical level to implement this idea along the principles of the CRA. Now, when the text of the PLD became public at the end of January, what we saw was that there was a single, or maybe one and a half, PLD recital, while the CRA has seven, eight maybe. So I'm asking: is the PLD team that much better at writing recitals? Can we somehow use the nuance that was expressed in the CRA in the PLD, or are you going to offer guidance? Because I was a bit surprised. I was expecting more nuance, but maybe I'm wrong and you're the expert. So maybe a bit of a tough question for you, Omar, and I'd like to hear from you. So, I mean, I will ask you a very short question back: how many products exist in the world? Because what you as a community got is basically one full recital out of 47, while the PLD applies to millions of products. So I think in proportion you got quite a lot, actually.
The difficulty for the PLD is basically that, yes, there is a CRA that gives an explanation about open source software, but you will also have the AI Act for that. You will have all the types of legislation that will touch on the open source point, and we have to cover all of them at the same time. We cannot copy-paste from one single piece of legislation, because we apply to all of them at the same time. So the difficulty was really to find the right wording. I think, as you said, the MEP that you quoted said that the main idea is the commercial activity. And this is applicable for any product: any product that's actually been developed or supplied — mostly supplied — outside of a commercial activity is out of the liability regime. And that's what we wrote in the recital. It's basically restating the fact that if it's outside of a commercial activity, then you're out. But if you're in, that's where the PLD applies. We cannot create a specific regime for open source in the PLD itself, also because of the nature of the legislation, which has to be neutral: you cannot have very specific provisions about one single product, because each provision has to apply in the same way to any other type of product. That's a bit of it. The CRA would apply for cyber vulnerabilities, but then you will have the AI Act that would apply also for open source, and for us, we need to cover all of them — so that's why it's done this way. So I have a question that relates to work that will be a little bit out of your hands. For you, Omar, it's about the 27 member states that somehow need to take the work you did, make their own laws, and somehow understand the nuance of what open source is about. For you, Benjamin, it will be about Tobie's talk on the 40 standards.
What will you be doing for Cheuk and me and all the other people writing software — applying what you learned in the past 12 months, or maybe you already are — to help the people doing that work understand the nuance of what is essentially a niche of a niche, but also rules the world of products with digital elements? So what will the Commission do to help the community in these stages of the process? Okay. Yeah, I mean, so the Commission is not writing the standards, right? That is how it works. I think you also would not want us to write the standards, so it's probably a good thing that we are not writing them. It's the European standardization organizations. They are made up of national delegations from the national standards bodies, and these standards bodies often send representatives from manufacturers and from others. The Commission has basically three ways of being involved in that process. First, we are the ones drafting the standardization request, which is the basis on which the ESOs, the European standardization organizations, are going to work on those standards. So in the standardization request, we can already express our expectations of what the standards should look like. Then, although we are not going to be writing the standards, we are going to be there all the way, right? So we will be in all the meetings. We will listen to the conversations. We will give our views. We will answer questions on how things are to be interpreted in the CRA, and so forth. And at the end of the day, we also have to rubber-stamp the standards: they have to be cited in the Official Journal of the European Union, which gives them this power to grant presumption of conformity. So what I can reassure you of is that we are going to be there all the way. We are going to look at the process very closely. We are also more than happy to engage with those parts of the open source community that do have expertise in standards, right?
To find solutions to the issues that you may have. So again, I already said it a couple of times: please do reach out, and let's discuss that in more detail. Thank you. Well, I mean, my work is not done yet. As I said, the transposition will kick in as soon as the co-legislators have officially voted, which should happen either in June, July, or September in any event. And then after that, we launch the transposition period, which means basically that we will be receiving the 27 pieces of legislation piece by piece, or sometimes just the entirety of them, and we will have to work closely with each single member state to ensure that the legislation reflects exactly the directive. What we have as a tool in the Commission is what we call the infringement procedure. When the Commission realizes that a member state does not conform itself with a new piece of legislation, we can bring the case to the court to ensure that the member state applies it, or does it properly. As a small bit of background: the first PLD took some member states more than 20 years to properly transpose. So I hope we're not going to be there, but this is how it works from our side. And then once it's transposed, in any event, we will have to check constantly whether it is being applied well, because it's not only the transposition by the member state, but also how the jurisdictions will be applying the law. A national court is also a representation of the member state at EU level. So if there is a misapplication at that level, we would also have to intervene to ensure that it is done in conformity. Thank you very much. I will take two audience questions, one here and one there, and then we will continue with the panel if there's time. I will be holding the mic. Please state your name and affiliation. Alistair Woodman, representing two 501(c)(3)s doing open source projects.
As far as the PLD is concerned, do you anticipate that the market will support insurance policies to deal with this sort of thing? Or is it a non-goal or a goal to encourage insurance in this particular regard for non-malfeasant behavior? I think that was for you, Omar. So the PLD does not have any requirement about insurances. So everyone is free to do whatever they want. Basically, you just need to calculate your own risk. And once you know your risk exposure, you will know whether you need one or not. But it's not from outside that we do it. And to be also totally frank with you, as I said also yesterday, most of you here will never have a claim on the PLD. I mean, this does not happen every single day for each type of product. We have a few cases that can happen. You can have access to all of them. It's true that for software it's a bit more rare that this happens, because you have something that the traditional products don't. You can correct the piece of software before something wrong would happen. You know that there is a vulnerability. You know that there is maybe something defective inside. And then you will correct it with an update, and then you avoid having any issue. That's a bit more of a facility for you. And we will not impose from our end an insurance for that. That's a bit of the approach. Audience question. Please state your name and affiliation. Hi. Olli Johansson. I'm an open source developer, also active in OpenSSF and OWASP. The problem with those 40 organizations that create standards for us open source developers is that ECMA, CENELEC, all of them require quite huge fees. Who will pay them so we can take part in the standardization effort? I think this will be for Benjamin. Yes, I don't think I have necessarily a satisfactory answer for you, right? Yeah. So I will take note of your financial needs. But indeed, I mean, the CRA is just one of many pieces of legislation. So we do not shape the standardization policy.
We just use the standardization process for the CRA. But indeed, I mean, this is an important question and we are more than happy to look into that. Thank you. So we're slowly nearing the end of this panel. I'm going to ask a number of questions in succession and then we'll see if there's more time for audience questions. So to Omar, I'll ask, do you know about any other related legislation that is coming for this community that we should be waking up to? So take a moment to think about it. I'll get to you for the answer. So for Benjamin, I would like to talk to you about the guidelines. Can you be very specific about how people can contribute to the process of writing them? And there is this delegated act possibility about voluntary security attestation programs. Can you talk about what your intentions are, maybe how people can help? So my goal with these three questions to the two of you is to give people in the room a clear view of what they can do, should they have the time, the money, etc., to be involved in EU policymaking. So now I'll hand over to Omar. Well, I have no idea. I have to be totally honest with you. We are many, many directorates. But it's simple and it doesn't really require any money. It just takes time to check what the commission is doing. The various directorate channels, I mean, mostly DG Connect, but it can also be DG Grow, could be just whatever the directorate is. And then if you have a question, you are wondering something... I mean, don't say that to my other colleagues, but you can send an email to the units. And this doesn't cost any money. They will happily reply to you and give you any answer that you're seeking. There is stuff that you don't understand from legislation. There is information that you would want to bring to the attention of the commission itself. Our emails are open for that. And this is also our role, to have a look into what happens on the ground.
I mean, as I already said, we are legally obliged to consult. But we would do it anyway, of course. As the commission, I mean, we are very likely going to organize conferences where people can attend and bring their ideas to the table. We are likely going to have some form of expert group or a similar body where people that want to, like, be more involved than just ad hoc, but in a more structured manner where they can engage with us. And, of course, you will be seeing us at conferences. You can invite us to your events. We're happy to attend, maybe not always physically, but online. So there will be plenty of opportunities to engage. As regards the voluntary security attestation programs, so, yeah, I mean, the idea is basically to give those projects that are not directly in the scope of the CRA a chance to provide some form of assurance that the projects are secure, right? We know that many of these projects, they don't have financial resources. So the provision is quite open in that regard. It does not require the ones that develop a project to also pay for that program, but other people can step in. So, for instance, integrators that take an active interest in a certain component because they need it for their own due diligence, they need the assurance that it's secure. They could also team up and pay for that assurance. Now, these attestation programs, there is only a so-called empowerment in the law. That means that the commission is empowered. We are allowed to flesh them out. So they are not there yet. We don't have these assurance programs at this point in time. But the commission will be able to work on this. And for this, we will also need your input so that we can shape these programs in a way that they are useful for the integrators or users to have the assurance that they need. But they also take into account all these specificities of open source projects because they are often so different, right? 
The way they are structured and the promises or commitments that they can make compared to more traditional, manufacturer-based projects. So I think we'll take one more audience question and then I'll ask Chuck if she wants to do any reflection on what this means for Python, maybe. Let's see. I think there was a hand pretty early on. So name and affiliation, please. Hi, Vittorio Bertola from Open-Xchange, which is a 300-people German open source software maker. So the question is, well, first, this is creating cost, of course, not for security, because we have a flawless security record. We already spend all the money that's necessary on security. But for the bureaucracy that now you are introducing for compliance. So this is making us less competitive, and all our competitors are from outside Europe, including Google. So how is this going to be compensated? And maybe ours, we are a pretty big company, we can cope with it. But the French foundation for Debian that has to hire a lawyer, there are going to be costs. Are you going to put some money onto this, maybe to fund developers to cope with security issues or to fund the bureaucracy? And also, how are you going to avoid the international players gaming the system? I mean, it's way too easy. I see this happen for, like, the Googles and Apples. They create some initiative, which is a non-profit. They put the code into that. It gets outside of the CRA scope, or maybe gets the light regime, whatever. And then they don't have to support the cost of compliance. Well, we still compete with the same piece of software and we have to pay the full cost of compliance. So do you have any thoughts on this? Do you want Chuck to go first, or do you want to answer the question first? Yeah. Yeah, I mean, it's true, of course, that there will be some bureaucracy. I mean, no law has ever been created that doesn't create some bureaucracy. Okay, maybe the PLD doesn't, because it only hits you once something happens and not before.
But usually, of course, there is a certain compliance cost that's quite unavoidable. I think the competition concerns there may be a bit overstated, because the CRA does not only apply to European companies or manufacturers or open source projects, but it applies to anyone who is bringing, publishing or putting on the market those products in Europe, right? And we all know that Europe is a big continent. It's quite relevant. There are probably very few manufacturers in the world that do not place products on the European market. So they will all be subject to the rules. We do have some facilitations, actually, for small manufacturers, when we talk about actual manufacturers. So there is a provision that, again, is an empowerment for the commission. It allows us to create a form for a simplified technical documentation for small companies. So that means that small companies will only have to fill out one form, essentially, and the length of the form is somehow going to inform the expectations towards how much information you're going to provide. So I think that can help a lot; one single form makes your life much easier. And then we also have some funding calls. Actually, there are funding calls ongoing right now, until the end of March, that also aim at helping small companies deal with the implementation of the CRA. Thank you, Benjamin. So I think we are at time. I would like to thank our panelists for the courage to come here to talk to us, to have this conversation. They're not leaving yet, but I will ask you for a round of applause before we continue.
CRA & PLD: CRA conformance for Open Source Projects
Our next speaker for today is Marta Rybczynska, and I probably didn't manage to pronounce her last name, so I will be asking her to do that again and show how I was almost right but not quite. Marta will be talking about CRA conformance and the thinking that Eclipse has been doing around this, and she has a background of a number of years developing different solutions, but I think she also closely followed the CRA. You may have seen her article on Linux Weekly News months back, which was a very good summary of where things were at at that time. So without further ado, this is Marta, and enjoy her talk. Thank you, and you pronounced my name quite correctly, in fact. My name is Marta Rybczynska and I'd like to do a test implementation of the CRA in five minutes today. So let's go. The example is an open source ecosystem, quite a standard one, with a physical product to make things easier. Starting from the end, we have the final product that is sold to customers, and we have the device manufacturer of that product, and that device manufacturer is assembling multiple open source and proprietary elements, adding their own software to the whole thing to build their product. This device manufacturer can of course have multiple products, and they are not integrating one open source project; they are integrating upstream project A and of course a hundred other open source projects. Upstream project A develops a project under an open source license, and they have dependencies. They have a dependency B that is another open source project working in a similar way. So okay, here enter open source stewards. You have already probably seen the definition; I highlighted the important parts for me. A legal person that has a purpose or objective to provide support for open source. Okay, so what comes out of it is where stewards pop in in the whole thing. They pop up for the dependency B. They pop up for the upstream project A. That's pretty expected, and then a few remarks in there.
Very likely stewards will be foundations, especially if they hold the trademark on the project name. That is quite an obvious situation, but we also have situations that are a little less obvious, where we have to think between stewards or manufacturers or none of those. For example, there are for-profits that are supporting projects that are not critical to their income, like open-sourcing CI scripts, open-sourcing programming tooling for their board. Things that are absolutely not critical, that they are absolutely not monetizing. And we also have consulting companies, not giving names; there are many consulting companies that have been contributing to open source projects in a sustainable way for years. So how do they qualify? And when we add this: can we have multiple stewards for a single project? If we just take that definition of a steward, why not? There may be a foundation, and there may be a company that actually donated the code to the foundation and is still contributing. If they are not monetizing, why not? And then an interesting case: stewards have a definition, but stewards also have some obligations, and what happens if the steward cannot force the project, or they want to force the project but the main developers say, I'm not going to implement that, pay someone to do that work. What do you do? Question mark. Okay, and then we finish adding the CRA elements to our scenario. We add the due diligence work that the device manufacturer should do about the open source projects they are integrating. We have the conformity assessment that they should do while releasing their product, and we have the final user documentation that they are expected to release. And, well, mostly for the conformity assessment we have some challenges. Challenges and opportunities for the open source world: a final product usually includes dozens or hundreds of open source projects.
So manufacturers quite often use the same project in many different places, and many manufacturers use the same open source project in different places. So what makes sense, and what is logical, is to do the conformance work, to do the paperwork, all together in an open source way and release it under an open source license. Oh, there's an alternative: the big ones will be able to pay for the whole work on their own. The small ones, I'm absolutely not sure, if they include a hundred projects. So that will be it for me. Thank you.
CRA & PLD: rapporteur playback
So I'm trying to summarize what we have learned in this session. We had a great opening from Toby, who explained to us the importance of these standards that will be written to accompany the legislation. We had a lot of discussion about the 200 pages of text that is the law. And we will then have fun reading the 4,000 pages of text that are the implementation standards; a lot of the details will be in those standards and are still to be developed. He also pointed out that the CRA has landed now and is not catastrophic, which I think is an important point to take. It is also an opportunity for the FOSS ecosystem to step up here and to play a leading role as stewards. We have a lot more clarity with the separation of roles. We also have, for the first time, a major law in a major economic block talking about free and open source software and describing a specific role for open source software stewards. So I think that's a win. Next we had Rob walking us through the wonders of the hardware and software supply chain and how liability, especially strict liability, can work out in that, and highlighting that the approach in the EU PLD is one of strict liability. We had a panel where, besides the really interesting questions which I will go to in a second, we also had a very symbolic picture here in the front. We had the Python Software Foundation sitting in the middle, squeezed between the Cyber Resilience Act and the Product Liability Directive. And I really thought when I saw this, this is kind of the picture of what we're seeing, because we have a group of people that is making free software available to the world, trying to do the best, essentially working in the public interest here, and trying to see how we can make this work in the environment of the new regulatory frameworks. Now to the highlights here.
I found it indicative, in the question to Ben about who is going to be a steward and what happens if the steward is outside of the EU, that first he said, look through the law, which is the right recommendation, but he also said this is to be clarified further down the road. Yes it is, but that is just indicative of this uncertainty that we are currently in. And so it's the right answer, but it means that we need to stay on this topic and we need to get answers to these questions. I also thought that the question from the audience about what if we have a very decentralized open source community, maybe with a legal entity in France, but that's not really controlling anything that the developers do, it is just coordinating the work, was very pertinent, because this is exactly the gray zone between being an individual developer, being a loosely organized community, and then being a more centralized, well organized community that clearly qualifies as a steward. And I think there, it's not just on the lawmakers to make this clear; this is an area where the free and open source software community will have to sharpen our own governance norms to make it more clear which of these we are in this situation. So there will be implications on how the communities operate; this was one of my takeaways that was clear in this discussion. We focused a lot on the Cyber Resilience Act because it was such a pertinent topic recently, so I was really glad to see the Product Liability Directive here. Key takeaways that I took from the discussions are: you cannot escape the Product Liability Directive, I hope I'm quoting that right, and I think Omar also was able to say why. If a law protects the most vulnerable person in the chain, then it cannot easily exclude others. Who is the most vulnerable person will probably always have to be assessed in a concrete case.
He also pointed out that an important aspect of EU law making is that all these laws have review cycles; they are not written once to collect dust while we live with the consequences. They will be reviewed, and he encouraged us to engage in the review process and to provide our feedback. This is probably one of the most important takeaways today for the people in the room: let's stay engaged here, basically. How am I doing on time? Two minutes, okay. We talked a little bit at the end of the discussion period on how the European Commission will engage with the standards. Omar pointed out that the European Commission does not write the standards, what the process will be, and that the open source community is encouraged to actively participate in this. That's a big takeaway that we need to take. There was always the question of who pays for the additional bureaucracy, who pays for the fees to participate in some development organizations. I think we didn't really get good answers here today, but it was made clear that this is an open issue, because our organizations are almost exclusively non-profit organizations and that's additional cost. Additional cost requires more fundraising. I think the question of who pays for this got repeated a couple of times, including in the workshops. Let's go to, if we have a little time, a couple of highlights I summarized here. One is a big shout out to our own lawmakers here. The EU is approachable. It was here in this room. It was in multiple panels. It was willing to answer questions from angry developers. I really congratulate them on this attitude, so thank you for being here.
Another takeaway is why this room is so important: because, I think it was Toby who pointed out, we have a very diverse set of stakeholders, and normally in these processes the stakeholders that are well heard are the ones that have the means to do so, the big foundations, the larger projects, and we need to have the hobbyist community, the small one-person enterprises and all those also involved, and that's, I think, more present in this room than in the discussions until now. I want to point out one thing, and this is in response to Omar, who said software is only one thing out of a million products. That's true. But software is also in every product, and every product consists to 80% of, you know, open source software. That means we're not one in a million. We're 40% of the overall market if you divide it 50-50 between hardware and software. So I think we're totally worth being especially considered in the law, and I think we do have the impact that justifies that. Regarding engagement in standards, last statement here: the EC made a very direct offer to engage. I think we should take that; especially the foundations, and I speak for the Linux Foundation here, we will engage. There's one really positive signal here. We've recently been appointed to the multi-stakeholder platform for ICT standardization, which is kind of the consulting group to the commission here, and we will use our influence there to bring more free and open source software players into standards development. Regarding that as well, keep in mind that standards development is also a national activity in the member states. There will be representatives of your member states appointed into those standards bodies, and it's a great way to engage through where you live and get somebody into that. With that I would like to close. A big thank you to the panelists, to the moderators and to all the participants. Thank you very much.
FOSS policy engagement: a CRA retrospective.
So Martin and I, I think we met seven months ago, six months ago, eight months ago, something like that, when you started to get more and more interested in policy, and I was at Eclipse trying to make something out of the CRA so that we can solve all the issues that we've been discussing during the first session. And we quickly realized that we had backgrounds that are very complementary, in the sense that he has obviously this whole open source background, and I have this policy background in advocacy and how to advocate in terms of policy. So, with our forces combined, we decided to join our efforts. So here today I'll discuss what it's like to do advocacy, and then Martin will explain what it's like from an open source perspective to be new in Brussels and try to do something about it, and how we tried to organize the whole thing. So I think what we need to do here is share the information about how policy making is done in Brussels. In Brussels you have several institutions gathering together in order to create those policies. The main three ones, not entering into details of all the policy making procedures, are the commission, the one that is drafting the proposal, for example the one that drafted the PLD, the one that drafted the CRA; the two policy officers that were here this morning are the ones who did it, together with their teams. Then you have what we call the co-legislators. One is the parliament, the one that we directly elect, and the other one is the council, the Council of the EU, which is representative of all the governments of the EU. So we're talking France, the Netherlands, whatever country you're from in the EU, that's your government that is there. Then the question is, how do you actually influence those policies? There are, I'd say, several things to keep in mind as a community to do that.
If you want to influence that policy, the first is to get interested, so that you gather the knowledge on the policy, the reason why this policy is happening, the actual details of the text or the issue that the policy makers are trying to address. The second one is get organized, trying to actually have an impact, so that you have credibility and the policy makers don't see just one citizen coming to them but a group of citizens that is organized enough to represent a part of society. The other one is just write down stuff, so that you have clarity. Then you have to identify the different elements within the policy making process that can allow you to get involved. Here I'm talking about contacting policy makers, the right ones, being able to identify them properly. I'm also talking about getting support in your network, in the companies that you know, because the open source community, in the case of the CRA for instance, we were one part of the overall challenge for policy makers, but they also have to discuss with industry, car companies, large tech companies that are closed source as well, and all of this needs to be addressed. I'll hand over to Martin now to give details, and then we'll be exchanging throughout the presentation as well, to see what didn't work, what worked, what could have worked if we had acted differently, and then if you have questions we'll try to do that as well. I'll start with a quick promise. I was here last block and this will be the last time you see me here, and then I'll just sit down and other people can talk, because we take this rule seriously. What I will be sharing with you is my personal story of how I got here, because that was not really my plan. So my role is to work between policy and technology at NLnet Labs, which is a small R&D organization doing DNS and routing from Amsterdam.
And because this is not about me but about the lessons I learned, I'll give you the lessons first, then you can plug your ears for the remainder of my talk, and that's the deal. My lessons first: I think we were too late, and if you're too late, if the commission makes a proposal, then you're chasing the train, right? So we did a lot of train chasing, and that means you need a lot more effort than you would have needed if we were in front of the train at the station, discussing where the train would be headed. I also think, and I learned, that we cannot expect FOSS to organize as industry organizes, because if it turns out that it is nicely organized like a trade organization, you're probably not talking to the whole community but just to the industry part of it. And I think because of these facts the digital dossiers in the European Union need to change a little, or we need to figure out the mechanisms to do advocacy from the community, because for all the digital dossiers software is relevant, and for software open source is relevant. And I'll try to illustrate why I think these are the lessons, because I think I, and we all, got quite lucky on this one. So two last lessons: I think we should be talking more to parliament as a community. They should be the people that are most accessible to us. I think I personally didn't talk enough to parliament. And it turns out that even if you don't have any EU policy experience, you can figure this out if you have enough time, and if you're lucky then you can make this work. So now to the story, and you can plug your ears if you want to, because you just had the lessons. So in September 2022 I found out that there was this thing called the CRA, and I read it and I thought, hey, this is weird: open source is mentioned, which is great, but it's also clustered under the whole idea of non-commerciality, and we all know that there's also a lot of open source in commercial products, right?
So I sent some emails and I got lucky real quick, because I reached out to the Dutch digital civil rights organization Bits of Freedom, they connected me to a law professor, and they connected me to a wonderful recent graduate who had a lot of context on this law, because she had recently interned with the team that actually wrote it. And I want to thank Francine, because she delivered the mental model, to me and to a lot of people I worked with later, about how this actually fits in with the NLF, and that's thanks to her. So in October I contributed to the first blog by ISOC; they were the first, I think, to mention, hey, there's this thing coming. And I wrote a little newsletter saying, oh, I'm spending some of this time reading this stuff, and I sent a little tweet, and then I got lucky again, because the tweet was noticed by the commission, by Benjamin specifically, and he said, hey, maybe you should come and talk. So that was kind of a surprise, I mean, you write... So I was planning a visit to Brussels, I bought a t-shirt, super relevant detail, yeah. So I was there because my organization works in DNS, right, so I was attending a DNS conference in Brussels and I decided to drop by the commission, because I was invited. So I learned a couple of things: they have really nice rooms with really nice flags, and friendly people receive you, and then you get kicked out because that room was for the boss.
And what also really stuck with me, and Benjamin repeated this this morning, one of his colleagues said, oh, it's so refreshing not to talk to a lobbyist for once, and I was like, yeah, I've no idea what I'm doing here. But it was really constructive, and I'm not saying this to please people. This was November 2022, and we actually had conversations about whether compliance work would increase security of open source, and I was arguing it probably wouldn't, and we got the question, like, is there anything we can do that will increase security? And so we got into the conversation about, if a vendor is obliged to report a vulnerability, maybe they should also be obliged to send the patch if they have one. And that was just a conversation, and I don't know how it got into the law, maybe Benjamin will tell us someday, but I think at that point already they were thinking, oh, maybe you can tweak this a little, which I think was great. I also learned, and this is about talking to parliament: Benjamin and his colleagues were very insistent that I should talk to the co-legislators, and I was like, what, I'm here now, so co-legislators, others? And it turned out they were actually done already, because they made the proposal; they could explain the proposal, but they were not the party making changes at that point in time. So in December I visited Brussels again, and I came with a list of examples, like, hey, these are the ways that people write software and maybe interpret it as a commercial activity, and people told me, yeah, it's great, but you need to talk to the co-legislators. I was like, oh great, maybe I'm doing this wrong, but I'm now talking to you, so maybe you should come to FOSDEM.
So last year we had, in Janson, the session with Omar and Benjamin, and I think that was the first time we did EU policy at FOSDEM, and some questions were raised, because Benjamin told us, it's on camera, you should be talking to the co-legislators, and Alex, who is chairing this room today, was in the audience and asked, so what is your plan, because you're talking to the wrong people. And he was right, but I didn't know, so we were just trying. What it did get us, though, was that it started building alliances, because a lot of people started interacting with me and with others, Open Forum Europe became very active, and I think that was what FOSDEM did for us. Now, remember that blog; it did three things. I got in touch with people who did know how Brussels worked, and that was useful, because I didn't. I got an email from an aide to the Dutch senate, and they were interested to understand what this was about, so they started writing questions to the commission, they started writing questions to the Dutch government, and these questions are obligatory to answer, right, so it created some pressure. And we got a visit at NLnet Labs from the Dutch delegation writing the CRA from the national, like, from the council perspective, and it turned out that the Dutch were both very pro CRA but they also grasped FOSS, so that, I think, was a win, because at that point in time we, or at least I, could start to talk to the right people instead of the friendly people, because that was kind of how this works, right? So the questions from the senate helped get the civil servants to actually show up, because they need to focus their time on where the relevant problems are, right, and I got some help from people that are more experienced working with the Dutch government, so I want to thank Bert.
Then came some silence, because FOSDEM was over and I had talked to the Dutch, and then everything moved back into its own room, right, or just silence. And I got a rejection from parliament, because I applied to go to the hearing, and they said, yeah, we don't know you, there's limited seating and we're talking to the people we've heard from before. So that was a bit of a disappointment; I mean, it was a very kind rejection letter, but it didn't help me as, like, a random Joe trying. But Open Forum Europe also built up steam. Open Forum Europe was the place where a lot of us met on a weekly basis; Kiran made an agenda every week, I was in the corner, and he basically made sure that people were actually discussing the same topics. And we, or maybe I, got lucky again, because about every couple of weeks I got, from random people, leaks from the council discussion and from what the parliament was discussing, and I learned how to analyze these and just share, like, oh, this is what is being discussed. I also learned how not to write a position paper, because we wrote some, they were very lengthy, and I think they were completely ignored, so that was a good thing to learn. And we learned that the policymaker perception was that we were just trying to get out of scope, so I, and I think a number of people, shifted focus to challenge some specific assumptions instead of arguing about the scope. Over the summer I started emailing some MEPs, the commission, the council; there was a lot of silence and a bit of despair, but then in the October-November time frame suddenly communication started flowing. We got emails, not just me, anyone, asking for reviews, we got asked questions for input, there was a proposal floated about an open source steward that no one saw coming, and I think at that point we started to be in the position where we should have been before the train left: there were people engaged on a topic, talking to the
right people about the policies they were making, and I think the policy makers delivered, in parliament, in the commission and in council, by actually having conversations about these topics with people working on the specific aspects. So this is how I got to these lessons learned: we were too late, we cannot expect the community to organize like industry, and I think digital dossiers need a way to get us involved at a stage, and in a way, that actually works for the community. Oh yeah, and you should talk to the co-legislators. That's it for me, I'm handing over. So, Martin did a lot of things, weirdly, and I'm happy to hear about it, and I think it raises the questions. From the very beginning I sort of said: this is how it normally works, you get organized, you get interested, then you start writing stuff, and then you speak to the co-legislators. But the question is, and Martin raised it very well, he got lucky, I wish I could be that lucky in my life, but how are we going to get organized? Who is going to be the person that is interested enough in the open source community, so that we get the information at the right time? Who is going to write the stuff? How are we going to agree on the stuff that we want the open source community to say? All those questions we need to figure out, figure them out on our side, because Martin just said the co-legislators, and basically all the institutions, need to figure out a better way to interact with us, but how do we also get better at interacting with them? And that's the question that we're probably going to have to discuss today during the workshop and then the fishbowl, and that's, I think, the question of today in this specific session: what do we want the co-legislators to do, so that they can come to us more often and in a better manner, and how can we step towards them as well, so that we also get better at interacting with co-legislators and the commission? Thank you.
Public services interoperability: [begin workshop] Free/open source and Interoperable European public services
Okay, good afternoon everybody. We're going to start our next session. Everybody has found a seat, so we know how many people can come in; that's up to the people at the door. My name is Gijs Hillenius. I work with the European Commission Open Source Programme Office. This is for the recording. Okay, I will amplify myself. I will quickly run through the ground rules, which you have seen already. We would like you to make sure that everybody gets a chance to speak. Lina and I will be handing around the microphone; this is for the people online. We want to focus on finding solutions for the problems that will be presented to you soon. This is the third workshop. The title is Public Services Interoperability. It is made out of two parts: we will first discuss the Interoperable Europe Act, and we will then have a presentation on the Commission's open source strategy. And the rapporteur is Axel, who I think is still outside, so somebody should get him in. And with that, we are almost ready for our first session. Welcome. Okay, so hello everyone. Just a very quick introduction. My name is Lina Ceballos. I work on policy at the Free Software Foundation Europe. I just want to make a little introduction to this session. If you have been here before, you have seen that we are trying different formats, doing workshops, fishbowls. And this format we imagine more as a discussion, kind of like what we have had, so you don't need to move, you just need to raise your hand and I will bring the microphone to you. And I also wanted this to be not technically a Q&A, but more like: let's chat and let's try to find common ground.
Public services interoperability: rapporteur playback
Thank you. Okay, perfect. I'm sorry, we need the microphone. Okay, so now we're going to have a wrap-up from the rapporteur on the inputs that we got. Thank you very much, everybody. Let's take around five minutes, only five minutes, and then you can go. That's all right. Okay. Oh, yeah. I'm sorry. Okay. All yours. I'll do a gesture thingy. Hello. Okay. So, for the few of you that might already have forgotten what we heard about in the last two hours, I'll just do a quick wrap-up. But first of all, thank you all for putting forward so many questions. We're very happy to see so many people being interested in the public sector, and having so much interest in what our speakers presented. I think maybe some of the most interesting comments were, first, on the presentation of Issa before, clearly related to the question of the implementation of this act, and how to be part of this question of the interoperability assessment. Now, it was also made quite clear that this legislation is not an open source law, and that there was a lot of pushback against making it an open source law instead of having it just as an Interoperable Europe act. A lot of you also raised the question of open documents, of open formats, and how these will be integrated into this regulation; I think that was very nice to hear. And on the question of standardization as well, it's been quite clear that the board would be one of the main actors in the implementation. As for the second presentation, on the question of the open source strategy, you raised questions about inner source, and how to move away from inner source; I think it was also quite clear that it was a first step and that it was really helpful.
And on the question of incentivizing versus mandating the use of open source, the strategy used was clearly to try to understand how open source works and why people adopt it, and to avoid counterproductive effects by making it too mandatory or too strong. Yeah, and maybe just to finish, one last point. Oh, yeah, thank you. One last point was really interesting, on the question of eIDAS, thank you for bringing that up, and other regulations like it. So we talked about very specific policy papers and regulations today, but there's a lot of regulation at the EU level that is concerned with open source, and I think it was really interesting to learn how it can actually be shaped by reference implementations and so on. So, yeah, thank you all. I will finish writing a better report about that later on. And good luck to those who stay; we will start, I think, in a few minutes. So, yeah, thank you. Thank you.
Digital Services Interoperability: Intertwining EU telecom law, the DMA, internet devices and Free Software
Thank you. Thank you. My name is Niko Rikken, and I'll give it to Lucas to start off. Hello everybody, and welcome. I'm very glad to be here. Thank you to the organizers for inviting us to this talk. I'm very happy to see that there is a lot of interest in the DMA, because we have been working on this for some time, and I would like to contextualize all the problems that we already started hearing about here, about interoperability, about security, about having access to infrastructure, from the telecom perspective, because together with Niko I would like to present our experience of advocacy on routers. And I think that contextualizing this example of routers and router freedom in Europe can help us to understand a little bit better what awaits us when we start dealing with smartphones from the DMA perspective. So let's talk about end user control of devices, this contextualization of the DMA and telecom, and then I will give the word to Niko so he can tell us about our experience with router freedom. So the first question I would like to ask is: do we have control of our devices? Devices are becoming ubiquitous, we are using them for everything in life, but I have the feeling that we are losing control over our devices. We cannot change the battery, we cannot uninstall programs, we cannot even install programs. Today people call it sideloading, but on a laptop we don't call that sideloading, because we are just downloading and installing. But well, big tech now tells us that if we want to install something from outside the app store, we need to sideload. This is not good for software freedom, but let's talk about that. So I think that we are losing a little bit of control of our devices, and here are some key aspects of gatekeeper control of our devices. They are imposing online accounts on us: if we want to use our devices, they say, first you need to create an account with me.
And I think when I bought my Android phone, the first thing shown on the screen was: you need to create an online account. Then, when we use our smartphones, we are already trapped in vendor lock-in, because we have no access to third-party repositories and app stores. And this is really key, because it is on these repositories that we can find apps and exercise our software freedom, in order to populate our devices with our software. And last but not least, we are not free even to uninstall software that comes with our devices. And we see that sometimes on Android or iOS devices there is a list of apps there that is draining our battery, and they are doing stuff, but because it's proprietary and we don't have access to the source code, we have absolutely no clue what is happening there. So based on these facts I think that we are losing control of our devices. And therefore, some questions that I would like to put to the audience, and that perhaps in the coming moments of the workshop we need to answer, are about how we can re-empower users to have control of our devices. So we already heard that the DMA is a very important piece of that, and I believe so, I think the DMA is crucial, but we need to go further. First, we need to recognize that device ecosystems are mostly proprietary. The two largest smartphone operating systems, Android and iOS, are proprietary. And since they are so large, we are calling them gatekeepers, due to their monopolistic power over termination bottlenecks. So basically, everything that we need to do with these devices, we need to go through these companies. That's why we are calling it a monopoly over the core features of these devices, such as, for example, operating systems, browsers and app stores.
And of course, since they have this power over devices, they can hinder interoperability: they exercise tight control of APIs, apply proprietary standards, as we heard today, hamper functionality, and block access to drivers and hardware. So at the FSFE, the Free Software Foundation Europe, we have been working on a concept that we call device neutrality. With this we want to re-empower users and give control over devices back to them, through software freedom, eliminating vendor lock-in, and giving end users control over data. And last but not least, just quickly, because I'm a lawyer, I would like to point to what is happening nowadays in the EU, right? So for 10 years we have had the open internet regulation, also called the net neutrality regulation. And this regulation had very clear rules on internet access devices, so it applies to routers, modems and other internet access devices. Then in 2018 there was a reform of telecom law in Europe, called the European Electronic Communications Code, and it implemented some rules on operating systems and apps, and also on network operators. But now comes the DMA, with rules on devices, operating systems and apps. And in order to contextualize all these challenges, I would like to give the word now to Niko, so we can learn a little bit about how the DMA can connect to that from the perspective of routers. Okay, as a case study, yeah: router freedom. Well, my wife and I were really excited. We got our first house, we were moving in together, so we had to prepare for the move, and one of the things we had to do was get an internet contract. And we didn't think much of it, we did some comparisons and said okay, we'll take this contract. It was an all-in-one provider, and you could recognize it from the box: it was an all-in-one box, it did TV, telephony and internet.
But besides being a box that did everything, it was a modem and a router, and it did so badly. It failed quite a few times, and at a certain point it failed entirely, and we had to wait three days without any of the services to get a replacement. But after another failure, getting dropped out of an internet call, I said: okay, this is it, I'm going to get a router myself, so I know I can trust it and it's reliable. But I found out that this internet provider didn't really support that. It was odd to me, because previously, on a telephone network connection, they were even advertising that you could use your own router and modem, and some of my friends were doing so. Also, if you would call them for support, they'd ask: are you using your own modem? They would just assume that was something you could do, but not with this provider. So that was when I learned about router freedom, and of course there are a lot of benefits, here on the slide, some personal, but also some in the grander scheme of things, like competition and creating a healthy ecosystem of devices. Now, internet providers put up quite some barriers to prevent you from exercising router freedom, the ability to choose your own modem and router. Some of them actually have some technical merit. For example, the telephone network is laid out differently than a coaxial network. If your modem is doing bad things on a television network, the coax network, as the lines are shared, you might interfere with the devices of your neighbors. But that, of course, is why we have standards. If your device is compliant with the standard, you can get one from the store, plug it in, and it will work, and there's no reason to deny you those freedoms. Actually, one of my FSFE friends said: oh, it works just fine, I have these and these devices running at some friends', I can really recommend using your own router and getting the freedom you want. So it wasn't really a technical barrier.
Now, at the FSFE we've been at this for 10 years, as Lucas said, and we keep an interactive map tracking all the states and working with regulators to ensure that router freedom is actually achieved. One thing, in 2015, was the EU net neutrality regulation, and in it, it says that end users should be able to replace or change their device. So you think: okay, router freedom, everything is good. But not unless the regulators regulate. That's the main thing, and that's why we have this map, to keep up with the regulators and the states. Now, about two years ago in Belgium there was a consultation about modemvrijheid, vrije modemkeuze, basically the ability to choose your own modem. There was a consultation, and I saw this like: okay, we have to get in on this; even though I'm from the Netherlands, I care about this, and I can do it in Dutch. We got another volunteer, from Belgium, and together with Lucas we responded. We were quite alone initially, not having other parties that had the technical knowledge to go through this legislation and had the community behind them. But eventually, through a survey, we were able to actually engage the crowd, overcome the arguments, and we achieved router freedom. So, shortly, here are some examples of people in our community using their own routers at home, to also establish the practice of router freedom. There's also the benefit of free software on routers, but that's something else. And myself, I'm now happily using fiber with my own router. Lucas, if you want to wrap it up. Yeah, so yes, it's a big win when we have router freedom, and we fought against the operators, and they always came to say that interoperability is a problem, security is a problem. But with router freedom we could prove that this need not be a problem, and I hope that in our discussion of the DMA we can bring our experience and say that we can also overcome this problem. Thank you very much.
EU Policy Devroom Wrap-Up
So we're going to close the session for today, and for the whole day of sessions. I'll ask the devroom managers to help us organize, if they could come up as well, and then we'll let Simon close everything. So thank you everyone for staying right to the bitter end here. The devroom managers that have been organizing the day for you today are: Enzo, over there; Enzo was allowed to do this by clips. There's Deb here; Deb broadly did this because he wanted to. There's Heath here, from the European Union; it took him about a week to get permission from his boss to do this, so I thank him very much for that. Yeah, yeah. Let me see. We've got Alex here, who is from FSFE. Not in the room at the moment, we have Axel, from OpenForum Europe, who is out wandering the estate. And Martin, without whom we probably wouldn't have done it at all, right from the beginning; so thank you very much, Martin. So what are we going to do with all this? Well, the reason we've had a rapporteur in each of the four workshops is because we were told that if we want to get any traction at the European Commission, we need to give them a report. And so we've taken notes in all four of the workshops, and we're going to construct a report that gives the essential feedback from each of the elements. We're going to make it look nice, and we're going to work out how to subdivide it so that it can be used in each of the directorates where it will be a useful tool. And hopefully that will be a way that there will be lasting change, and not just a great weekend at FOSDEM. I am also very grateful to Alasdair Kergon from the FOSDEM organizing team, who has been our guardian angel in making this happen, and in making the keynote session that we had yesterday happen. Without him, we probably wouldn't have got it, and also last year. So we're very grateful to him as well.
And I'm very grateful to so many people, the people who are here now and the people who have been here all day, who have been so positive and encouraging and engaged so well with the European Commission staff. And I want to especially thank the European Commission staff, who have given up a weekend day, in some cases, in the case of Omar and Benjamin, two weekend days, to come and meet 8,000 friends who they didn't realize they had before. I'd like to encourage those of you in the commission to treat us as your friends. We're not lobbyists, we're subject experts. So please refer to us whenever you're preparing legislation, in the way that you would refer to a subject expert. Many of us are freely available to you whenever you write; we're on Signal, and on Matrix. So, behind the scenes we've also had some support. You haven't seen anyone from the council here today; we did try to reach out, but we didn't actually find anyone who was free this weekend. And we're grateful to the people from the parliament who supported us, and it's really very good to have had all three institutions present here. It amazes me. I've been coming to FOSDEM since 2006, and it amazes me that it's taken until last year, 2023, for this to show up at FOSDEM. But we are going to try and make sure that it remains an important instrument in creating end user agency and software freedom for people throughout Europe, going forward and in perpetuity. And with that, thank you very much. And there is a closing session in Janson.
Bad UX is Bad Security: Adventures in Qubes OS UX Design
Thank you. All right. So I would like to start with this sometimes very controversial notion: I want to convince you all a bit that the sentence up there, that bad UX is bad security, is actually true, because I often get people who tell me that's complete bollocks. I will later also talk a bit about Qubes, but I don't want to start with that; I would like to start with the general principles. So why does UX matter for security? The thing is, very often when I talk with hackers about security, people come to me like: but we don't actually need usability, people can figure it out, if you care about security you will figure it out. And that's not a good approach. One thing is, of course, that security and privacy are not things you should have to deserve, to work for, as if not everybody deserves them, only the smart people. But the other thing is: it doesn't matter if it's the fault of the user or the fault of the software. If we get compromised, if we get harmed, the harm is done, and I would personally like there to be less harm, less damage to the users. And that means making things more usable for people, taking into account how humans work, how human brains work. This is of course sometimes a controversial concept, but we are all human here, and we make mistakes. User errors are a real vector of attack, and a very important vector of attack. When we read about compromises of, for example, big corporations, very often the initial vector of attack was: oh, somebody clicked on a link, or somebody answered the phone, talked to somebody and said what they shouldn't, somebody made a stupid password. So we cannot just say: well, I did the tech side, all the remaining problems are user errors, not my department. This is not a good attitude. It's like: if the UX for the door, or for the door control process, is terrible, and you end up with "oh, nobody can remember the code, just put a sticker next to the door", then the person
who designed the security system failed. Yes, people shouldn't put a sticker with the password next to the door, but the person who designed the process also did a bad thing. This is not good. And also, we are not the mothers and fathers of our users. We should not be like: oh, you have to deserve this, you have to work harder, why are you not paying attention, bad user. We need to treat our users seriously, like adults who also sometimes have different priorities than our programs, not like children. Because the thing is, humans make mistakes. This is a truth universally acknowledged: we all do, we will make mistakes. We may have other priorities than using the software perfectly; very few people just want to use the software as well as possible, they want to use the software to do something. And also, the problem is that our brains were not exactly optimized for using computers, also controversial. Our brains have a lot of heuristics, a lot of shortcuts that they take. All the fascinating optical illusions just tell us this: our brain is not perfect at perceiving the universe and reacting to what's happening. We have a lot of iffy things in our brains, and this is something that we, as people who make software, need to take into account. People also take shortcuts. They want to do things fast, and if you keep noticing that your users keep taking a shortcut that is, for example, less secure, terrible even, there is something you need to do. We cannot just be like: well, stop doing that, this is bad, bad user. No. If, for example, people keep walking on the grass, then they probably need to get somewhere, and maybe that's not how this square should have looked. You have to take into account that people will want to get to their goal, not necessarily in the way that we would like them to. And again, even the smartest person in the room can be in a hurry. You can have a bunch of brilliant engineers, brilliant physicists, and they
may make stupid decisions, and they may sit there and be like: yeah, it can't be that bad, right, this one time, what could possibly go wrong, it's not that terrible. And then something exploded, oops. We have to take this into account. We cannot make our software with the assumption that people don't want to make mistakes, that we'll get perfect users. That's just impossible; that's not how humans work. One of the big things that I find very important for designing security-related processes is inattention. That is, we generally just notice the thing we care about, we don't notice everything that happens in the background. This is not a bad thing, this is very useful for our brains. It's called the cocktail party phenomenon: most human beings can understand a conversation in a very busy room, at a cocktail party, because our brain is very good at going: this thing I care about; all the rest, not important, not my thing. But this is very annoying when you are trying to design a good process for security, because it means that a small red blinking light may be ignored, an error message may not be read, because the person just cares about one thing. And I'd really like to refer you to a psychological experiment that demonstrates that this is how humans work. It's called the invisible gorilla. In the experiment, people were asked to watch a short film where a bunch of people were passing a ball, and were told: count how many times the ball is passed. At the end of the short film people were asked: okay, how many times was the ball passed? Cool. Did you notice the man in the gorilla suit walking around? And 50% of the participants did not notice the man in the gorilla suit, because they didn't care about it; they were told to count the passes of the ball. So, gorilla: there was a gorilla. And that's how humans work. We cannot design our secure processes thinking that people will pay perfect attention to everything all of the time. That's just not
how our brains work. And I like to show it with the example of an error message. This is a LibreOffice error message. This is what a designer or a programmer sees: there's an error message, an explanation of what happened, all very useful things. And this is what a lot of users see, because what they want is to get to the file, and there are some words, and they're annoying, because they are stopping them from getting to the file: please give me my file. So it's just a bunch of annoying red stuff and a big button that says: go do my thing. And then the person opens the file and is like: I cannot edit it, something's wrong, what happened, was there an error message? And I know this is annoying when we are designing and making things; it's just: read the error message, why are you not reading the error message? But we have to think about communicating things not just in error messages, because a lot of people will ignore them, because they don't care about them in the moment the error appears. Okay, so that was my introduction on human brains: complicated. What is the thing I'm working on? This is Qubes OS, a reasonably secure operating system. We don't say it's perfectly secure, because nothing is perfectly secure; don't use computers if you want perfect security. And Qubes is a fairly complicated thing. It's sort of a meta operating system, which means that it has a bunch of virtual machines talking to each other, and everything's isolated: this is my virtual machine that has my devices, this is my virtual machine with my work, everything is compartmentalized. And the thing is, we are trying to make it actually usable for people, because you could have done this partitioning of things into virtual machines manually, but it would be such a pain to actually make it work. Qubes provides the layer that allows you to actually use it, to get all the security of really strongly isolating the things you're doing, but also being able to use it without writing
pages and pages and pages of shell scripts. This is a slightly cut off but mostly visible diagram of how Qubes works. You can see a lot of different virtual machines, called qubes, because we are funny like that. In general, for the user, there is a bunch of system stuff that does all the important system things, and there is a bunch of user things: this is my qube for work stuff, I have my browser, my LibreOffice, whatever; I have my social media qube. And those two qubes, those two virtual machines, don't know about each other. They can talk, they can share things, but if I click on a stupid link in my Facebook account, it won't compromise my work, which is very nice. So that whole idea of providing this separation is very, very nice, but it leads to a very complex usability situation, because you don't have just one operating system, you have a bunch of them smushed together. That's not easy. That's why we are providing a lot of interesting tools to make the process of using those things together a bit easier, but also to still maintain some security. And I want to discuss two things that we are doing that I think show in an interesting way how this can be done, how you can make things usable but also think about security. The first thing is copying and pasting. In a normal system, Linux, Windows, whatever, you select text, you press Ctrl+C or select copy, the text goes to the clipboard, you press Ctrl+V, and the text goes to the new place. This is of course terrible from the security standpoint: there is a bunch of attacks that target your clipboard, that steal things from your clipboard, or put things in your clipboard that should really not be there. Qubes makes it a bit more complicated, sorry for the slight cutoff, this is some technical problem. First you copy text, but this lands in the clipboard of the virtual machine you copied it from, and all the other virtual machines don't know about it. To actually move it to another virtual machine, because for example on
your private Facebook you found this fascinating link that you have to share with your co-worker, you have to press Ctrl+Shift+C to copy to the global Qubes clipboard, and then Ctrl+Shift+V to copy it into the other VM. This is a bit more complex, and yes, we theoretically could have done this more easily; we could just always copy everything, but then we'd get all the security problems that would cause, all the issues where one thing could steal the clipboard from another thing. That's not what we want. The introduction of this separate step also means that when people are trying to copy and paste things in Qubes between different virtual machines, they have to stop for a moment and think: do I need to do that, is this what I want, why am I doing this? This is something that forces you to stop and pay attention for a second to this process, and that leads to slightly better decisions with respect to security. Of course it's not perfect. Some people get very much used to it; it becomes automatic for them: yeah, this is yet another step, just press the keys very quickly. And that means that of course further security is still needed; we have to provide more layers of configuration, of information about what's going on. We do have a whole complex policy that allows the user to configure it, and there's a lot of text here, and a lot of you will say: nobody reads that. Yes, that's why we put it in the settings, so only the people who want to customize what's going on actually go and read it; the other people probably won't, because they don't care. But if you actually care enough to want to learn a bit about what's going on, then you go to the settings and read it, and then you can specify, for example, what can copy to where and how to control it. So we are making the process of copying and pasting a bit more secure by adding this additional step, leveraging those two mechanisms:
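The configurable policy mentioned above can be sketched as a qrexec policy fragment; this is only an illustration of the Qubes 4.x policy format, and the file name and qube names ("vault", "personal", "work") are made-up examples, not taken from the talk:

```
# /etc/qubes/policy.d/30-user.policy  (illustrative path and qube names)
# Never allow pasting the global clipboard into the 'vault' qube
qubes.ClipboardPaste  *  @anyvm    vault   deny
# Ask before pasting from 'personal' into 'work'
qubes.ClipboardPaste  *  personal  work    ask
# Default: ask for everything else
qubes.ClipboardPaste  *  @anyvm    @anyvm  ask
```

Rules are matched top to bottom, so the specific deny and ask lines take precedence over the catch-all at the end.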
a technical one, but also making people think for a second about what's going on. The other thing that we are doing that I think is very interesting, this is current work, is devices, things you connect to your computer. They are evil; a lot of them can be very malicious. You never know what actually happens within the thing that you are connecting to your computer: maybe it is actually a USB stick, or maybe not, maybe it's some more malicious device that's just masquerading as a USB stick. You know, it's very complicated with them. And even those devices that are not evil can very often do far too much. For example, microphone, camera: they are very powerful things, they can record a lot of things that we really would not like them to record. And of course our browsers, our programs, swear to us that nothing malicious is ever happening, but some people don't think this is a sufficient level of security, and for many people, well, attacks can happen, and we would like to be protected against them. That's why Qubes OS isolates all the devices in their own qube, and the user can decide: okay, my camera, I want to connect it to this qube, this virtual machine from which I'm making calls, but not to the one for work, because I want my boss to have absolutely no chance to see that I'm working in my pajamas; or my microphone can only be connected to this qube, not to the other. And the problem with devices is that the initial user interface for handling them was made by engineers, and it's not very friendly. There are small things, there is a list of stuff, a lot of complicated technical details of what's coming from where. One USB stick can, for example, appear multiple times, for very good and sensible technical reasons, but it's very annoying when you have to figure out which one of them is the thing you actually want to use. You have a list of qubes you want to connect to, which is also very small. And I ended up with this, and I decided to ask my users: okay, does this work for you, is
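The engineer-made interface described above has a command-line counterpart; as a hedged sketch of what attaching a device looks like (the qube names "work" and "sys-usb" and the device identifier are illustrative, and these commands only run in dom0 on a Qubes system):

```
# In dom0: list USB devices exposed by the sys-usb qube
qvm-usb list
# Attach the device identified as sys-usb:2-1 to the 'work' qube
qvm-usb attach work sys-usb:2-1
# Detach it again when done
qvm-usb detach work sys-usb:2-1
```

The opaque identifiers like `2-1` are exactly the "names that consist of numbers and letters" the talk says users struggle to map to a physical camera, microphone, or stick.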
it good? And a lot of people said no, this is terrible, because I keep making mistakes: I want to connect, for example, my USB stick to my development qube, but I keep connecting it to my work qube, because those things are very small and it's very easy to click on the wrong one. And the thing is, yes, it's a user error, it's not the fault of the system that the user clicks on the wrong thing, but we would like the errors to be less common. I know it's a user error, but I still think we could make it easier for users to make good decisions. And that's why we're working on redesigning it, and I think this is a decent example. This is not yet working in Qubes; this is incoming, it will happen very soon once I finish working on it, so extremely soon. We are changing things to, one, provide more information, which is another thing that a lot of users told me when I started actually doing user interviews: yeah, I know I should know that, but I have no idea which of the devices I see listed is my camera, because they all have names that consist of numbers and letters, randomly. Maybe we can actually show people which one of the things is the camera, which one of the things is the microphone. That's why icons actually show what's happening, that's why there is much more space between different options, and that's why the options are actually described: not "guess what's going to happen", no, now I'm using actual full sentences to describe what the thing is doing. And yes, this is basically a visual update, right? This is not a technical change, this is not a deep delve into the back end of how Qubes handles the USB stack, but this is a change that a lot of people, when they saw it, said: oh wow, now I think I will make fewer mistakes, now this will fix a bit of my problem. But in the end we will have a more secure system, everything will be better for me as a user, even though it's just a visual change. Of course some people are like: and this
is terrible, too big, why does it take up so much space? But unfortunately you can never have everybody be very happy; this is basically the same. Okay, so as a final word on these two examples, and generally, I would like to say a bit about how to design with security in mind, if you're a designer, or if you're a programmer making things that want to be secure. Design for human error, design for mistakes, not just for success. Take into account that people will do things badly, people will be in a hurry. If you ever want to design a process for a thing that's supposed to be used by a human, imagine that your user currently has their six-month-old baby yelling on one side and their cat puking on the other side, and you want to design a thing that will not completely compromise them even if that unpleasant situation does happen. The things that are secure should be easy; breaking security should be harder. The shortcut, the easy way, should be the secure way, because people will sometimes go around. Also, we are open source people, we like to go around sometimes. So the going around, the insecure way, should be harder. Design for actual human beings. Don't think that if it's a user error then it's not our fault, because unfortunately a user error is also our fault, not just the user's. Thank you. Five minutes for questions, please. Yes? Isn't it creating more friction in the process? Rather than focusing on adding more layers to force people to read all the security issues, why is there not a focus on the display of the error messages, one that makes the user read them more properly? Okay, so the question is: why add more friction instead of just making better error messages? Two reasons. One reason is that sometimes it's difficult to tell apart a user error from what the user wanted. If I copied what I wanted into a wrong qube, this is a user error, but it's not an obvious error that can
be detected by the system. And the other thing is, friction is not always bad. We like to think that friction in design is always a bad thing, but friction also forces people to stop, to think for a moment. Sometimes, when we design the system so that people have to make certain choices, we give them a large variety of choices, but there are some choices where we have to give them a chance to actually make those choices, and friction allows us that stop to make a choice. I don't want to add friction to every copy and paste: within a single VM there is no friction, it's just when you're going outside. And the friction is by design, also to show that this should not be a common operation, to discourage making shortcuts. Yes? Do you have some methods to encourage users towards secure behavior? For example, what prevents me from logging into social media on my work qube? So the question is how to prevent users from making bad decisions security-wise, for example logging into social media at work. In short, we don't have a technical solution for it. We just have the solution of describing, in tutorials, how you can use it, sharing the setups of the developers, of the core users. So: educating people, encouraging people to use different colors for different environments. Also, if you want, you can limit yourself by limiting the network access of different qubes, so: okay, this one is firewalled and cannot access Facebook or whatever. We don't have a good system-wide solution; this is still a decision that the user can make, has to make, also because the user needs to divide their work into those virtual machines themselves. This is something that the user generally has to do. So the next question is: do I have any favorite examples of UX in security? Oh, this is a very difficult question. I don't know. I'd say that I really like how those USB tokens for U2F authentication work. I really like that process, which adds just the perfect amount of friction with the
need to press a button. So I think this is my favorite example. We have to finish. Thank you.
Penpot 2.0 is here!
Change the slides. Magic. So we have Pablo from Penpot, talking about Penpot 2.0. It's here. It's here. So excited, so excited to be the last talk of the day. Also, there's some nice free chocolate. Not a call to action yet, just free chocolate there for your way as we exit. But we will have a bit of a birthday party here with all of you, because we're turning four today. So yeah, we have some waffle here from Brussels, instead of a paella from Spain, probably, that we're using. But basically it's very important, very exciting. Every time we come to FOSDEM... I think my first FOSDEM was 2005, 2006. But it was only four years ago, in 2020, that we announced this was going to happen. So every year we come here and say this is something new, and then we had Alpha, and then Beta, and then 1.0, and now 2.0. So very exciting. So I'm going to take a bit more water because of the excitement. So we're going to discuss Penpot 2.0, and then it's time for a hands-on demo. We'll see how it goes, the staging server, and the Wi-Fi. So for those of you who might not know about Penpot: Penpot is this open source design platform for design and code collaboration. And we like to discuss, and this is very, very relevant for the open source design track, design and code collaboration. Perfect talk by Ariel, perfect takeaway also for Penpot 2.0. So we believe we bring design freedom for product teams, and we do so in various ways. The fact that Penpot is open source is definitely a key ingredient. It gives you privacy, security and customization. You can hack it, you can do whatever you want with it. You can use the cloud, you can self-host it. We are pro open standards, so that means everything is SVG and CSS native. We make sure that we're not creating yet another proprietary format. We want to have this sustainable design and sustainable collaboration with code.
And we do believe that it's important that whatever tool we build has to bring something that was not present in existing design and prototyping tools in the past, which is this collaboration between designers and developers. It was felt that, similarly as some good code tools are not welcoming to designers, design and prototyping tools were not welcoming to developers. So what if we fix that? That was the whole idea behind creating Penpot. The next generation of design tools should be about collaborating on design and code. So this is the basic intro on Penpot. But we are here to discuss Penpot 2.0. This is a major release. You could call it Penpot 2.0 or Penpot 10.0, because it's just a massive change in just one year. So we're going to cover the UI redesign, we're very proud of that; the new component system, wonderful new inheritance and overriding and all that stuff; CSS Grid layout; and some other stuff. So let's see how the Penpot 2.0 UI redesign looks. Like this. No, that was 0.2. But that was only four, five years ago. It is elegant, it is simple, wireframe-y. Of course, the reference. Anyone gets the reference of the picture? No? Willow? Willow fans? No? Ah, I see, Willow fans. Yeah, that's my age. And so, no, this is Penpot 2.0. Look at this. It's very fancy, right? I would like to have the light theme, you know, but perhaps at this time that would not be smart for me to ask. This is just wonderful. I mean, this is a design that is being created with a beautiful interface, because open source and beautiful go along. What was behind this whole UI redesign? Well, this is a design and prototyping tool. It has to be interactive, it has to be real-time collaboration, it has to be multiplayer, and it's a productivity tool after all. So we needed to reduce the cognitive load. It's so tempting to make many things be achievable in many different ways.
So, in terms of real estate and how you would achieve things, goals, we reduced the cognitive load through heuristics and through research, and just intuition sometimes. By the way, the picture you are looking at is a portion of our design system, which is completely available; I will show it in a minute. We also improved accessibility. We are strong believers that accessibility should be a de facto standard for everything we do. It is absolutely challenging to include all of accessibility in a design and prototyping tool, since it is very visual, it's a very complex tool, and it has a lot of micro-interactions, and we already discussed cognitive load. But we try our best for the size of the team that we are, you know, just 15 people in the broad team. And still, we do want to pursue that. So, major work here was color, of course, and typography and size and relative shapes and that. Pretty basic, but still, I think, worthwhile. We will continue to do that. Of course, you should be able to use Penpot to design accessible UIs, but here we are discussing Penpot itself as an accessible tool. And I think it is beautiful. I really honestly think it is beautiful. Probably one of the most beautiful open source tools, but also one of the most beautiful tools, period. Okay. Sometimes it's just about pride, and why not, right? So, here we are showing just a crop; we are going to see just the theming, dark theme, light theme, in case you are fans of one or the other. It's not important what we are showing, it's just so that you can see how different Penpot looks now that we have support for both dark and light themes. And of course, you could create your own theme, whether it's a corporate theme or just some other theme, because now we have the possibility of having n themes. We just created the two most common ones, okay? Before I go into that... okay, that's probably... so, here, that's for later.
So, you can actually enjoy our design system as a library, if you want. I mean, this was meant to develop our own UI, but if you like it so much that you would like it to inspire your UI, why not? So, we have many libraries and templates available, thanks to our great community that continues to provide amazing stuff for everyone to reuse. This also will be available, and I think it's pretty cool. It follows the typical design system pattern, and all that. So, we use that, okay? Yeah, okay. New component system. A ton of requests, basically, had the underlying theme of a new component system. For those of you who are not familiar: it is now a thing in design, I mean, not just now, but in the past few years, to make everything highly reusable, similarly to how we developers have thought about how to code. And so, part of this design work has borrowed terminology and abstractions from the code and engineering world into design, because it works. Design is also a science, and so it is easy to borrow those concepts. What we wanted was to make it easier for everyone to build the main components, the original elements that are like the ideas of the components, and then very easily track the copies of those ideas. Penpot 1.x did not have this metaphor. It was much more abstract, and you had some trouble finding where the ideal component, or the master component, the parent component, was. Now, it has a kind of physical representation, sorry, not physical, but you know what I mean. And it is easier to track those components, the main version, and then follow their copies. And that comes with all sorts of very cool ideas about inheritance, overriding, overloading, and also using a copy to reset the ideal: if you are so happy with a copy that you think every other copy should now follow this copy.
The way you do that is that you basically reset the main component through that copy. So, by the way, raise of hands: who here is a developer? Okay. And who here is a designer, or does design? Okay. Both, both. Yeah. The question was not exclusive. So, then I have a call to action for you developers in the room. The proprietary design tools are coming for you. They're coming for you because you represent ten times, well, here much more, but you represent ten times the market size of the designer world. So it's now obvious to the proprietary design tools that you are the next in line for being milked. I hope you have strong opinions about that. Also, with the updating workflow it is now much more obvious what's going on. The synchronization, and I hope we'll be able to show that during the demo, is obvious. I mean, when you are synchronizing things that are right there, simultaneously, that is very obvious; but also, and I don't think I'll have the time to show this during the demo, when things are synchronized behind the scenes, you get notifications, updates, and you can decide to dismiss some synchronization and perhaps apply it later on, when it's a good time. So, that has been improved. And then we also have new capabilities, very obvious ones, very tangible ones. Annotation, which is: okay, I'm going to document this component, whether it's the main component or the copy of that component. But also the swapping, the quick swapping. Because when you have everything as a component, you sometimes want to swap that component for another one that is also capable of taking the role of that component within that context, okay? So, here's a very simple example where you have a very simple landing page. The main component is the one top left, right? You know that because it has a specific legend on top, so it's very easy to spot.
And then the rest are copies, and the synchronization is instantaneous. This is really like capturing what someone is doing on the Penpot canvas. This is a very simple example, of course, but it's good for an animated GIF in a presentation, okay? So far so good, right? Yeah. This is the component swap I was discussing a minute ago. So, here we go from an image gallery to an image gallery with title and description. Basically, someone decided to have different components that could fit into, in this case, this is an app, looks like an app. But what if I try this, or perhaps in a different context I want to show different stuff? For whatever reasons, you should be able to have your components easily swapped. And, of course, this is easy to navigate. Here, if you pay close attention, you see there's "content". The content is basically an arbitrary categorization that the designer or the user used. But you can then go back one level and find everything in your component library. This was just to show a small list, okay? Very good. And, wow, we have CSS Grid Layout. Not the old grid layout, because that, of course, we had from probably 0.2: the print-media standard of columns and rows. This is the CSS one. Why is this so important? Because this delivers on a promise. We said: if we really want to unite designers and developers around one language, what if we're able to bring the code language, the expressiveness of declarative programming, that is, of CSS, natively into building a design, without using the code, just using a user interface? Okay? I see some people saying "aha". So this is a complex theme. It would probably deserve its own, I wouldn't say FOSDEM track, but perhaps a talk: declarative programming, declarative design. So if you want to read more on this, just search for declarative design.
And declarative versus imperative is about expressing rules to get to a point, but not exactly how to get to the point. And CSS is perfect for that, because the browser understands the rules and tries to get to the goal its own way. So when you're designing for the real world, one could argue that imperative design is problematic. It's not fluid, it's not reactive, it is limited. But declarative design is able to be okay with a fluid world, with uncertainty. And CSS embodies that very keenly, finally, after the spec of CSS Grid layout came in 2019. So it's very recent, and for an open standard, that's very recent. So we started with Flex layout, which is about alignment, and that was present earlier in Penpot. Flex is about one-dimensional alignment, but Grid is about two-dimensional. So with both, and you can combine them the way you want, you have almost total freedom. You can do all sorts of compositions: Flex, Grid, and you can nest them the way you want. And Penpot was able to build that natively. For the first time ever for any design tool, we decided to trust the code standard instead of creating our own interpretation of how design should be created, with new vocabulary and terminology. So this is very opinionated software building here. So here you are seeing edits in a grid, again cropped and very simplistic for the sake of it, and, if you're familiar with CSS, you're basically seeing CSS visually. And you can see how the code next to the UI is automatically being updated, because it's synchronized by design, in a way. We actually started with the code, created the user interface, and it is trivial for us to impact the code. This code is part of Penpot's user interface: you can go to "inspect code" and see that. I just pasted it there, synchronously. So it gives Penpot users the possibility of declarative design, which is amazing.
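The one-dimensional versus two-dimensional distinction drawn here can be illustrated with a small CSS fragment (illustrative only; the class names are made up, not taken from Penpot's generated output):

```css
/* Flex: one-dimensional — children flow along a single axis */
.toolbar {
  display: flex;
  gap: 8px;
  align-items: center;
}

/* Grid: two-dimensional — rows and columns declared together */
.bento {
  display: grid;
  grid-template-columns: 2fr 1fr 1fr; /* FR units, as in the demo */
  grid-template-rows: auto auto;
  gap: 16px;
}

/* Declarative placement: you state where the item goes, not how to get it there */
.bento .hero {
  grid-column: 1 / 2;
  grid-row: 1 / 3; /* spans both rows */
}
```

Both containers are declarative in the sense the talk describes: the browser resolves the rules against the available space, so the layout stays fluid when the viewport changes.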
And all those YouTube tutorials of "you designers, you know about CSS, this is the code tutorial, it's easy for you, just follow this code": no need for that anymore, because you can just use your visual language, knowing that it is expressed as code instantly. So I would like to ask for a round of applause for the team for getting us to this point. Demo time. So this is, I hope you can see it, yes, a very simple design. I'm going to just select this and make it like this. So this is, I mean, don't pay attention, this is a bento design, it's trendy, that's not important. So this is a grid. I'm going to actually edit it, and I'm going to just go and add a column to the right. Okay. Notice that we are using FR units by default. You could use whatever you want: auto, pixels, it doesn't matter. It's fine like this. And by the way, I forgot to duplicate this file, so I'm messing with someone else's file by now. We have unlimited undo, so, yeah. And then what I want is to pick this element, and I'll just put it here. It automatically understands the slot. With this one, I'm going to do something different, because I'm going to create a component out of it, so Ctrl K. And I'm going to duplicate this component, so Ctrl D. Now I have a copy, and I'm in Components. You can see that, sorry if you cannot see it very precisely, but there are different legends there on my canvas. So what I'm going to do is just move it here. And you notice it doesn't really react to the fact that there is more space available. And this is a reactive design, so I want it to do that. So that's easy. I just select it. This is a copy. And I can go here and just... No, no, no. This is not going to happen to this demo. Okay. Just once, just the mouse. And I'm using a trackball; it should be easy. Everybody stop breathing. Okay. This requires a certain level of, you know, precision.
So here I will go for "just use the space you have". Okay. But notice, notice, there's more. Notice that the main component did not react, because I overrode this attribute, which is fine. But if I go to the main component and change, perhaps, let's go for something silly like the fill here, okay, and I change that, then the copy does react. This is the synchronization that I was talking about. So I'm going to use something like this, I don't know. That's it. There's more. Because, and this is something that happens a lot, I go to the copy here. You can, of course, navigate all this, but if I go and select the button and I change the fill, yeah, like this, let's say, something like that. And now we all pray to the demo gods, okay? I can decide: okay, I like it so much that I'm going to update the main component. Okay. And I update, and that happened. And now, if the main component had copies across not only this file but elsewhere, if I used this as a library, those would get the notification that the main component has changed: do you want to apply those changes to your copies? This is very nice, right? And so to finish, because I know I'm out of time, one last thing. I have here a CodePen. I always like to end with something like this, outside Penpot. So I can take this, I can go to inspect, I can go to code. All this is there for you to enjoy. So I'm going to copy the CSS, just copy this, okay? It's going to take a while, because there are a lot of images that depend on the Wi-Fi. We now have HTML on top of SVG, so you pick what you want. We don't care, you know; everything is fine as long as it opens up. We copy that. So what this is doing is, if we are telling the truth, you should be seeing, the moment it downloads all those base64-encoded pictures, the design. Let's see. Yeah, that's what I'm trying. I have Wi-Fi, it works. Yeah, this is, well, I'll send you a link.
But basically, this is what you need: the HTML, the CSS, and it's built exactly to the perfect standards, because nothing I did was not possible using CSS expressiveness. So there's no way you're going to mess it up. It is a one-to-one perfect match. So that lost-in-translation, back-and-forth issue that designers and developers typically express having, very frustrating, doesn't happen with Penpot. And of course, this is real-time collaboration; I'm just in single-player mode here. So, quickly, to finish: we saw the UI redesign, the new components, this is right now, and we have some other cool stuff going on. And the question is: when, Pablo, when do I get Penpot 2.0? It's coming, it's coming. Wouldn't it be nice if we had it today? We have a staging server. If anyone is interested, come to any of the Penpot team members. We can give you the secret URL, which is basically quite simple, and you can try it out. But it's in the next few weeks, basically. We're aiming for February, so it's still the FOSDEM month. So very, very soon. So thank you a lot, the team, the community, and everyone. You can find more stuff there. Thank you everyone for staying until now, and I hope you enjoy all the work that came from Penpot. And now, before we leave, where the whole track ends: anyone has a lighter? I do not have a lighter. So. May I steal the light? Yes. You didn't sing happy birthday? Yeah. Okay. Hello. How's it going? Oh, yeah. All right. So it's so exciting. This is our event; it's basically how we were born. So it's very exciting to do this. So I wish everyone wishes something nice for their open source project, for Penpot, for FOSDEM, for the community. So it is like this. Yeah! Take it. It is chocolate. It is chocolate from Penpot. Thank you very much. Thank you.
systemd-boot, systemd-stub, UKIs
This is my second talk of the day. The first talk was on a somewhat similar topic, but it focused more on the distribution side of things, how to build all this stuff. I welcome you to look at the video if you have some time later, because what I'll not be able to answer in these 20 minutes, hopefully the other talk might. So I will talk about systemd-boot, systemd-stub, UKIs, what those are, and why you should all switch to that, of course. So let's jump right in. systemd-boot, what is it? We usually call it a boot loader, but it actually isn't. It's a stupid boot menu. A boot loader, at least in my view, is something that is actually capable of loading sectors off disk, parsing them, and then eventually setting up the boot params and jumping into the kernel. We do nothing of that in systemd-boot. All we do is give you a menu, you pick something, and then we chain-load some other UEFI binary. So yeah, it's a fancy boot menu, nothing else. That makes it on one hand dumb, but also nice and robust. It's built around this model that you have drop-in files inside of a directory, which I guess is very different from GRUB, where you have these boot scripts and things like this. Our way to configure things is supposed to be as simple as possible and modeled after how we started doing things in package management and classic Linux distributions: this pattern where you have a configuration directory, or a directory where you put the desktop files of the desktop into, and things like that, where every package can put more stuff into it, and then the combination of all of them is what makes the system work. And we just said, okay, let's do it the same way. Have one directory in the ESP, and people that want to populate the boot menu just put one file in there, and that's what populates the boot menu, and that's already it.
So, yeah, it basically takes this Linux pattern from package management and brings it to the boot loader. systemd-boot is UEFI only, right? That makes things nice because it basically means we don't have to actually do any boot loading. It implements something we call the Boot Loader Specification, which is a spec we wrote ourselves. It basically tries to define, in abstract terms, where to place kernels and where to place descriptions of what to boot. It supports two kinds of menu entries, type one and type two, we call them. I think the focus should nowadays always be on type two, because they have much nicer properties regarding measurement, cryptography, and things like that. But type one still exists, and people will continue to use it because it's more flexible: it allows you to configure the individual items manually. A type one entry is a configuration file, basically, which just says: use that kernel, use that initrd, use that stuff, and things like that. Type two is something where the boot menu items are just binaries, UKIs, as we call them. We'll talk about this later in more detail, but the very short version is: it's a kernel glued together with its initrd and a couple of other things, and then turned into one UEFI binary. So it basically takes much of the early state of the OS and makes one thing out of it that can be updated as one, signed as one, measured as one, loaded as one, which makes it robust and secure and very nice. Since Friday or something, systemd-boot is also eligible for signing; SUSE actually did this ahead of time, but now it's officially okay, so you can get it signed by shim with the same infrastructure, exactly like you can get GRUB signed. systemd-boot is supposed to be fully automatic, no configuration, right? There are no boot scripts, no nothing.
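As a rough illustration of the drop-in model described above, a Boot Loader Specification type 1 entry is just a small file in the ESP (the file name, title, and version here are made up for the example):

```ini
# $ESP/loader/entries/linux-6.7.conf — hypothetical type 1 entry
title    My Linux Distro
version  6.7
linux    /vmlinuz-6.7
initrd   /initramfs-6.7.img
options  root=UUID=... rw quiet
```

Dropping this file into `$ESP/loader/entries/` is all it takes to add a menu item; deleting it removes the item again, with no scripts to regenerate.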
I mean, there are some configuration options, but the design is to just work and not require configuration. It should be just one binary you drop in, and then you have this other directory where you drop in the menu entries, and that's supposed to be it. Of course, you can configure some things in EFI variables and there is also a configuration file, but that's just for the nerds, and it's not supposed to be the default. It also has a nice functionality that, besides looking at these directories for boot menu items, it is actually capable of finding Windows installations automatically, and macOS, which is kind of nice because you don't have to configure that either. From the OS you don't need to do anything: sd-boot, when it boots up, just looks: oh, is there also a Windows installation? And adds it right away. It's really nice because it's robust, and it also has the benefit that if you add Windows after you install Linux, it will just show up. It also has APIs to user space, which I think is very important. For us, the boot loader world and the user space world are not distinct; they are closely intertwined, for various reasons. For example, because user space adds and manages the boot menu entries; because from user space you generally want to be able to select what's going to be booted next; because there are things like automatic boot assessment, where you figure out: did this boot actually work? If it worked, boot it from then on; if it didn't work, you try a couple of times, give up, and revert to the previous thing. This always requires communication between the boot loader and the operating system. So we defined this generically, and that's actually another spec, with EFI variables and things like that. We said: this is how the boot loader and user space can communicate and can send each other commands, basically. It also does early boot random seed stuff.
This is because, traditionally, particularly in VM environments, there was no RDRAND, no virtio-rng, and then Linux really didn't like it: you didn't have any entropy in your VM, and then certain things just hung, and that's super annoying. So we took a bit of inspiration from something that FreeBSD did, which is an early boot random seed. Basically, you have a random seed that is stored in the ESP. You can update it from user space, and it is updated from user space. After we did this with Jason Donenfeld, who is also the maintainer of the Linux kernel RNG, and reworked a couple of things, we are kind of confident nowadays that it's really good, actually. And the good thing is it works everywhere, at least everywhere where you have EFI, and it makes sure that from the earliest moment on you have really good entropy, in addition to whatever the hardware might give you. It has automatic enrollment of Secure Boot keys, which I think is actually kind of nice. It implements this TOFU concept for Secure Boot enrollment. So if you want to change your certificates, which I think people should do, particularly in virtualized environments, then you can just add the keys to the ESP, and then on first boot-up, when we are in setup mode, we'll just enroll the whole thing and then be locked down. So you have trust on first use: the first time you boot up, nothing is enrolled, nothing is trusted; that's the moment where everything is trusted. You add the keys, and from that point on the system is locked down. It also has this thing where, again with a drop-in dir, you can load additional drivers; this mostly exists so that people who really want to can make the ESP one of the weird file systems, or something like that. Yeah, I already briefly mentioned that automatic boot assessment exists, which is the infrastructure where we count boots.
We count how often we have booted something, and then user space can report back whether that actually worked, and then you get this kind of robustness thing going. So much about systemd-boot. bootctl is one part of the user-space side of things. bootctl is a command-line tool for installing systemd-boot; that's kind of its primary job, but it can do a couple of other things as well. It's the user-space side. You can tell it to boot into a specific menu entry on the next boot-up. You can list the menu entries. You can update the random seed, and a couple of other things. We hooked it up so that it actually runs automatically on boot, for example, to update the bootloader; it always will do this, to make sure that the copy of the bootloader that is in /usr is instantly copied as well. So if for some reason the package manager, or whoever updated systemd, forgot this, it's always kept up to date. The focus is really that the bootloader is always up to date. It also refreshes the random seed, by the way, from the Linux entropy pool, so that there's a good chance that the random seed is as good as it could possibly be. So much about bootctl. Next thing, systemd-stub. systemd-stub is also a UEFI binary. systemd-stub is basically a little UEFI binary that you glue in front of a Linux kernel and an initrd, and that runs in UEFI mode. It does a couple of things before it transitions into the actual kernel. Why do we have this? It does a couple of things; for example, it measures the payload of what it's going to start. Now you might wonder: if it's a UEFI binary that Secure Boot verifies and things like that, why does it need to measure, since the firmware already measures all Secure Boot binaries? Very good question, if you ask that.
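The boot-counting half of automatic boot assessment mentioned above is, as I understand it, encoded directly in the entry filename: a suffix like `+3` means three tries left, `+2-1` means two left, one already made. A sketch of parsing that convention (the filename scheme follows systemd's boot-assessment documentation; treat the details as illustrative):

```python
import re

# Sketch of the filename convention systemd's automatic boot assessment
# uses for counting: "entry+LEFT-DONE.conf" encodes how many boot tries
# are left and how many were already made; "entry+LEFT.conf" means none
# made yet; a plain "entry.conf" is not being counted.
PATTERN = re.compile(r"^(?P<name>.+?)(\+(?P<left>\d+)(-(?P<done>\d+))?)?\.conf$")

def parse_counters(filename: str):
    """Return (entry name, tries left or None, tries done)."""
    m = PATTERN.match(filename)
    if m is None:
        raise ValueError(f"not an entry filename: {filename}")
    left = m.group("left")
    done = m.group("done")
    return (m.group("name"),
            int(left) if left is not None else None,
            int(done) if done is not None else 0)

print(parse_counters("fedora+3.conf"))    # -> ('fedora', 3, 0)
print(parse_counters("fedora+2-1.conf"))  # -> ('fedora', 2, 1)
```

Keeping the counters in the filename means the bootloader can update them with a single atomic rename, with no file format to parse at boot time.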
The reason we do this is because the measurements that the firmware does go into PCR 9, I think, and there's a lot of stuff in there, and that basically means it's hard to predict, because there's stuff that is controlled by the firmware and stuff that comes from the OS, and you cannot bind security to a PCR that has sources you cannot really control. At least you cannot do it in a predictable way from the OS point of view, like figure it out ahead of time: you cannot predict it on, basically, the Fedora build systems when you build Fedora. But if we do the measurement of the payload of the UKI separately, we can do that in a separate PCR, and then we can predict it, because in that PCR there's only going to be the stuff that the OS vendor controls, and not also the firmware stuff; the firmware stuff is then covered by something else. A UKI is what this becomes when you use systemd-stub: the combination of systemd-stub plus a kernel plus an initrd plus a kernel command line plus all these other kinds of things. That's what we call a UKI, a unified kernel image.
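The predictability argument above comes down to how PCRs work: an "extend" is just a hash chain, so if you control every input, you can compute the final value on the build system before the machine ever boots. A minimal sketch (digest inputs are stand-ins, not real measurements):

```python
import hashlib

# Sketch of why PCR values are predictable when you control all inputs:
# a TPM PCR "extend" is PCR' = SHA256(PCR || digest), starting from zeros.
# If the OS vendor knows the exact digests of the stub, kernel, initrd and
# command line ahead of time, the final PCR can be computed on the build
# system and a policy bound to it before shipping.

def pcr_extend(pcr: bytes, digest: bytes) -> bytes:
    return hashlib.sha256(pcr + digest).digest()

def predict_pcr(parts: list[bytes]) -> bytes:
    pcr = bytes(32)  # PCRs start out as all zeros
    for part in parts:
        pcr = pcr_extend(pcr, hashlib.sha256(part).digest())
    return pcr

build_time = predict_pcr([b"kernel", b"initrd", b"cmdline"])
boot_time  = predict_pcr([b"kernel", b"initrd", b"cmdline"])
assert build_time == boot_time  # same inputs, same PCR: policy is precomputable
```

This is also why mixing firmware-controlled measurements into the same PCR breaks prediction: one input you cannot know ahead of time poisons the whole chain.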
Yeah, systemd-stub supports a couple of sidecars. This UKI model that we want to push distributions towards, where you unify everything into one image that you can sign as one, measure as one, and update as one, comes with inherent problems. For example, the initrd that you build into it: we expect that vendors will build those on their build systems, so they're always going to be exactly the same on every installation, which is great for many reasons but also horrible for others, because depending on the machine you will need large drivers and large firmware. The Nvidia driver, for example, comes with multiple hundred megabytes of firmware; if you always built that into all the UKIs that you as a generic distro vendor ship to your people, then yeah, this would be a really, really large Secure Boot binary. As it turns out, because of all the measurements, booting really large Secure Boot binaries works, but it's kind of slow. There's also this inherent problem that in this model, where UKIs are built on OS vendor build systems, the question is open how you parameterize them. On a simple laptop you do not need parameterization, it can figure out everything on its own, but an initrd is supposed to be generic, right? You have these installations that want additional parameters; they want to configure, I don't know, additional things like root passwords, so that you can log into the initrd, or a boot iSCSI device that you actually want to boot from. I mean, there's a reason why the kernel command line exists; people want to be able to do this in certain setups. On a laptop, as mentioned, that should not necessarily be necessary, but the more you go to the server side, the more they all want to do this. So we came up with a couple of ways how you can have sidecars, so that even though we push everything to the UKI model, where you have a single thing that is
self-contained and has everything, you can put sidecars next to it that configure individual things. There's one concept we call system credentials; I went into this in more detail in my earlier talk, but let's just summarize it. systemd credentials, the *.cred stuff, are basically short little bits of information, like cryptographic keys and passwords and things like that, that you need to operate. They are individual bits, and they are encrypted and locked to the TPM, so that you can actually put them in an untrusted environment, for example the ESP, where there's no implicit trust, and you have to authenticate them before you use them. There's another concept we call add-ons, EFI add-ons, which are basically the same idea as UKIs: you make a PE binary that you can sign as one and measure as one; however, you leave out the Linux kernel, the initrd, and all that kind of stuff, and just keep the kernel command line that you would otherwise add to the UKI. So you basically have something that looks like a binary but doesn't contain any code; however, you can authenticate it via the usual Secure Boot and shim APIs as if it were a binary, because UEFI just cares that it's a PE thing. These add-ons, as we call them, are our way of allowing people to extend the kernel command line, because when a UKI is booted and systemd-stub takes over, it will look for these sidecar files, find them, add them to the kernel command line, and boot on. And it's all fully trusted, because these things need to be authenticated the same way as everything else, via the shim/Secure Boot stuff. I already mentioned that systemd-stub also does measurements of the content, so that we get this isolated out, so that we have one PCR that only contains the OS stuff, separate from the firmware stuff. This means duplicate measurement, but that's fine, at least I think it's fine. Something else it does: it can
read additional kernel command line options from SMBIOS Type 11. I'm in the bootloader room, so I hope you know what that is. SMBIOS, as you probably all know, is this descriptive thing that the firmware passes to the OS, and there's one object type you can add, Type 11, which is wonderful because it's just called vendor strings and you can put anything in there that you want. Various virtualizers, QEMU for example, allow you to set that directly from the QEMU command line, and yeah, we use that also to extend the kernel command line. So you can just, on the QEMU command line, set a string that is implicitly appended to the kernel command line that is eventually booted. We kind of want to push people towards the model where they use this more often; it's actually an awesome thing, and I'm trying to push all the cloud vendors to adopt this as a generic way to provision data into VMs. But anyway, other topic. Another component is ukify. It's basically a Python script that helps you glue together a UKI. It will take systemd-stub, a kernel, and an initrd, sign them as one, also do the TPM predictions of what the PCRs will look like when it's booted, sign all that for Secure Boot, and give you one EFI binary that you can just drop into the ESP, and it boots up and everything's secure and wonderful. Then one other tool, systemd-measure. Much of this, all of what I'm talking about here, is actually part of systemd, because I'm the systemd guy. systemd-measure is a tool you probably don't have to interface with anymore, because ukify does that behind your back for you; it's the actual engine that predicts the PCRs that the UKI will result in if booted. I just wanted to mention that it exists. And there's another tool called kernel-install, for the traditional distributions, so that they can ship
inside of a Debian package or RPM a kernel, and then this tool, which is plugin-based, will copy the kernel into the ESP and potentially build the UKI at that moment. Because we want to cover a couple of different models: one model where the UKI is built on the build servers of the OS vendor, and another model, more for the, let's say, democratic Debian-style distributions, where they can do this locally, so that they can use their own keys. So yeah, kernel-install is the infrastructure to make this happen. It has really nice full UKI support; for example, if you want to sign your own stuff, you can trivially do this, because you can just use that, drop your keys into /etc, and then it happens magically. There's something... I don't have that much time anymore, should we switch to questions? Okay, this is one of my last slides anyway. systemd-pcrlock is one of the most recent things. It's a more complete prediction engine. I already mentioned the systemd-measure tool, which is able to predict the PCR measurements that a specific UKI will result in; systemd-pcrlock is supposed to cover all the other PCRs, the ones that are firmware stuff and things like that. All the other operating systems generally have this: Windows, Chrome OS, Android. They nowadays all have these predictions, well, depending on whether they actually care about TPMs or have some other secure-enclave-like thing; it's all a little bit different there. But the ones that care about TPMs generally have this prediction engine where they look at all the different things that happened during boot, analyze the UEFI event log, and try to calculate a TPM policy to lock disk secrets to. Our version of the tool is called systemd-pcrlock. It's supposed to be modular, so you again have drop-in directories where different components of the OS that will show up in the boot path, like the UKI, the bootloader, shim, and
things like that, plus components that are not even necessarily under the OS's control but are firmware stuff, can be described with little JSON fragments that just state the measurements expected for each of these components. There is a concept of alternatives, because usually you don't want to lock your secrets to exactly one kernel or one bootloader version; you want to update them, and if that update fails, you want to be able to go back, and things like that. So usually for each component you want alternatives, and that's also very well supported. systemd-pcrlock takes all this information, explodes what all the PCR values could be in the end, and then generates a TPM policy out of this that it stores in a TPM NV index, which our disk encryption stuff can then reference as an access policy. Long story short, this basically locks down the OS against the firmware versions, with all the measurements the firmware does that are not necessarily predictable for the OS, because, yeah, the firmware people suck. There's also support, of course, for the case where we cannot predict firmware measurements; we have some logic there to deal with that. So if you do the combination of all of this, then you get a super secure system and everything's great. My recommendation is: do this. But these components are relatively independent of each other, and as things happen, different distributions started adopting different parts of this earlier. For example, SUSE nowadays has adopted systemd-boot already, but RHEL, for the confidential computing stuff, has adopted systemd-stub, and they all pick different parts of this. Okay, my time is over, so this is my summary here. If you use it all in combination, everything works great, but you can pick what you want; you don't have to pick anything at all if you don't want to. But if you use it in combination, you get this full boot chain stuff, and everything's secure and relatively robust, because all the update cycles are around
individual files, you have ways to parameterize it and extend it, and yeah, there are a couple more slides, but we don't have to cover them. Let's move to questions; we have five minutes for questions. So the question was whether the systemd-stub stuff works outside of a UEFI environment. The answer is no; it uses UEFI APIs and it's just UEFI. All of what I was talking about here is more or less modeled after UEFI; systemd-boot and systemd-stub are absolutely UEFI-only. But the further you go up the stack, with kernel-install for example, that has nothing to do with UEFI, unless you actually use the UEFI-specific parts. So my suggestion would be: just adopt UEFI and avoid all this mess. I don't know, I think everything has problems; UEFI has some. I get this all the time, this thing like: oh, we have to stick to GRUB because it supports all the non-UEFI world. And I say, sure. My recommendation would always be: if you look at this stuff, there are certain philosophical ideas built into it, right? You have a drop-in directory, you put in Type #1 entries; this kind of stuff is entirely generic. Type #2 is not generic, but Type #1 is totally generic. My recommendation for that, by the way, is: just use UKIs as they are. They are a PE wrapper; it's a really simple format, actually, PE. It's just an envelope that carries sections for you. I think GRUB can now parse that too, as far as I can see. So if GRUB can parse it, your bootloader should have no problem at all parsing it, and then you suddenly have a universal format, and you boot Windows-style PE binaries even though it's not Windows. I think it's the way to go: model it after UEFI. UEFI has its warts, everything has its warts, but I think it's way better than the stuff that came before it. So my
recommendation would always be: if you can't do this stuff, at least consider the ideas behind it a bit, like drop-in directories and things like single-file updates, and then try to model yours after that. The more you can take over, the easier your life will be, because this will probably end up in all the distributions, and the fewer differences there are, the easier it is. I think even GRUB supports Type #1 at least, or Type #2, or something like this. So we wrote the specs as generic things on purpose: there's a spec about UKIs, there's the Boot Loader Specification, there's a spec about the bootloader interface, about how we do this, because it was always clear to us that not everybody is going to do UEFI. We did that as a nice service to the community, but other people have to figure out if they actually want to adopt it. It took them long enough to not adopt it so far, but now things are changing. If you want to not do UEFI, my recommendation would be: look at the specs, and ask us to add stuff to them. It's a spec; there's a GitHub issue thing, and if you need something else, then file an issue, and if it makes any sense at all, we have no problem with adding it to the specs. Very short, yeah, very short. So the question was whether all these projects are under the systemd umbrella. That depends, right? We created a group which we call the UAPI group, where we try to standardize these things. This is, to a large degree, admittedly, systemd-adjacent people, people who adopted a way of thinking like we do, but there's nothing systemd-specific in it, and the specifications are on purpose written independently; the word systemd doesn't show up in them, or it might show up in places, but that's not the point of it. So
the code, that's a different story. This is in the systemd tree, developed like Unix was developed, I guess: you have this one Git repository and you have all these components in it. The fact that it's in there doesn't mean you have to use them; you can mix and match them. As mentioned, different distributions pick different things up: openSUSE added sd-boot first and not sd-stub, and then RHEL, for the confidential computing stuff, took sd-stub but is not interested in sd-boot, because they didn't want a bootloader at all. So yeah, this is how it should be: we give you the buffet, pick what you want, don't pick what you don't want, I don't care. Ultimately it's very Linux-focused, very UEFI-focused, very systemd-focused, but look at the specs, maybe you can reuse something. I think even the UKI stuff: how the firmware jumps into your UKI is not defined by us, that is simply an artifact of the fact that it's a PE file. So you can find any other way to jump into it; you can even look for the Linux kernel in it. That's what GRUB does: it looks for the Linux PE section in it, ignores all the other stuff, and, if it wants to, then does the classic boot protocol that does not do anything like this. Okay, but anyway, yeah, thank you.
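The "explode the alternatives" step that systemd-pcrlock was described as doing earlier in the talk can be sketched as a cartesian product over per-component alternatives, with an extend chain computed for each combination. Component digests here are invented for illustration; the real tool works from measurement logs and JSON fragments:

```python
import hashlib
from itertools import product

# Sketch of the "alternatives" idea behind systemd-pcrlock: each boot
# component may have several acceptable versions, so the set of acceptable
# final PCR values is the extend chain over every combination of them.
# The component digests are invented for illustration.

def extend_chain(parts) -> bytes:
    """Compute the final PCR value for one ordered sequence of components."""
    pcr = bytes(32)
    for part in parts:
        pcr = hashlib.sha256(pcr + hashlib.sha256(part).digest()).digest()
    return pcr

components = [
    [b"shim-15.7", b"shim-15.8"],      # two acceptable shim versions
    [b"sd-boot-254", b"sd-boot-255"],  # two acceptable bootloader versions
    [b"uki-6.7.4"],                    # one acceptable UKI
]
acceptable = {extend_chain(combo) for combo in product(*components)}
print(len(acceptable))  # 2 * 2 * 1 = 4 PCR values the policy must allow
```

Allowing several final PCR values per component is what makes updates survivable: the new and the old version are both acceptable until the old one is retired from the policy.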
Kernel command line to configure userspace considered harmful
Okay, thank you. Good morning, good afternoon. Thank you for having me. My name is Luca. By day I work as a software engineer in the Linux systems group at Microsoft, in the Azure organization, where I manage the Azure Linux OS that runs on the infrastructure there. By night I'm involved in various open source projects: I'm a systemd maintainer, a Debian developer, a DPDK LTS maintainer, and a bunch of other stuff that I consistently forget about. Now, if you read the title of this talk, you might think: hang on, was that really intended to be that provocative? And the answer is yes, yes it was. This is my yearly talk to make new friends. But of course I mean it in a positive way. I want to provoke some thoughts and discussions, and see what we can do about this thing that I consider a problem and that I think we are in a good place to start fixing. But first, even though everybody here lives and breathes Secure Boot, some background. If you work on bootloaders or boot systems, you already know all of this, but just one slide. In the beginning we had BIOS, and everything was great; the security model was "lol, as if". In the 2000s we got UEFI: Intel, Microsoft, and a bunch of other people got together and created this new protocol for firmware. And it actually has a security model, which is nice. Now, it gets a lot of mud thrown at it. Every time there's a bug in the news, like the LogoFAIL stuff, people go: oh, why do we need Secure Boot, it's always broken anyway. Well, having a security model doesn't mean that everything is perfect or never breaks. It's software. It runs on computers. Of course it breaks. The point is that we have a process to deal with it and an actual security model to follow. The way it works is that there's a chain of trust that starts in hardware, for example Intel Boot Guard. So the hardware verifies the firmware, the firmware verifies the bootloader, and the bootloader verifies the kernel.
The set of keys and certificates used is stored in the firmware. I won't go into details on that because it's not too important here. This, in a nutshell, is generally called Secure Boot. In the 2010s, thanks to the work of a lot of people, Linux finally joined the party. We were shut out of that ecosystem for a while, and by default distributions couldn't boot on new hardware; you had to go and fiddle with the BIOS to disable Secure Boot. This changed in the 2010s. So we have shim, GRUB 2, and the kernel lockdown stack, and distributions can boot again by default. They are signed with the UEFI third-party CA: you get your shim signed by Microsoft, and then you sign your second-stage bootloader, like GRUB or systemd-boot, with your distribution key. And then we have this patch set in the kernel, called securelevel in the beginning when it was out of tree, later merged as the lockdown LSM, that basically tries to protect the kernel, the firmware, and the hardware before ExitBootServices is called. ExitBootServices is an API call in the UEFI interface, and when it happens, a bunch of things get locked down: you cannot change secure variables anymore, the firmware goes away, and a bunch of other things. It's very important to protect the system before that point. So this is what this ecosystem tries to protect. securelevel, or lockdown, also tries to separate UID 0 from ring 0. The theory is that if you are root, you shouldn't be able to change the hardware or the kernel memory outside of what should be allowed. This is not perfect, but it went a very long way and it fixes a lot of problems. It's not perfect, of course; it's software, it's never perfect. But yes, the idea is that we have this boundary between UID 0 and ring 0. And this has been working for 10 years or so. It's great. We moved from having no trust whatsoever to having trust up until the point where we start user space. And that's great.
But other operating systems are way ahead. macOS is way ahead. Android is way ahead. Windows is way ahead. We do nothing for user space so far. But in the past couple of years, we've been talking a lot about how to fix this, and things are starting to happen. So this is the next level: the unified kernel image. And by the way, Lennart had a talk this morning about UKIs, and I think he might have mentioned them in the previous talk as well; I could not get in because the room was full. But we've been talking about this stuff for a while, and there were at least three or four talks covering these things, so you might have already heard about these concepts; we'll repeat them again in a different context here. What we are trying to do is extend that level of trust, security, and authentication to user space. For example, the initrd right now, on any generic distribution, just sits on the boot partition, on the ESP, and anybody with write access can, offline or even online, inject anything. Initrds are just built locally; they're unverified. You could add a backdoor to the prompt that asks for your LUKS encryption password, and you wouldn't have any idea, because it's completely unchecked, untrusted, and unverified. Unified kernel images try to fix this: the initrd is part of the PE binary that gets signed, so that shim or the firmware verifies it before loading, so that we can extend the chain of trust a little bit further into user space, at least into the first part of user space, the initrd. But that's not enough. We want to go further, because once you pivot from the initrd to your root file system, well, that also is unverified. Now, there's work ongoing on IPE, the Integrity Policy Enforcement LSM. It's a new LSM that basically allows you to write a policy that says: any binary that runs on my system in user space must come from a DM-Verity volume that is signed by a key trusted by the kernel.
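The reason DM-Verity can verify a whole block device while only signing one value is the Merkle-tree construction: hash the data blocks, hash the hashes up to a single root, and sign only the root. A toy two-level version of that idea (real dm-verity uses a multi-level tree over fixed-size blocks; this is just the concept):

```python
import hashlib

# Sketch of the Merkle-tree idea behind DM-Verity: hash each data block,
# then hash the concatenated block hashes into a single root, and sign only
# the root. Any one block can then be checked lazily as it is read, and any
# bit flip anywhere changes the root. Real dm-verity uses a multi-level
# tree; this flat two-level version only illustrates the concept.

BLOCK = 4096

def verity_root(image: bytes) -> bytes:
    leaves = [hashlib.sha256(image[i:i + BLOCK]).digest()
              for i in range(0, len(image), BLOCK)]
    return hashlib.sha256(b"".join(leaves)).digest()

image = b"\x00" * BLOCK * 4          # a stand-in 4-block device image
root = verity_root(image)            # this is what gets signed
tampered = b"\x01" + image[1:]
assert verity_root(tampered) != root  # any modification changes the signed root
```

This is why it works online: verification happens block by block as reads happen, rather than hashing the whole device up front.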
DM-Verity is a kernel mechanism to do online verification of block devices as they are read. It's a very, very nice interface that has been available in the kernel for 10 years or so, and with IPE we can use it to extend the chain of trust into the full user space. So now all the code that runs is fully verified, with a chain of trust that goes back to the hardware. With discoverable disk images we can also protect further payloads: containers, nspawn containers, portable services, and other things that are attached to the OS. If you're running a read-only system, you need some way to attach new applications, of course, and with DDIs you can further extend the chain of trust in the same way for those payloads as well. So we put all of this together: the shim and lockdown stuff for the boot process, then UKIs for the initrd, and IPE and DDIs for user space. We have a very nice system that chains back to the hardware and implements a full root of trust. And that's all very nice, except for the kernel command line. This is just stored as a plain text file in systemd-boot Type #1 BLS entries, the type of boot entries supported by systemd-boot, or in GRUB as well. It's just a plain text file: if you have root access, you can write whatever you want there, nothing checks it, and it just gets picked up and used. It can also be edited on the fly if you have access to the keyboard, which is probably fine on a laptop, because if you have access to the laptop, you're probably the owner. But if you're on a server, or a VM, or a confidential VM, that's kind of bad, especially for the confidential computing case, because the serial console is just a device owned by the hypervisor, which is outside your TCB. So why is this a problem? Because it has become kind of a kitchen sink. Just for the kernel alone, there's that document there, which is very nice and documents a lot of the options available.
It's 7,000 lines long, and it itself says this is not a complete list. So we don't even have one list that says: okay, this is everything you can do with this untrusted, unverified interface to your machine, which is not ideal. Also, I checked, and I'm not a kernel developer, but as far as I can see, the very first parsing of the kernel command line happens in the kernel's EFI stub, before ExitBootServices. Remember, I said before that ExitBootServices is a very important point in the boot process; before it you want to protect your system and be really careful about what is allowed to run, execute, and change the flow of execution. Now, you can use the kernel command line to configure the kernel to do things like disable SELinux, disable IPE, which I talked about a moment ago. You can disable all these security components using the kernel command line. And it's not just the kernel that you configure. It's called the kernel command line, but it's just a command line: you configure everything and anything in user space with it too. Everybody sees it by default; it's in /proc/cmdline, it's right there. Everybody has their own custom-written parsers to read it, and it's used for absolutely everything. And again, this is bad for confidential computing; the serial console is outside of the TCB. So this is a difficult problem. Now, of course, there are historical reasons for this. It's super convenient. It's amazing. You have a problem, you just press 'e' to edit, add 'debug', and then you can get some debug logs if your machine doesn't boot. That is super useful. But I think we're getting close to the point where we need to decide whether we want to allow this always, or only in some cases, or disable it completely in other cases, because it is the last bit missing, as far as I can tell, in the security story of the boot process on Linux.
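The "everybody has their own parser" point is easy to see: the format is so loose that a parser is a few lines, and every consumer writes a slightly different one. A minimal sketch (real parsers also handle quoting and module-prefixed options like `module.option=`; this one only splits on whitespace and the first `=`):

```python
# Sketch of a naive kernel command line parser, of the kind every userspace
# component ends up writing for itself. It ignores quoting and other corner
# cases that real parsers must handle; it only shows how security-relevant
# flags like lockdown=off would be consumed from an unverified string.

def parse_cmdline(cmdline: str) -> dict:
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else True  # bare words act as boolean flags
    return options

opts = parse_cmdline("root=UUID=1234-abcd rw lockdown=off quiet")
print(opts["lockdown"])  # -> off
```

Nothing in this path authenticates anything: whatever string reaches the parser is obeyed, which is exactly the problem the talk is describing.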
So for systemd-boot, we have decided to stop allowing editing the command line and supplying untrusted sources of input to it when you boot UKIs. You cannot do that. And we made a lot of friends with that decision, I can tell you. The problem is, of course, that the flexibility is gone. Can we get it back? What are the use cases? One of the main ones is root file system discovery. Traditionally you would do root=/dev/sda1 or whatever; nowadays you're probably using a UUID to identify the disk, so that it doesn't switch partitions and you don't lose booting. We have something called the Discoverable Partitions Specification that is supported by our tools. Basically, systemd-boot tells the initrd which disk it booted from, and the root file system is found automatically based on the type UUID set on the partition. So this use case is very well covered now. I already mentioned UKIs, which have been discussed very frequently at FOSDEM, so I'll go very quickly through this. You can add a command line to the UKI when you build it; it's very easy with ukify, our tool to build UKIs. But of course, it's one entry, a fixed entry. The UKI is meant to be shipped by the OS vendor, and that is very inflexible, of course, because the OS vendor doesn't know what you need on your OS to make it work. Now, we have a future plan, and we'll get to it this year: you'll be able to specify multiple options. For example, your OS vendor will be able to say: I have my default kernel command line, then one that has debug, and another one that has factory reset, so that you have multiple options, and in your boot menu from systemd-boot you can select a non-default one if you so wish. This is very high on the to-do list, and I'll get Lennart to implement it very soon, but he hasn't done that yet. The other thing we have: systemd-stub is the small UEFI stub that is embedded in the UKI, the first bit that gets loaded. We added this thing this year called add-ons.
Again, they can be built with ukify, and what they are is just PE binaries, so they are signed, and the firmware verifies them using the Secure Boot certificates before loading them, but they don't contain code, just a kernel command line configuration. So you can use this, and systemd-stub will automatically load them if it finds them, again through the firmware, so verified, signed, and trusted, and then you can use that to extend the kernel command line that was in the UKI and is fixed. This is really meant for platform owners. For example, if you want to set crashkernel= to some amount of memory, that's probably the same across your whole fleet, at least for the same devices, so you can use the same add-on everywhere to set it. Again, we want to add selection; right now every add-on found will be used. We want to add a menu to let you select which one you want to boot, in case that is needed, but we don't have that yet; it's again on the to-do list. Next thing, we have extension images. These can be used to extend the initrd. You can drop them in the ESP; they are DM-Verity images, so again they get verified, and given that the initrd itself is fixed, we can use this to extend the initrd with additional code or configuration. They can be used for both configuration, overlaid on /etc, or code, overlaid on /usr. Again, we don't have any way to select which ones are used; we just pick up every extension image that we find in the ESP. You can also embed them in the initrd if you want and extend the root file system with them, or download them at runtime to extend the root file system when it's read-only. Finally, this is my favorite one, and I think this should give us enough flexibility that we can start to talk about actually disabling this stuff by default: credentials. Credentials are a very simple concept that we added to systemd some years ago. They are just key-value pairs.
The key difference is that they are scoped by default. They are only visible to user space, and only to services that opt into them by key. So in your service you say LoadCredential= with a key; if a credential with that name is there, it will be loaded. Everybody else will not see the content, because they can be encrypted. And, I think we have that already, if not it will be ready very soon: you can encrypt them ahead of time if you know the TPM's public certificate for the SRK, so you can encrypt them ahead of time for any machine. They are decrypted only when the service starts and reads them, and only in the namespaced view of the service. So they are fully isolated: nothing else outside of it can see the credential. And you can drop them in the ESP, again in per-image or global directories. Again, we don't have selection: everything that is found in those locations is picked up. And we are starting to add support to every systemd component, and outside of it, to use credentials to configure things that used to be configured with the kernel command line. Your networking can be configured with a credential. Your users, your autologin, your root password, and a bunch of other things that you need at boot, literally hundreds of things, can be configured using credentials. I have a pull request open, hopefully merged as soon as I figure out the TPM measurement story, and we will also allow you to create new credentials from the boot menu. Like when you have systemd-boot with a Type #1 entry or GRUB2, where you can edit the kernel command line, you'll be able, in the boot menu, to just type a credential name and then a value. It will be picked up by systemd and added to the initrd so that it can be used. So I think this is very powerful. It's something that should give us all the flexibility, or most of the flexibility, that we need. Maybe. Is that enough? We have GPT auto-discovery for your root file system, UKIs, addons, extension images, credentials.
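Consuming a credential from a service might look roughly like this; a sketch where the unit, file paths, and credential names are hypothetical:

```ini
# /etc/systemd/system/myapp.service  (names are illustrative)
[Service]
ExecStart=/usr/bin/myapp
# A plain credential, loaded from a file on disk:
LoadCredential=db_password:/etc/myapp/db_password
# One sealed ahead of time (e.g. against the TPM):
LoadCredentialEncrypted=api_token:/etc/credstore.encrypted/api_token
# The service reads the decrypted value at
# $CREDENTIALS_DIRECTORY/db_password inside its own namespace.
```

Encrypting ahead of time is done with something like `systemd-creds encrypt --name=api_token plaintext.txt api_token`; only the named service sees the decrypted result, and only while it runs.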
Is this enough to cover the use cases that we need in the 90%, or maybe 95%? Of course, there will also be the case where you have to go put your hands on a machine that is completely broken. What you do in that case is you disable security. You break glass. You take your node offline; if it's a server, you move production workloads away, and you debug it. You disable security and then you can do whatever you want. Let's say that's 1% of the cases. Are we there yet for the 95%? That is an open question. I hope we can have some discussions about this. There is also a Secure Boot 2.0 coming; we are starting to think about it. Should it say something about what is allowed or not allowed to be done to the kernel command line before services are executed? It's kind of an important topic. Should it be in the specification or should it not? Now, as reviewers and maintainers of user-space components, I think it is time that we start to say: hey, if you're adding a new option via the kernel command line for my user-space program, please at least also add a credential for it, so that we can use both at the very least. Is this a fool's errand? Most likely, but we still try, because we are trying to push the envelope a bit forward every time, and hopefully we'll get somewhere. So I think that's it, and we have three minutes for questions. Oh, thank you. I was fast. Questions? Please. About selecting the root partition: there was this Discoverable Partitions Specification. How do you handle multiple installations of the same distribution, like three Fedora installations on the same disk? So the way this works is that systemd... sorry, let me repeat the question for the microphone. If you have multiple partitions, multiple distributions, on the same disk, how do you find out automatically which one the root is? The basic mechanism is that systemd-boot tells the initrd, via an EFI runtime variable, which disk the ESP that was used to find systemd-boot or the UKI was on.
So you get the disk, and then the auto-generator takes that disk and only looks at that. So if you are installing on multiple disks, we select the right disk like that. If you have multiple root partitions on the same disk as well, then I have no idea how we do that. I think we recommend using different volumes for that. I think there's some way to do it; I don't remember, to be honest. I need to look at the generator. There you go: use a different UKI for the different root FS, basically. Yeah. Or again, with credentials. Do we support credentials for the auto-generator yet? We should add that. That's probably a good way; you could configure it with a credential. But this is made so that by default you find the right thing for the simplest case, and then of course you need configuration for the complex ones. If you use the same disk for multiple root file systems, well, then you need to tell it which one to pick, and that's one way. And I think it is configurable, and we should have credential support for that, so you drop in a credential and then you decide which one to use. Yes. But it's a good question. That's a Btrfs question: how to deal with booting from a different subvolume on Btrfs? I have no clue. I don't use Btrfs, but the Meta people here do. Do you have any...? I don't remember. Right. So yeah, it's not supported right now in the specification, but there was a proposal, I think. Patches welcome, as usual. Yes. Anything else? Please. So the question was: can we use the auto-generator when we create the UKI? The answer is yes, because then you would use that kind of command line. But if you are generating UKIs locally... our idea is that the UKI is generated on a server somewhere by your vendor, so it wouldn't work in that case. You could create a credential when you install it, for example, to tell it to go and figure it out from that ID.
But yeah, if you do build UKIs locally, we have kernel-install plugins for that. It does work, and yes, you could do it that way. Yes. That could work if you're building locally. Yes. Sorry, that was about the UUID in the UKI itself, and again, yes, it can work if you're building it locally. Anything else? Yes. And no, it is not a workaround for broken EFI variables. We added this so that we could configure autologin in VMs; I think that was the first use case when we added this back then. But this is a way to, again, the main use case was to be able to have secrets that are encrypted against the TPM and are not visible by default. So the services don't have to implement encrypting and decrypting all that stuff themselves, because that is hairy, especially against the TPM. That was the main use case: to have these as sealed secrets that are only visible to the service, in its namespace, only when it runs. Normally a lot of the time you configure secrets via environment variables and things like that, and of course that is bad, because environment variables are inherited down the process tree, and you don't want your secrets to leak down to all your child processes. So this is one of the reasons we added the credential concept. Yes. So another question on credentials: what is the scope of credentials you load from the ESP? Does the whole initrd see them, or only part of it? So, sorry, yes: what is the scope of credentials loaded from the ESP? Does the whole initrd see them? Yes, if they opt in. Your initrd is trusted and verified and signed. So you build a configuration, you say services foo and bar can load this credential; service abc doesn't. So only foo and bar, which have opted in, will see it, and it will be decrypted for them. But yes, anything in the initrd can opt in, and I think we transition them across to the full OS as well.
So they will be available also for services running after the initrd-to-full-OS transition. Yes, credentials are awesome. You should check them out. By the way, the slides are online, and all these things are links to the actual documentation. Anything else? I think we have two minutes. I have a pretty dumb question, but let's say I want to put an 'ro' on the kernel command line. Would I do that with a credential? No, because that is for the... it depends. Is that for the kernel, or, oh, sorry, sorry. If you want to put 'ro' on the kernel command line, would you do that using a credential? It depends who's reading it. Is that for the kernel, or is that for user space in the initrd setting up your root file system? It depends on the case. If it is for your kernel, well, you probably want that in the UKI itself, because it's something you want your image to run with in that configuration state, so you put it in the UKI itself. If you want that only for certain cases, then maybe you can use addons, and only deploy those on the machines that use the same image but with a different configuration. So the answer is: yes, it depends. There are many ways to do it, and it depends who's reading it, what the use case is, and whether you want that to be the default or the non-default or whatever else. I think we have... okay. Thank you.
Reducing Costs and Improving Performance With Data Modeling in Postgres
Who's using Postgres in here? Yay! Thanks for being here on Sunday. Our next speaker, Charly, is going to talk about reducing costs and improving performance, those things out there, and other things as well. Good luck. Thank you. Thank you. Good evening. So yes, welcome. Today we're going to talk about how to reduce costs and improve performance doing really easy stuff, really easy things. My name is Charly Batista. The presentation is not about me, and this slide was made by ChatGPT, so we can discuss that later. Sorry. Okay. Okay. Good to go. Nice. So what are we going to talk about today? We'll have a bit of a review of what this talk is about. We're going to review some concepts of how the hardware works, try to understand a little bit what cache is and how Postgres stores data, and then the summary. And thank you guys, you're good to go, so I think that's it, see you next year. I'll be a bit fast because I have a lot of slides and not much time, so I'll try to keep moving. If you have questions at any time, just raise your hand, that's fine, just interrupt me. Or, if you'd like, you can also wait for the end. We'll try to answer as many questions as we can. So what is this talk about? This talk is not about going out there and modeling a business; it's about how to model the database. We'll try to understand a little bit how the underlying hardware works, how we can play nice with it, and how Postgres can play nice with it. We'll see some concepts about how a computer stores data and how Postgres stores data. It may get a little low-level, but I'll try to keep it as high-level as possible so we can all follow together. And the hope is that by the end of this talk you'll understand a little bit and save some money, especially those of you running in the cloud. You know that storage and these things cost a lot of money. Well, that said, let's start. We're going to do a quick review of the hardware.
I suppose most of you have seen this picture before. This is the memory hierarchy, how memory is divided on the hardware. If you look down here, we have the secondary storage. This is your hard drive: SSD, HDD, tape if anybody still uses that thing. As you can see, it's quite large, but it's slow and usually inefficient; the latency is high. As we go up toward the top, toward the CPU registers, they get really, really fast. That's where the magic happens, but they also get really, really expensive, right? So we want to do our best to always use them in a very efficient way. Things to understand about memory: memory is either volatile or non-volatile. That down here is non-volatile. That's where you save your data, and you should save your data there if you want it the next day, because if something happens, a power loss or whatever, everything that is up in volatile memory is going to be lost. Also, as I said, the bottom is cheaper, the top is higher-priced. Memory can also be accessed in basically three different ways: random access, direct access, and sequential access. We'll see that most of the time we're doing random access, especially in RAM; RAM is basically random by nature. We also always try, when we go to the hard drive, to do sequential access, both writes and reads, and we'll try to understand why. So if you look here, I have four CPU cores and I have the I/O controller. One thing you need to realize is that the CPU is not connected to your disk. There is no physical cable or pathway by which the CPU talks to your hard drive; it doesn't matter if it's SSD, HDD, tape, or whatever. It needs to go through the memory controller. The memory has direct physical access to the hard drive. So every time you need something, your CPU asks the memory, and the memory fetches that thing from the hard drive, and then it moves up all the way to the CPU.
Also, a very interesting thing that most developers do not realize: we don't write bytes to the hard drive. When your software opens a text file and you save your name, my name is Charly, five characters, five bytes? No, I still save one block, which on most systems is four kilobytes. For that very simple operation of five characters, I'm dealing with a four-kilobyte block on the operating system. And that goes for everything; the database does the same. So if we can start doing more work with those four-kilobyte blocks, things can go faster. That's one of the main ideas. Another thing you see here: HDDs are really slow, but a lot of companies and people still use them because they're quite inexpensive nowadays. But random I/O on them is terrible, it's slow, it's horrible. Every time you need to do random I/O there, it's horrible. And the problem is that it's a mechanical device. Most people believe the performance problem is the plate, the spinning plate. Actually, that part is fast. The problem is that on top of the spinning plate you have a little arm that needs to move back and forward, and that movement is really, really slow. So if you do random I/O and the arm needs to keep moving back and forward, that's going to be horrible for any application, and especially for a database. That's going to be really, really bad. On SSDs it's not that bad, but still, the performance of random I/O is not the same as sequential I/O, both writes and reads. Writes are closer, but still not the same. So this is a little bit of what I mean by sequential access: sequential access is when you write one block after another. And random is when you have that mess. Most college students, when you go into their bedroom, that's random access in there, so you can think of it like that. But yeah. What is this cache?
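The block-granularity point above is easy to see from the shell; a small sketch (the file path is mine, and the exact allocation depends on the filesystem):

```shell
# Logical size vs. allocated size: write 5 bytes, then compare what the
# application sees with what the filesystem actually allocated.
printf 'Charl' > /tmp/five-bytes.txt

# %s = logical size in bytes, %b = number of %B-byte blocks allocated
stat -c 'logical=%s bytes, allocated=%b blocks of %B bytes' /tmp/five-bytes.txt
# On a typical ext4 system this reports 5 logical bytes but a full
# 4 kB of allocation (8 blocks of 512 bytes).
```

The application wrote five characters, yet the filesystem operates in whole blocks, which is the same granularity mismatch the talk describes for the database.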
We're talking about improving performance, but what does cache have to do with performance and databases? Cache, in its very simplistic definition, I got it from Wikipedia, is a hardware or software component that stores data so you can access it faster in the future. We have many, many different levels of cache: network caches, hard drive caches, application caches, and most databases have their own application cache. And we also have the CPU cache, which is the one that really interests us today. And we have some definitions here. What is a cache hit? Anybody? Come on, guys. Exactly. Let's say, for example, I want to do a select and get Charly's information. The first time I select Charly's information, it goes from the disk up to memory and the CPU. But it stays there; the database doesn't throw it away, because the next time I do a select for Charly's information, it doesn't need to go to the disk: the information is already in memory, or, if we're lucky, in the CPU cache. Remember, at the top of the hierarchy we also have cache close to the CPU, which is really, really fast. That is a cache hit, and a cache miss is the opposite. If I do a select and that row has never been selected before, it needs to go to the hard drive, so that's a miss. It has to go all the way down, so that's going to be really slow, very inefficient. So the higher the hit ratio, more cache hits than cache misses, the better and the faster our application is. We always try to improve that metric; it's a very important one. We also have some writing policies: write-through and write-back.
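Postgres exposes its own buffer-cache counters, so you can check this hit ratio yourself; a sketch using the standard pg_statio_user_tables view (not a query shown in the talk):

```sql
-- Buffer-cache hit ratio across user tables; the closer to 1.0,
-- the more reads were served from Postgres' shared buffers rather
-- than the operating system / disk.
SELECT sum(heap_blks_hit)::float
       / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS hit_ratio
FROM pg_statio_user_tables;
```

The NULLIF guards against division by zero on a freshly started instance with no table reads yet.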
The write-through policy: when you send information to the cache, especially when saving data, say the database is saving data, it can either keep the information in the cache now and persist it later, or immediately save that information both in the cache and on the hard drive. Write-through saves the information immediately: it stays in the cache, but it is also persisted right away. So what's the problem with that? Latency. We increase latency when we use that policy. Is it all bad? No; some applications are fine with that, and applications that need higher reliability will implement that policy. So we have tradeoffs here. The other one is write-back: the information stays in the cache, and eventually, it's up to the CPU if it's a CPU cache, or up to the operating system or the application, to save that data back to the hard drive. Remember, everything that's up there, if we have a power loss, we might have a problem. So in this case we may gain a lot of performance, but we may lose reliability. We need to make that tradeoff. And then we have prefetch. CPUs are very smart, or at least they tend to be. There's this theory that if you access some information now, you'll probably access more information around that data. We call it spatial locality. So based on locality, if I have Maria, Charly, and John, and I select Charly, the probability that I'll need data from Maria and John is higher than for data further down the line. It's playing with probability. We also have temporal locality: information I just accessed has a higher chance of being accessed again soon, like in the next few minutes. So what the prefetcher does: when we ask for one block, the CPU says, oh, this guy's accessing this block, so he's probably going to need the next block. So the CPU prefetcher loads that next block in advance and puts it in cache.
So if I need that block, it's already in cache; the CPU doesn't need to go back and fetch that information again. That improves performance. That's awesome, right? Now comes the problem: caches are expensive. The closer they get to the CPU, the more expensive they get. If you're going to buy a laptop or a device or whatever and you look at the CPU, i9, i20, you're going to see those L1 cache and L2 cache numbers, and they're not in gigabytes or terabytes; they're usually in kilobytes. You get kilobytes of cache. What can you do with kilobytes nowadays? Almost nothing, right? Another concept is what we call the cache line. The cache line is literally the line the cache is divided into. The cache is divided into many lines, each line has a specific size, and it usually depends on the CPU word size. What's the CPU word size? If I have a 64-bit CPU, the word size is 64 bits, and the cache line will likely be based on 64 bits. Why "likely"? Because when we get to the CPU, we stop counting time in seconds or nanoseconds; now we count time in clock cycles. Inside the CPU there are those things called registers that are really, really fast: it takes only one clock cycle for the CPU to get information there. When we go to the L1 cache, it usually takes between three and seven clock cycles, depending on the CPU and the cache. So things start to slow down. People always tell you memory is really fast. Memory takes around 215 clock cycles; it can be 100 times slower than the CPU cache. So memory, from the CPU's point of view, is really, really, really slow. We don't want to go there. And can you imagine your hard drive? They didn't even put it here, because I don't know how many digits it would take to fit there. That's insanely slow. So that's why we always try to keep information up there. Remember that thing I said about the line size?
So we might have a problem: we want to fit everything inside that size and not waste it. Can you spot the problem here? I put a really simple piece of code here, and a tip: it's not the syntax or anything, it compiles, so that's not the problem. Anybody? Alignment. Alignment, that's exactly it. Most developers, at least a lot of them, believe the information will fit in memory just like this: we have the int, the boolean, the int, the boolean again. The problem is it won't. Why? Let me see if this thing works. Can you see? No, the pointer doesn't work. So you see, from 0 to 7 it has 8 little blocks. That's 64 bits there, 8 bytes or 64 bits. In this example that's the size of my cache line. Because the CPU can only fetch one word size at a time, the value that comes after the boolean, which is only one byte here, has to go to the next slot. And you see all that white stuff? We call it padding. It's waste: waste of money and waste of time. So in this example, if we go back, we have four big blocks. We could fit everything in two big blocks if we aligned them properly, so the CPU would have two blocks to fetch that information. But actually we're going to have four, double the time. And we only have four variables here, so we mess things up really, really quickly. Keep that in mind, because it's going to be really important for our discussion moving on. So yeah, how does Postgres organize the data? Postgres, like most databases, has its own file organization. These are the most common ones; we have B-trees, and I just listed them in no special order. It's not that one is used more or less, or that one is better than the others; that's not the point. Postgres uses what we call heap files. Heap files are really interesting, because they are very simple; if not the simplest, one of the simplest. How does a heap file work?
It's basically one spaghetti of information. You put one block after another. That's it. That's a heap file. There's no order, no guarantee of order, nothing special; you just have one block after another. So it's a very simplistic implementation, right? But it can also be very efficient, because remember the locality thing: if you just have one block after another and you prefetch the information, the access is sequential. A problem for indexes: sometimes people do not understand why Postgres does not pick that amazing, nice index they just created for their application. It was there, I have an index there, but the database doesn't pick it up. And a lot of people try to force the database to pick it up because they think the database got it wrong. Well, sometimes it did, and sometimes they realize it didn't. It happens. One of the issues is that the index is a B-tree, most of them are, and a B-tree by nature is random: the access is random, because you have one block here, another there, another over there. So you are changing the access pattern from sequential to random. Remember how expensive random I/O is, especially on the HDD with the spinning disk? Yeah, that would be horrible, even for a really efficient index. That's why the database says: nah, not today, maybe tomorrow, today I'm fine, I'll just do a full table scan and get everything. And more often than we sometimes think, it's faster. So, heap files in Postgres have some very interesting properties. Let me go back here... no, here. Well, they are eight kilobytes: the block in the heap file is eight kilobytes in size. And each file has a limit of one gigabyte. So does that mean we can only have one-gigabyte tables in Postgres? No: at the one-gigabyte limit it just creates another file, and another one, and another one.
If I have one terabyte of information, I'm going to have 1,024 files. So be mindful of that, because depending on the file system you use, some file systems do not play well with many, many files in the same folder. Right? Be mindful of that as well; it might be a problem. So, blocks are eight kilobytes in size. Keep that in mind too; it's going to be quite important as we move on. As I said, one row is just appended after another, nothing fancy, no fancy organization or whatever. And if you go to Postgres and insert all your data in a nice way, nicely ordered, okay, fine. But if you then do an update on that data and search again, if you do not put an ORDER BY, that's going to mess up your ordering. So be mindful about that too: heap files have no guarantee of order at all. Right? So this is basically how it's organized. And every single block has its own internal organization. It has a header, and the header has a lot of data. I put the information here, but as I said, I'm not going to go through all of it, because that's not the point; you can go and look in the documentation, it's really nice. What matters for our discussion is how these things are organized. This is inside the block, a picture of a block inside the database, inside Postgres. We have the header, and then we have the data. The interesting thing is that the data is stored starting from the end, not from the beginning. We have item pointers that point to the data. The data grows from the end of the block toward the center, and the pointers also grow toward the center. So when they get close together, that means there's no space left in that block; the block is full, and the database needs to create another block to put more information in. Also, Postgres doesn't split the data between blocks.
If your data doesn't fit inside one block, which is 8 kilobytes, actually we'll see that if it doesn't fit inside half of the block, Postgres is going to do something else. You're not going to lose your data; the database will handle that. Also, those rows we see here, those tuples, tuples and rows, we use the terms interchangeably for the same thing. The rows also have their own organization. I put all the fields here, and I'll go through a couple that are important for the discussion, not all of them. They have a header as well, and the header is fixed size. A question: does anybody know how Postgres stores null? If I have a table with a lot of columns that can be null, anybody? The null bitmap. Sorry? The null bitmap. The null bitmap. Does anybody know what a bitmap is? Okay. Yeah, a bunch of bits. You have what we call a map, a sequence of bits: 11111010100001. That's the map of bits. What Postgres does is keep this sequence of bits, and the position of each one or zero in that sequence is the position of the column inside your row, telling you whether that column is null or has a value. So it's a highly efficient and compact way of storing nulls inside Postgres. Yeah. This is how it works. And what is also very important here, this is again another picture of the row, is this padding. Remember the CPU padding? Well, the database also tries to keep things aligned with the CPU. So if things don't align, the database will add padding, and we'll see how that plays out. Right? Remember I said that if the data doesn't fit, Postgres will save it somewhere else? This is what we call TOAST, The Oversized-Attribute Storage Technique. And it usually goes really well with coffee.
Postgres uses another file, not your data file, to store the information that doesn't fit inside the block, and then it puts a reference inside the block, a pointer, to the beginning of the data in the other file. The database does this automatically; you don't have to do anything. So every time you have a text or varchar(3000)-or-something column, you may get TOAST, because that won't fit in an 8-kilobyte page. There have been a lot of improvements since it was created, with compression and those things, so it plays quite well. And when we understand how these things work, we can also change our data organization, though that by itself would be another talk. But I can give a small example. Let's say we have the users table, where the user does the authentication. For authentication to work, we only need to compare the username and password. But once the user has been authenticated, sometimes you want to show the picture, the bio, and a lot of other information that lives in the TOAST. So if we create a table with only the information needed for authentication, the authentication process, which can be an expensive process, can go really, really fast, because you can fit a lot of rows inside one block, and the database doesn't need to go to the TOAST: we've kept the TOASTed information out of the hot path through modeling. And then, when we need to show the user's information, we can still use that same primary key, referenced as a foreign key, to fetch that data at that specific time. For most applications, authentication is a hot path: everybody goes through authentication, but really few people visit the bio page. So we can improve the performance of our application with almost nothing, really simple changes, and get a boost in performance. So that's the performance improvement. Now, how does this work?
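The split the speaker describes could be modeled roughly like this; a hypothetical schema where all table and column names are mine, not from the talk:

```sql
-- Keep the hot authentication path narrow, so many rows fit per
-- 8 kB block and the login query never touches TOASTed data.
CREATE TABLE account_auth (
    user_id   bigint PRIMARY KEY,
    username  varchar(64) UNIQUE NOT NULL,
    pwd_hash  varchar(128) NOT NULL
);

-- The wide, rarely-read columns live in a separate table keyed by
-- the same primary key, referenced as a foreign key.
CREATE TABLE account_profile (
    user_id   bigint PRIMARY KEY REFERENCES account_auth (user_id),
    bio       text,    -- large values get TOASTed
    photo     bytea
);

-- Hot path touches only the narrow table:
--   SELECT pwd_hash FROM account_auth WHERE username = $1;
```

The profile table is only joined in on the few requests that actually render the bio page.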
As I said, when a value fills a certain percentage of the block, it gets saved elsewhere, with a pointer to it. And now we come to what really matters. If the data is too large, it goes into TOAST. But what if I have too many small columns? Well, Postgres has a limit on how many columns you can fit inside your table, because beyond that they won't fit in the block, right? So the question is: if we have too many columns, what does it do, given that it doesn't split rows across blocks? You just have a maximum number of columns. It depends on the data types you're using: with bigint, which is eight bytes, you get one number; with smallint, you can fit more. But you have a hard limit because of the block size. That's how it is. If it doesn't fit, you need to split the table into two, three, four, however many you need. But that would be insane, like having hundreds of columns inside one table. Can you repeat the question? You're not speaking into the microphone. So the question is: Postgres puts the information outside the block when it's larger than the block, but if we have too many columns inside the row, how does Postgres handle that? That was the question. And again, Postgres just has a hard limit on the number of columns you can put there; beyond that, you need to create more than one table. That would be the solution. Now we come to data alignment and padding. Remember I said the database also does padding? The natural alignment in Postgres is eight bytes, or 64 bits. What does that mean? It means that every time you put one integer, and an integer is four bytes, you want to put another integer next to it to get a perfect alignment.
If you put a four-byte integer and a bigint just after that integer, the bigint is going to go to the next eight-byte unit, because it doesn't fit the natural alignment. And you can ask me: well, what's the problem? Just go to the next one, not a big deal. Every type has its own alignment here, as you can see. Char has basically no alignment needed, because it's variable-length. That doesn't mean it's good to use char everywhere — actually it's good to put it at the end, not at the beginning, and we'll see why.

Yes? So the question is: does the database automatically reorganize the columns internally in the best way, or does the DBA have to change the order of the columns himself when designing the table? The database does not. This is the work of the DBA. The DBA has to do this work — it's a really good question, and we'll see soon why that is.

As we see here, every type has its own alignment. This is from the Postgres documentation, just a copy and paste. So smallint is two bytes on most machines, meaning one value occupies two bytes. We are aligned on eight bytes, so I can fit four of those two-byte values one after another and that makes eight bytes. However, if I put one two-byte value before a bigint, which is eight bytes by itself, we're going to have six bytes of padding — waste. Remember, padding is waste. Most of the time it's a waste of money, especially if we're running in the cloud. And it's really possible to optimize those things to make them work in a better way.

And this is an example. In this example, let's say I create this table. See, we have really few columns there, not many, and I put them in a random order: an integer, then a varchar, then a timestamp. So that was a very small table.
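To make the padding concrete, here is a minimal sketch of the effect of column order — my own table and column names, not the actual demo; `pg_column_size()` on a row value is one way to observe it:

```sql
-- Interleaved 4- and 8-byte columns: each int is padded out to
-- 8 bytes so the following bigint can start on an 8-byte boundary.
CREATE TABLE padded  (a int, b bigint, c int, d bigint);

-- Same columns grouped by size: the two ints share one 8-byte unit.
CREATE TABLE compact (b bigint, d bigint, a int, c int);

-- Comparing row sizes directly (the exact numbers include the row
-- header, but the interleaved row comes out larger):
SELECT pg_column_size(ROW(1::int, 1::bigint, 1::int, 1::bigint)) AS padded,
       pg_column_size(ROW(1::bigint, 1::bigint, 1::int, 1::int)) AS compact;
```

Multiplied over millions of rows, those few bytes of padding per row are where the space savings in the talk come from.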
I only inserted one million rows. On that one million rows, I got a certain size, and then I reorganized the columns just to align better, to remove the padding. The alignment saved me 25% of space. How much would 25% less be on your AWS storage bill? It might buy a burger, right? We never know. So we'll see a couple of other examples — I'll wait for people to take photos. All right. Yeah, question?

JSONB — how does this play with the JSONB data type, that's the question, right? So JSONB has its own specificities, and most of the time it doesn't stay inside of the block, because most of the time it doesn't fit inside of the block, right? JSONB has its own optimization algorithms. So I haven't tested it, and I can't answer your question, but that would be a really interesting one, to see how that works, how it plays together. Especially because JSONB is a binary format, so the algorithm itself should be able to do a lot of optimizations along the way. But yeah, I'll take note of that one, because now I'm curious.

Another question? So the question is if there is any analyzer tool that we can use to see how much space we're wasting, right? Not that I know of. There might be one out there, but not that I know. That also would be a really nice open source tool to develop, you know? For people looking for ideas, that's a nice one.

Back to JSONB — how does the database organize it, in 64 bytes, 32 bytes? I don't have an answer for that question, because I really have no idea; I haven't played with JSONB much. Most of the work I'm doing is with performance and, how would I say, native data types. So I don't have an answer for that either.

How long do you take to review everything and optimize it?
And are there any tools to make the process quicker, like any automated scripts? The question is how long it takes to do the full review process, and whether there is tooling. Well, I don't know of any tooling. And the time highly depends on how complex your database is. If you have a small database, like the one I use for the TPC-C, it took me like a few minutes, right? But for a really complex database, it can take you some time. But it's time that's worth it.

The next question is whether the alignment only applies to Postgres, or to other databases as well. If I'm not mistaken, Oracle also does it, but the implementation is specific from database to database. And I have not played with those, especially because this type of documentation is not highly available, and if you do not have documentation you really need to go to the binary, and that's really time-consuming. And I haven't worked with other databases for like 15 years. So yeah, but they probably do.

The question is whether we have a way to ask Postgres what the alignment is, right? If we have tooling — yes, we do have some extensions with which we can go deep and see how the organization is. We even have some extensions with which we can inspect the memory from time to time.

Okay, moving on, because I only have 15 minutes. Thanks — probably now it's less, because he's been showing me that sign for five minutes. So, okay, what are the implications of those things, right? I showed you one example. Now I'm going to show you another one, where I used a tool named sysbench to run a TPC-C-like experiment. So this is just an example of one of the tables. This is how sysbench created the table. This is normal stuff. You see it's a really small table, there's not much, and most of the columns are integers. So nothing going on there, right? And this is what I did. Besides the flashing thing — did you guys notice anything? I changed the order of some columns. It's a shame that the pointer is not working.
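As an aside on the extensions mentioned above for looking at the on-disk organization: one likely candidate (my assumption, the speaker doesn't name it) is `pageinspect`, which ships with Postgres contrib. A hedged sketch:

```sql
-- pageinspect exposes raw page contents (superuser only).
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Inspect the line pointers and tuple sizes on the first page
-- of a table ('mytable' is a placeholder name):
SELECT lp, lp_off, lp_len, t_hoff
  FROM heap_page_items(get_raw_page('mytable', 0));
-- lp_len is the stored tuple length; comparing it before and after
-- reordering the columns shows how much padding was removed.
```

This is only a sketch; in practice you would sample a few pages and average, since tuple sizes vary with the data.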
But yeah, I changed the order of some columns. See, I put all the integers at the top, and then I put the smallints in pairs to improve the alignment, and then I put the other columns. See that I put the timestamp — a timestamp, if I'm not mistaken, is eight bytes — and after that one I put another smallint, because it can still use the same space in the alignment. So the only padding that I'm going to have is after that one. I tried to minimize the padding as much as I could. Back to the other one — see, it's just not that bad: we had an integer, a smallint and then an integer. Not so bad, so it shouldn't make a huge difference.

So this is what happened. See the 'new' on the left, the schema name — the schema 'new' is where I created the new tables, the schema 'public' is the old ones. If you look at the total size of the orderline table, the new one is about three gigabytes — 3.8, to be fair — and the old one was 4.1. That's about a 7% reduction, right? And also interesting: look at the index size. We also have an improvement in the index size, and they have exactly the same indexes. Let's go back. See the index, the primary key here: we have columns one, two, three, four, and then another two columns for the other index. And exactly the same indexes — I didn't change the names, exactly the same columns, exactly the same order of the columns. I didn't even touch the indexes for the optimization; I just left them there. I only did it for the table, right? And this is what happens. And I'm highlighting here, for one of them, how much space is saved. One table, and a very small table, right?

So, but how does it play with performance? Because, okay, saving space is one thing, but what about performance? So obviously I ran a TPC-C-type test, right, to try that as well. The answer is I got an average 8.4% performance improvement.
Around, on average, for this example, this load, a 19% disk space reduction. I'm cutting 19% off my disk space bills on cloud providers. I think that's why they never approve my talks at their conferences — now that I think about it, it makes sense, right? The latency, I reduced by about 15%. Just shuffling the columns around; the application doesn't even need to change. And I'll tell you, when I created those tables — because sysbench has its fixed structure for the inserts and updates — I had to trick sysbench. I had to create views and then rules inside of Postgres to make those views insertable, updatable and deletable. So on top of this, I had extra latency on the database because of the tooling, and even so I got almost 9% improvement.

Can I just clarify — I'm confused. Your average write and read are around 8.2, 8.5, so I thought the latency would be between those two numbers. But obviously you mean latency in a different sense, or what are you measuring? Latency on the application side, because latency is not only about how fast or slow you get the data, but also, at the end, how fast or slow you process the data. You have more data in smaller blocks, so on the application side the latency is going to be a lot better. So I'm also improving the application for free.

You get some kind of improvement, but what are the things you need to do to reorganize the columns? So your question is what happens if I need to go to an existing table and reorganize the columns, right? Yeah, if you already have an application, you may need to do the trick that I did here. First you create the new table — so you're going to double the data for a certain amount of time. You create the new tables, and then you create views, and you create rules inside of the database, so your application will insert, update and delete — and obviously select as well — on those views until you switch over.
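The view-and-rules trick just described might look roughly like this — a sketch with made-up table and column names, not the actual sysbench schema:

```sql
-- Original table the application knows about.
CREATE TABLE orders (
    id     int PRIMARY KEY,
    note   varchar(100),
    amount bigint
);

-- 1. Reordered copy: 8-byte column first, variable-length column last.
CREATE TABLE orders_new (
    amount bigint,
    id     int PRIMARY KEY,
    note   varchar(100)
);
INSERT INTO orders_new (amount, id, note)
     SELECT amount, id, note FROM orders;

-- 2. Replace the old table with a view exposing the original order.
DROP TABLE orders;
CREATE VIEW orders AS SELECT id, note, amount FROM orders_new;

-- 3. Rules make the view insertable, updatable and deletable,
--    so the legacy application keeps working unchanged.
CREATE RULE orders_ins AS ON INSERT TO orders DO INSTEAD
    INSERT INTO orders_new (amount, id, note)
         VALUES (NEW.amount, NEW.id, NEW.note);
CREATE RULE orders_upd AS ON UPDATE TO orders DO INSTEAD
    UPDATE orders_new
       SET note = NEW.note, amount = NEW.amount
     WHERE id = OLD.id;
CREATE RULE orders_del AS ON DELETE TO orders DO INSTEAD
    DELETE FROM orders_new WHERE id = OLD.id;
```

One caveat: on recent Postgres versions a simple view like this is already auto-updatable, so the explicit rules are only needed for more complex mappings — but they match the approach the speaker describes.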
Or you use a tool like, if I'm not mistaken, pg... I forgot the name of the tool, you know. Actually, a guy had a really nice talk on Friday at PGDay showing that you can do online schema changes, so your application doesn't even need to know that you're changing the database. So you have a few different options. What's the name of the tool? pt-online-schema-change? That's for MySQL. Oh, pg — okay, yeah, pg-online-schema-change, that's the one. Thank you.

Is there a reason for the project not doing the reorganization for you? The question is: is there a reason that PostgreSQL does not do the reorganization itself? Yes — because even though we could have really smart things in there, the database doesn't know the full story. It might try to reorganize the data and mess things up along the way. So it's always safer for the DBA to do those things. At the end of the day, the database should keep the data and retrieve the data for you; you need to know how the data plays inside as well.

Yeah, I wonder if it's important which columns can be null, and how often the rows actually contain those columns? So the question is whether it's important for a column to be nullable, and how often that plays a role. It is important — not for the column itself, but for the following ones. Remember, PostgreSQL doesn't store the null, right? It stores it in the null bitmap. So there will be a marker there and nothing in that place, right? So yeah, it does play a role. I haven't tested to see how much impact that would have, especially because I just used the tooling, right? But yeah, it definitely should have some impact, for sure.

Would it be fair to say that this benchmark measures the B-tree more than anything? And let's say that we're not prioritizing latency, but rather we're doing dense joins and we want more bandwidth — is it true that with something like grid we're going to be able to get a much bigger improvement in bandwidth?
Can you rephrase that? I don't think I understood the question. So for this benchmark, it seems like it measures bitmap scans — it's like a B-tree benchmark. And let's say that we're not after latency reduction, but rather we want to get more bandwidth, say in a case where we would have distinct or time-based joins. Could we use something like grid to effectively get a much bigger gain in bandwidth, if it's dense enough?

If I understood correctly, the question is whether it's fair to say that the benchmark mostly tested B-tree performance, and whether more density of data would improve things. Well, actually, as I explained, Postgres, especially for insertions, works on the heap file. So in this type of benchmark we mostly don't deal with the B-trees, and their effect on performance would be really marginal — especially since, as you saw, we don't have many indexes; most of the tables here have just the primary key, and a few of them only one index. The only B-tree structures in Postgres in this example are the indexes, because the tables themselves are not B-trees; they're just heap files, right? So in that sense I would not fully agree, but density definitely would play a role, if you could have more dense tables. And in the example that I gave for authentication — what do you do? You put just a few columns there for the authentication, so you are increasing the density of the information in that table. In the same block, instead of having 300 rows you will have a thousand, right? It's a lot more dense, and that per se makes it a lot faster, especially if you consider that time is also wasted on the network, as you mentioned. And the network is not only the network bandwidth — there's also the bandwidth inside of the computer itself. Well, we have five minutes, let's rush to the end. So this is a summary of how Postgres stores things.
And actually — wait for the photos, almost, okay? — if you want to take something home, this is the summary. Every data type has its own alignment in Postgres and can cause padding, and that's really dangerous. We can really make a mess with our data, especially because most tables in our applications have 20-something columns, and if we are not careful about how we order them, yeah, we can mess up.

Questions? I think we have two minutes. When you have fields of varying size, like a text field, what are the problems? Text and varchar, right? They do not play well with padding, because they're variable-length, so the database doesn't know how to optimize them that way. It highly depends on how your information is laid out, and it's usually best to leave them at the end.

Sorry, can you say that again? Okay. Uh-huh. Yeah, the question is whether I tested on a larger database, because a smaller database might just fit fully in memory. So yes, I did. And what you need to realize is that all the padding we have here goes to memory, all the padding that goes to memory goes to the network, and all the padding that goes to the network goes to your application. So you're wasting space on your hard drive, in your memory, in your CPU cache, on your network and in your application. What are the numbers? Well, it highly depends from application to application, so I would say it's empirical — there's no exact science — but you can get up to 30% performance improvement. That's all. If you have more questions, thank you guys. Here is my link again. Thank you.
Introduction to the Public Code and Digital Public Goods devroom
So, hello. Welcome to the Public Code and Digital Public Goods devroom. My name is Elena Findley-de Regt. I'm here from the Foundation for Public Code. This is my colleague. Hello, everyone. Nice to meet you. I'm Amreen Taneja, the Standards Manager at the Digital Public Goods Alliance, where I manage, lead and promote the Digital Public Goods Standard. So, very excited for this devroom today. And I'm Jan Ainali. I'm also at the Foundation for Public Code, and I'll talk later.

Cool. So, in case there's any confusion about what we're doing here and who we are: this is a devroom dedicated to everyone developing public code. That is open source code that implements public policy, used for the public good and by public organizations like governments, public administrations and state corporations. Digital public goods, DPGs, are open source software, open standards, open data, open AI systems and open content collections that help meet the sustainable development goals.

So, we have a couple of housekeeping notes. Most importantly, the FOSDEM Code of Conduct applies here, so please be respectful in the space. Secondly, we have a window open for ventilation, to make the space a bit more comfortable. If people would like more than one window open, I'm happy to hop on that. We're going to leave the window open all day in any case. And that brings us to the third housekeeping point, which is that if you have any questions, if anything comes up today, talk to Jan, Amreen or me. And that's it. So, on to Amreen.

Thank you so much. I'll just take a moment and get this up. Okay. So, well — I've already introduced myself. First of all, I'd like to warmly welcome you all to this devroom today. First things first, I'd like to share with you a bit about the Digital Public Goods Alliance, for those of you who are new to this organization and concept.
So, we are a multi-stakeholder initiative which was launched in 2019, and our mission is to accelerate the attainment of the sustainable development goals by facilitating the discovery, development and use of digital public goods, which are essentially all open source solutions. I'll share more about this as we move forward, but I'd like to kick off by introducing you to the Digital Public Goods Standard. Just to give you a little bit of context on where this concept and definition came from: the DPG definition was actually laid out by the UN Secretary-General, and there are five kinds of open source digital solutions that can be recognized or certified as DPGs. These are open source software, open data, open content, open standards and open AI models.

We have a set of nine indicators that make up the standard, and I'll share a bit about each of them with you today. The first one is SDG relevance. This is a very broad topic — essentially, any application that wants to do good for society in some form or another will fall under one or another SDG, right? So what we expect from you here is, first of all, to establish a clear contribution to one or more SDGs, and also to explain how your application seeks to achieve that. We also have an SDG tracker tool, which I'll be showing later in the presentation.

The second indicator is open licensing. The DPG Standard has a set of specific licenses that we accept: for software, all licenses approved by the OSI; for open content, Creative Commons licenses; and then we have various other licenses for AI systems as well as data. Because there's a paucity of time, I won't get into too much detail right now, but I'd love to have this conversation with you later on.
I'll move on to the third indicator for now, that is clear ownership. One thing to know is that the DPG status needs to be renewed every year — you have to send in an application every year, and your application needs to stay up to date with the standard that we have created. So we need to know who the owner of the application is, and it can be either a person or an organization; both are acceptable. What you have to provide to us is a proof of ownership, which is anyway a legal requirement for the application.

Now, the fourth indicator talks about platform independence. This is a tricky one, and the goal here is for vendor lock-in to be avoided. We prefer for everything to be open source, but let's say you have a proprietary component within your application. Then, when you apply to be a DPG, what you have to do is point to an alternative open source component and explain how it should be implemented — the condition being that the solution should be relatively easily implementable for anybody who has enough technical knowledge. We in fact have external facilitators and experts for this particular indicator, and we have them with us today as well — Ivan, that's for you. So if you have any questions around this indicator, please feel free to contact him.

Now, coming to indicator number five, that is documentation. This is fairly straightforward: it basically means that you need to have all your documentation in place. It can be in the form of a repository, on your website, or in the form of a guidebook, and it should have enough detail that someone with enough technical knowledge is able to deploy the solution by themselves. That is the requirement that we have. Now, moving on to indicator number six.
So, that talks about a mechanism for extracting data. If your project collects any sort of non-PII data, then it should be possible to access that data through non-proprietary formats. That is the condition that we have.

Now, coming to indicator number seven: adherence to privacy and applicable laws. In fact, I have some news around this indicator, which I'll be sharing with you later on. Essentially, what this means is that your application should be compliant with the privacy laws of the jurisdiction where the application has been created or where you intend to operate. So, if it's Europe, it'll be the GDPR; anywhere else, you have to provide proof of compliance with the local laws, of course. And that can be done by providing us with terms of use or a privacy policy. These things are handled on a case-by-case basis, so you'll be speaking to our reviewers about this, and once you satisfy the conditions, then we move forward.

Now, coming to indicator number eight: adherence to standards and best practices. Essentially, whatever standards and best practices apply to the industry where your solution belongs, you have to adhere to them and provide some proof of adherence to us as well.

And lastly, coming to indicator number nine: do no harm by design. This essentially means — and we say 'by design' because we don't look at implications somewhere down the line that are completely out of your control, right? — that we look at how the digital solution is being built, not at how it ends up being used. That is what we focus on.

Now, moving on to the next slide. This is about how you become a DPG. It's a three-step procedure. The first stage is nomination.
Nomination essentially means that you can either nominate yourself or a third party can nominate you. The second stage after this is the technical review. This is a very, very rigorous process: we have level one and level two reviewers who go through your application, and if your application satisfies all the conditions, then it is certified as a digital public good and recognized on the registry.

So, like I mentioned: step one, we have a five-minute eligibility test that anybody can take, and you can figure out whether your solution is, at the outset, capable of becoming a digital public good or not. Step two is the nomination — this is what the application form looks like, and it needs to be filled in as per the criteria that we just spoke about. And this is step three: success. If your application is selected, it is added to the DPG registry. And this is the SDG tracker tool that I was talking about — this is where we have the 150 DPGs categorized and arranged as per the various SDGs they are striving to contribute towards.

Now, coming to the call for experts. I mentioned some news about indicator seven: the standard is entering phase two of operations. What this means is that we are going to be fine-tuning critical indicators of the standard through two expert groups that we are launching, one on privacy and one on AI. You'll see this poster across the devroom and outside as well, so if you're interested, please feel free to scan the QR code and apply. And these are the requirements: if you're a subject matter expert in either privacy or AI, with a technical background, legal background, academia, or any other background which you think would be a good fit, please do apply. It's not much of a time contribution — about three to four hours for this knowledge partnership.
And of course, if there is previous experience in standards making, then that is also highly encouraged. And with that, it comes to an end from my side. I would like to introduce Jan now, who is a DPGA member as well as a co-host here for this devroom. Thank you so much.

Thank you, Amreen. I come from the Foundation for Public Code. We're a non-profit. We're based in Amsterdam, but we aim to work globally — just last year we started a chapter in North America. And we exist to help the public organizations who have already decided that they want to work with open source, and develop open source, to do that in a collaborative way, ensuring also that anyone can reuse what they have been doing. And to do that, we have the Standard for Public Code. Here are some old versions; we have some new paper versions here, if you'd like. Just last month, we released 0.8.0. It has a number of different certification criteria in it. I'm not going to go in as deep as Amreen did, but this is what we use to certify that a code base is easy to collaborate on.

Our philosophy is that it shouldn't contain any surprises. It should be more or less the best practices of the open source world, so you're probably already doing most of it — and then there are probably also a lot of shortcuts that you have made to save some time, things that you wish you had the time to do properly. We have collected them all here. And if you comply with the standard, our thesis is that it will be very easy for someone to come and collaborate with you. It's of course an open standard itself — it's CC0. You can start using it immediately. You don't need our permission to do anything, and you don't need us to come talk to you. Reuse it, adapt it to your needs. And if you find that something is chafing, please contribute back to us so we can continue to improve it with your feedback.
And these are, sort of, the types of requirements that we have. And just as Amreen showed with the DPG Standard, we also have a self-assessment test that you can do. It's just 10 yes-or-no questions, to give you an idea of how close you are before you dig into it completely, because the entire standard has like 116 requirements or something like that. And there's a review template, of course, and a checklist to easily check what you're doing. And we list everyone who is compliant on this website. Today it's a list of zero, but it is a list still. But we also include, right now, everyone who has said, oh, we are aiming for this goal — so everyone who has the ambition gets listed there.

And then just a tiny little thing: we also have a number of governance game decks. It's a little game you can play with your community to figure out how you want to do your governance. And we're giving them out for the small fee of signing up to our newsletter. And with that, I want to introduce our first speaker of the day.
Some updates on Public Code in Germany
Okay. Hi everyone. My name is Marco and I've been an active member of the FLOSS community for about 10 years now, with contributions to Signal and Dino, and also in the wireless mesh community tooling area. Currently I'm working for a German government agency that builds IT infrastructure for Germany, mainly backend infrastructure. We sit in the middle between the 16 federal states of Germany and the federal level, so we have a lot of stakeholders to work with and contribute to. Also, through this job I get a lot of feedback and see a lot of things that are happening in Germany.

So first, let's talk about a little motivation for this talk. In Germany I have the feeling that the term 'open source' is very omnipresent in public administration and also in politics. No one actually speaks about free software; 'open source' is the leading term here. Also, there's very little information about how FLOSS is used in public administration, there's little knowledge in public administration about how to handle FLOSS software appropriately, and there's hardly any contact with the FLOSS community. There are exceptions, of course, but generally speaking there are ways to improve that. There are also hardly any statistics on the use of free and open source software in the German government. So my impression, after three years in this domain, is that everyone is talking about at least open source software — maybe they also mean free software, maybe they don't make the distinction between the two terms, and that's also okay — but in practice hardly anyone is really following these software development practices.
Right now there is a lot happening in Germany, and I thought it might be a good chance to give an update on what happened in the last year or so and what's happening right now, to give you a better feeling for how the things happening in Germany might also be relevant for other countries — or, if you are from Germany, I hope it might be interesting for you too.

So the first question: are we FLOSS yet, in Germany especially? I wanted to start with the state of FLOSS laws and regulations there. In June 2020, about two and a half years ago, a principle was defined in the service standard. This service standard gives design principles for government digital web services — interactions between people and the government — and it is also mandatory for the largest digitalization program of the last five years. Those of you who are from Germany may know it as the Onlinezugangsgesetz, or OZG for short. It's a law in Germany that mandates the government agencies to provide their services online. And this principle says that source code from the realization of digital services must be made available as open source. That's very progressive; we think that's a nice thing. But the problem with it: it's not mandatory. There was a survey — I think in, what was it, it's written down here, 2022 — and only 15 out of 221 people that were asked said they give it a high priority in their own projects. That's only very few people, and in practice I also see that many people don't even know about it, so it's not very broadly adopted.

Then in 2021 there was another approach: an obligation from the economic stimulus package, also intended to improve government digital services. There it says the source code will be made available as open source 'whenever possible'.
Nobody really knows what this 'whenever possible' means, and unfortunately the Federal Ministry of the Interior didn't really keep track of which projects actually released software as open source under any open license. I personally know of only one, while there were a lot of projects in there that got funding. So this really didn't have much impact.

Then in November 2021 we had a new parliament in Germany — we had elections — and the coalition that formed after these elections wrote down in their coalition agreement that development contracts of public agencies should generally be commissioned as open source, and the corresponding software that is being developed should always be made public. So this is the same intention again. But there's a 'but', because after this agreement the German government spent 4.8 billion euros on proprietary cloud infrastructure, in addition to 1.3 billion for Microsoft licenses. Of course you can't just throw Microsoft software away — it doesn't work like that, it's more of a long-term change — but this 4.8 billion for cloud infrastructure was a new contract that didn't exist before in this form; you could have invested that in open source software.

Also, in general, less than 1% of the investments by the current government, in the current legislative period, went into the open source software ecosystem. And the planned financing for ZenDiS — that's the German OSPO — has been cut by nearly half; they didn't find the money that was needed, so they had only 24 million euros. That's still a lot of money for an OSPO, that's great, but compared to the initial plan it's less than we expected and hoped.
And there are still no FLOSS procurement regulations, which are badly needed to give government agencies a tool to really require procurements to be based on open source licenses. But we have some policies in the German federal states. We have 16 federal states in Germany, and two of them, Thuringia and Schleswig-Holstein, defined a priority for free software in their state laws. It's mainly the same text in both regulations. The first aspect is that a priority for free software should be applied "if technically possible and economical". Again, we don't really know what this means; it's hard to define when it is economical to use open source software compared to proprietary software. This often comes with long-term impacts, so it's a really hard question, and it's easy to find arguments why open source isn't even cheaper. Also, for in-house developments the rule is that an open source license has to be applied and the software needs to be published, as long as it is not used for "security relevant tasks". Again, I don't know what a security relevant task is. Maybe people thought about police software etc., but I think we in this room know that especially in these domains it's super relevant to have open source software, to have the possibility to see inside the code and see what they're doing there: first to improve the security, and second to improve control of what agencies do in their day-to-day business. Okay, but still, these two federal states have thought about these questions and applied some regulations. That's great, I really like the effort there, and in practice we see there are some very motivated people in those governments doing everything they can to improve this even further. So I think that's a very good first step.
So let's have a short look at the European perspective here. I created a graphic based on information from the Joinup platform and on a questionnaire to the German Bundestag, our federal parliament, and we see that currently some countries in Europe, I would say a huge amount of the relevant parts of Europe in terms of their power in the European Parliament, have some regulations in place concerning open source software. The Swiss parliament just passed a law last year, in March 2023, to publish all government software under an open license; there will be another talk about this in the Legal and Policy Issues devroom, so head over to that talk to get more insights about Switzerland. Okay, but let's have a look at FLOSS in practice. In general we must summarize that these political objectives are mainly ignored, to be honest, in public administration. The step from legislation to the execution of these laws is hard, and it's not done yet. As in the industry, we also have the phenomenon of open washing: presenting some kind of software as being open when in fact it's actually not. A small example of this is the Government Site Builder, which is used to build the websites of all the German ministries. On their website they say it's based on open source, and if we dig a bit deeper we can see the claim that the technological basis is 100% open source. That sounds great, so I wanted to dig a bit deeper and tried to find a download link. I found some, but unfortunately I didn't get to any git repo; instead I was greeted with an HTTP basic auth. The software is based on open source software, that's correct, but it's not released as open source software.
So why is it that the public administration doesn't really respect these political intentions that have been formulated on every level in Germany, from the top federal level in the Bundestag down to the federal states? As far as I see it, the public administration has too little experience with public procurement of free software; it's hard, they don't know how to buy free software and buy support for free software. They also have no experience with releasing their own code as free software. There is little incentive coming from laws and regulations to invest in existing free software, and there is little incentive to release their own code and collaborate with others to improve it, because there is so little knowledge about the benefits of all of that. In summary, I think the application of these FLOSS development models still heavily depends on individuals. We have individual cities, and we will later see an example where it works really well, but my feeling is that it's still dependent on individual persons who push for this and do the heavy work. In practice it's not widely adopted in all government agencies. We will later see how to fix that, but first let's talk about some wins; there are also great things happening in Germany.
Germany just built an open source collaboration platform called openCode. It consists of a GitLab instance, a Discourse forum, and a wiki.js wiki, and it is also based on the publiccode.yml standard that is used to annotate the purpose of public software. This encourages the public agencies to make things open. Today the administrations do not really dare to do this, but with this platform they can see that other government agencies also release their code, and if others do it, it might be okay: "I might also be able to release my code as free and open source software." I think that's a great thing. It's also somehow a safe haven for public administrations to get some first experience, where they don't have to go to external free software repositories like gitlab.com or even GitHub, which they have no experience with. This is inside the government: even if it's public, it's something government-owned, and this might help to convince some people to release their software there. I think it works okay; there are already more public organizations on there than on GitHub, at least for the German public administration, though to be fair there are very few German public administration organizations on GitHub. But still, only a few real projects exist on this platform, to be honest; many of them are stubs, many are just code dumps or other kinds of documentation, consultation processes etc. So I think it's a good start, but there needs to be more code there. There is also the openDesk project, which wants to integrate all these products we know, like Nextcloud, Collabora, Univention Corporate Server, Open-Xchange, all this kind of software that exists but doesn't really integrate very well. The idea of openDesk is to pay the software vendors to build integrations between these solutions.
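The publiccode.yml standard mentioned here describes public software with a machine-readable metadata file. As a minimal sketch (in Python rather than YAML, with an illustrative subset of fields rather than the full standard, and a hypothetical repository record), a platform like openCode could check that the basic fields of such a record are present:

```python
# Illustrative subset of publiccode.yml top-level fields; the real
# standard defines many more (description, maintenance, platforms, ...).
REQUIRED_FIELDS = ["publiccodeYmlVersion", "name", "url", "legal"]

def missing_fields(metadata: dict) -> list:
    """Return the required top-level fields absent from a metadata record."""
    return [field for field in REQUIRED_FIELDS if field not in metadata]

# Hypothetical record for a public-sector repository.
example = {
    "publiccodeYmlVersion": "0.2",
    "name": "Civic Services Portal",
    "url": "https://gitlab.opencode.de/example/portal",
    "legal": {"license": "EUPL-1.2"},
}
```

A complete record yields an empty list of missing fields, which is exactly the kind of automated check a hosting platform can run before listing a project.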
Also, there is an interesting project from Germany called KoliBri. It's completely publicly funded, built by an IT service provider of the federal government, and it's basically a component library that uses web components and is meant to have a strong focus on accessibility. They also follow a real open source model and are accepting contributions. In my opinion they have interesting, great tech, and it doesn't really feel like a public administration project; it's a normal open source project, and that's great. A huge recommendation: if you're looking into a component library, this might be an interesting thing for you. There's also a design system that's meant to be used for all government services, to build a unique, recognizable design. It's not an actual software project; it's the design system that defines which design elements are used on the websites. But still, they have the philosophy and the community building baked into their DNA, and they're trying to get involvement and build a community. That's also a great thing happening right now. As already mentioned, some cities are making great progress. The city of Munich built its own open source transparency website, and this is really interesting because they document which FLOSS software they use, which software they contribute to, both in terms of code and in terms of funding, and which software they write and publish. So they really understood the benefits of free and open source software and built a website to make it transparent. I think that's a good example, maybe for other cities too. And we have the national documentation portal, a Read the Docs-like project where documentation for developers on the core government infrastructure can be found. It is itself licensed under the European Union Public License and is also accepting contributions. So let me close this talk with the question: what does it take for free software to become the default in public authorities? I've brought three challenges here. The first one: we need to release custom-built software under free software licenses, of course. I think regulation is very important here; there needs to be some regulation in place that pushes governments to do this, because otherwise there is little to no motivation to do it in the first place. Regulation helps very much here to get all the code released. And of course, knowledge and skills in this area need to be built up in the administration; maybe our OSPO can contribute a lot to this in the coming years, but that's a major challenge in all government agencies. The second challenge is free software procurement, which is of course a real thing here. And third, we need to measure our progress: does it really work, are we making progress in this area? Right now there are hardly any statistics. I think it might be a good idea to have the mandatory use of a searchable software catalog before buying any software, like the Italian government already does. There's the Italian free software catalog, and all Italian government agencies need to have a look at it. It doesn't say anything about what they do with the results; they just have to document that they have searched there for the kind of software they want to buy. And if there's something in there, that's a good opportunity to look into it and see, for example, "is this software useful for us?", before buying any non-free software. If you want to learn more, we collected some infos on best practices about free software in the German government, also some examples. This kind of follows the idea of an awesome list: find some information about what has worked in the government to improve free and open source software. Maybe this might also be something for other countries, for your communities too. I really encourage you to build some knowledge about what already exists and communicate about the efforts that have been taken already. Okay, thanks for listening, and if you have any questions you can contact me here, or maybe later outside. If we have time, maybe one or two questions? We don't have time, okay.
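The Italian catalog rule described above (search an existing free software catalog, and document the search, before buying anything) can be sketched roughly like this; the catalog entries and the matching rule are invented for illustration, not taken from the actual Italian catalog:

```python
def search_catalog(catalog, need):
    """Return catalog entries whose name or category matches a procurement need."""
    need = need.lower()
    return [entry for entry in catalog
            if need in entry["name"].lower() or need in entry["category"].lower()]

# Hypothetical catalog entries.
catalog = [
    {"name": "DocFlow", "category": "document management", "license": "EUPL-1.2"},
    {"name": "GeoPortal", "category": "mapping", "license": "AGPL-3.0"},
]

# An agency would record this search result before any proprietary purchase.
matches = search_catalog(catalog, "document management")
```

The point of the mandate is not the matching logic but the documented step: even an empty result proves the agency looked for existing free software first.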
GNU Health. Incorporating Digital Public Goods in the European healthcare system
All right. So first of all, thanks to the organizers for having us here. And I have to say I'm not Luis Falcón; I'm spontaneously replacing him today. Nevertheless I will introduce both him and myself. Luis is both a computer scientist and a physician, and he founded GNU Health a bit more than 15 years ago. He's specialized in genomics and medical genetics, and apart from being active in social medicine he's also involved in animal rights. Then shortly about me: I studied computer science in Hanover, and I've been employed there for a bit more than two years. Mainly I'm working on an Ansible deployment of GNU Health to ease and improve the installation process, but I'm also reporting and fixing bugs and rewriting the documentation. Last year we also hosted the annual GNU Health conference in Hanover, together with the Orthanc conference; Sébastien will give the following talk about Orthanc. The institute I'm working at is called Computational Health Informatics, and even though we are working inside computer science, it's always related to medicine. So behind GNU Health there's a non-profit, non-governmental organization called GNU Solidario, which works globally and is focused on social medicine and GNU Health. There's also the Global Exposome Project, which aims to investigate how the environment has an impact on our health, and how social problems like water pollution, factory farming or wars also impact this environment and consequently our health. And then again there are also projects about animal rights where it is involved. GNU Solidario is spread around the globe, but when it comes to productive use in hospitals, we hear the most about projects in Latin America or Africa, for example in Argentina or Cameroon. And there are many research institutions, hospitals and so on; for example, in the top middle there's a university in Argentina that cooperates quite a lot with GNU Health.
Okay, so what is GNU Health actually? In general it is a hospital information system, but the core is a hospital management information system that is often called the HMIS node. There you have a client-server architecture, and it takes a quite pragmatic approach compared to other ways of organizing the infrastructure of hospitals. It is, first of all, based on Tryton, which is an enterprise resource planning tool, so you can take over the user management, inventory, stock and finances functionality from it, and then we add modules for hospital functionality on top. Like Tryton, it is written in Python and uses the PostgreSQL database backend; even though Tryton could theoretically use other databases, we always take this one, first to have a uniform setup and also because it has many good features for productive use. There are really many modules that are part of GNU Health, for example about surgery, the laboratory, or genetics and bioinformatics. And as it's used in many precarious regions, GNU Health Embedded is also one subproject, which basically means that there are, for example, images for Raspberry Pis, because sometimes it's really a matter of resources what to use. And as the name says, GNU Health is a GNU package. The HMIS component, as I said, is a client-server architecture, and on the upper left you can see a screenshot of the client. With it you can generate graphs, display images, there's a calendar you can use, and the electronic health record is also part of it. Then there's a reporting engine coming with Tryton, so all the information you fill into the database fields can be exported as an ODT; there's LibreOffice in the background, and you can generate this and print it or use it outside the program. Yeah.
Besides, there's an integration with Orthanc, which is a DICOM server, to support medical imaging; there is actually no DICOM viewer integrated in GNU Health, and as usual the DICOM format is used. It was chosen not to reimplement any DICOM viewer or redo all the work Orthanc has already done, but to integrate Orthanc, to synchronize patients and studies between the two of them, and to just use the DICOM viewers already integrated in Orthanc. Apart from this, there are other components of the GNU Health ecosystem, for example the Federation and MyGNUHealth. MyGNUHealth is an app that can be used to enter vital data and, in the end, also to share that vital data. Last year, at the 40th birthday of GNU, the second version was released, where all the dependencies outside Python were eliminated, because many people don't have Linux on their phones and the requirements we had before are now gone. It was migrated to Kivy, so now the idea is to have something cross-platform. The GNU Health Federation aims to connect multiple of those HMIS nodes and ideally also give people the opportunity to share the vital data they recorded with the hospitals. To give one example, the colleagues in Argentina used this at the beginning of the COVID pandemic to trace the COVID situation.
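Conceptually, the patient synchronization between the HMIS node and Orthanc boils down to comparing patient identifiers on both sides and pushing whatever is missing. The toy sketch below only illustrates that idea with plain dictionaries and invented IDs; the real integration talks to Orthanc over its REST interface rather than in-memory data:

```python
def patients_to_push(hmis_patients, pacs_patients):
    """Return IDs known to the hospital system but missing on the imaging server."""
    return sorted(set(hmis_patients) - set(pacs_patients))

# Hypothetical records keyed by patient ID.
hmis_patients = {"PAT001": {"name": "Ana"}, "PAT002": {"name": "Ben"}}
pacs_patients = {"PAT001": {"name": "Ana"}}
```

Running the comparison in both directions is what keeps patients and studies consistent between the two systems without either one reimplementing the other.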
And now, to come to the topic of the room: GNU Health was declared a digital public good, in the context of the sustainable development goals of the UN, where many goals should be achieved by 2030, and one of them is healthcare. So GNU Health is part of this, and it is also listed on the European Commission's Joinup platform, where free and open source software is promoted inside the European Union. Compared to other software projects there are of course always bureaucratic barriers and certification processes, and there are many steps to check whether your project is medical device software; but at least the hospital information system itself and the electronic medical records are not a medical device. Of course there is other stuff; for example, in Germany it would for sure need an interface with the insurances, and most of the productive use is elsewhere. From our point of view, proprietary software in public healthcare is a contradiction, and we think there should be a move to free software. There are really many barriers and a lack of funding, especially for free software projects, and there could be many benefits in putting more resources into communities like this, so that everybody can profit from what people are working on. This is why we also signed the campaign Public Money, Public Code; I already saw it in the slides of the previous talk. I guess most people know it, but basically the name already says it: if public money is spent on a project, then the code should also be available to the public. Easily said, but not yet the reality. I'm finishing with a slide with something Luis often says: GNU Health is a social project with a bit of technology behind it, to highlight that it's not only about the software but also about the philosophy behind it. That's it, thanks for your attention.
From disconnected elements to a harmonious ecosystem : The Epiverse-TRACE project
First up, we're going to hear from Hugo Gruson: From Disconnected Elements to a Harmonious Ecosystem, the Epiverse-TRACE project. Hi, my name is Hugo. I'm the lead software architect at data.org, and today I would like to talk to you about the work we are doing to build a harmonious ecosystem for epidemiology as part of the Epiverse-TRACE project. Today's scientific research relies more and more on data science and computational tools, and this is true across fields such as epidemiology, climate science or econometrics. But the pipelines used by these data scientists are also getting increasingly complicated to maintain and to update. To change just a single step in a pipeline, just to use a different piece of software, you may have to spend hours of data wrangling just to get the right format for the inputs and the outputs. And the problem is that this complicated maintenance is something we cannot afford when we are in the middle of a crisis; the price is just too high to pay. When the next pandemic hits and we want to get results really fast to understand what's happening, it's not the time to do basic, boring data wrangling; we want to do actual science instead. Said differently, we have some good isolated free software tools, but we don't need just good isolated pieces of software, we need a robust ecosystem as a whole. And this is precisely what the Epiverse-TRACE project is about. It's an international, multi-stakeholder project to harmonize the ecosystem of epidemiology tooling in R. We do this by making the existing pieces interoperable, by supporting existing tools in adopting global standards such as the ones defined by the Digital Public Goods Alliance or organizations like rOpenSci, and by developing a sustainable community around these ideals. I can also define our goals by what we don't want to achieve: we don't want to erase the existing established communities.
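The data wrangling Hugo describes is often just renaming and reshaping fields between tools that expect different conventions; interoperability means agreeing on those conventions or providing adapters. A tiny sketch of the adapter idea follows (the column names are invented, and Epiverse-TRACE itself works in R; this is a language-neutral illustration in Python):

```python
# Hypothetical mapping between one tool's field names and another's.
COLUMN_MAP = {"case_id": "id", "date_onset": "onset_date", "age_years": "age"}

def adapt_record(record, column_map=COLUMN_MAP):
    """Rename the fields of one record so a downstream tool can consume it."""
    return {column_map.get(key, key): value for key, value in record.items()}

raw = {"case_id": "C-17", "date_onset": "2024-01-15", "age_years": 34}
adapted = adapt_record(raw)
```

When every pair of tools needs its own ad-hoc mapping like this, the pipeline becomes fragile; a shared standard removes the mappings entirely, which is the project's core argument.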
We recognize that diversity of solutions is good; it's nice to have a rich ecosystem, but we need interoperability in this ecosystem. The way we do this is by involving the community. We work with existing established communities, and by this I mean both established communities of users, such as public health institutes or NGOs, and existing communities of developers. In the end, what we want is to come up with a solution that increases usability, sustainability and maintainability for everyone involved. We've already had quite a lot of success with this approach: we've managed to package and release a lot of unmaintained, non-portable code bases, including many more tools than the ones presented here, but for the sake of this session I should mention that two of them are already registered DPGs and one is in the process of being submitted. Having a sustainable network of collaborators is really exciting and really ambitious, but as you can guess it also comes with challenges. In particular, research and academia are really competitive spaces, which makes it difficult to build collaboration between some communities. Additionally, because we have a multi-stakeholder community, communication is really difficult in a network with so many collaborators and so many nodes, which creates delays and miscommunication. And the question is how to build something that is sustainable even though funding in this space is uncertain. To conclude, I hope I managed to convince you that responding to the next crisis, be it the climate crisis or the next pandemic, will require interoperable tools, and that this can only be done through collaboration and multi-stakeholder projects.
But even though it's necessary to have this kind of complex community, it also brings a lot of extra challenges, especially around communication, collaboration and sustainability. In the end, what may initially appear as a technical challenge is even more of a communication and social challenge. With this, I will finish with a picture of the entire core team of the project and invite you to come talk to me if you're interested in any of this. Thank you.
Legislation Editing Open Software (LEOS) - an innovative open-source solution for drafting legislation
Yes, we just go right on. Okay. Thank you. Good afternoon everyone. I'm Fernando Nubla, a project officer at the European Commission, and I'm going to tell you a story: the story of LEOS. Once upon a time, we were in Legisland, and you can imagine Legisland is not that much fun: it's all about legislation, and you know how complex legislation is. In this case we had Lawgislate, who was complicating the life of everyone living in this kingdom. Lawgislate enforced very complex rules on everyone who wanted to create a new piece of legislation: rules about the structure of the documents, formatting, etc. He was not taking care of the versions; we had versions everywhere, on local computers, in shared folders, everywhere you can imagine, even on paper. So it was getting very complicated for the people working with legislation. And there were a lot of people, and no one was collaborating, because Lawgislate was not helping them with that. So we went to the round table and tried to find a solution; we needed to help these people. We started by creating a work plan, an idea of what we needed, and we defined more or less the solution we wanted. But there was something very important: the financing. We couldn't do this without a budget. So we used two programmes of the European Union: the ISA programme, when we started in 2012, and now the Digital Europe programme since 2020. With the budget, the work plan and the idea of the project, we then created Mr. LEOS, whom you have here and also on our shirts. We wanted Mr. LEOS to be an open tool, a web application. We wanted to be able to draft texts; we wanted a rich editor where you can put images, formulas and track changes; we wanted collaborative tools to create comments and suggestions and to work with other people, everything centralized. We take care of all the versions.
So now they are not spread everywhere; they are all in a central place, and you can go and check them. And something very important: we wanted to use open standards. We didn't want to keep drafting legislation in an unstructured format that you can't use further; we wanted to create something structured. We are using the Akoma Ntoso standard, which is open to everyone, to any administration or government that wants to use it. And the last aspect: we wanted to do it open source. That was something new, the European Commission doing open source projects, but we did it. And we brought the community with us; we are not alone. We wanted to do this with member states, with other countries, with academia, et cetera; whoever wants to help us is welcome. And then, I know you were waiting for this: there was a battle, of course, between the LEOS project and Lawgislate. But no one was harmed; our idea was to convince them that there was a better way to do things. So finally we ended up with Lawgislate on our team. We are running around Europe, helping member states and other institutions to use this tool, which is open source and available for everyone, and together we are going towards the future. And the future is leading us to artificial intelligence and machine learning: imagine drafting legislation with just one click and getting the proper text. And this is the tool, so everything I said is true; you can check it out. We have all the features I mentioned before: the structure, the versions, the rich editor with track changes; on the right you have the collaboration features to create comments and suggestions, and the reuse of texts. We are using open projects like CKEditor, Hypothesis and EUI, so we are building our software in the open and on top of open source projects. You can check us out via this QR code, and you can also scan our t-shirts. We are on code.europa.eu; all the software is available there.
You can check it out and contribute; we are waiting for you there. Thank you. And this is the end.
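Akoma Ntoso, the open standard LEOS drafts in, represents legislation as structured XML rather than free-form text. The fragment below is a heavily simplified toy, not schema-valid Akoma Ntoso (real documents use namespaces and attributes such as eId), but it shows why structured drafting pays off: each article becomes addressable data instead of a blob of formatted text.

```python
import xml.etree.ElementTree as ET

# Toy fragment loosely shaped like Akoma Ntoso; simplified, not schema-valid.
document = """<akomaNtoso>
  <bill>
    <article id="art_1">
      <num>Article 1</num>
      <content>Source code shall be published under an open licence.</content>
    </article>
  </bill>
</akomaNtoso>"""

root = ET.fromstring(document)
# Structured markup lets tools address, version and compare each article.
articles = {a.get("id"): a.findtext("content") for a in root.iter("article")}
```

This addressability is what makes centralized versioning, comments anchored to specific provisions, and text reuse feasible in a tool like LEOS.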
TruBudget - a DPG to support the project workflow in international multi-stakeholder environments
So, hey everyone, my name is Zuri; I work for KfW in the field of development cooperation, and in this case development cooperation is not about GitHub but about third world countries working together with donor organizations like us to make the world a better place. Together, over the last couple of years, we developed a digital public good, we also registered it, and that's what I would like to show you, and also to invite you to collaborate. I have three slides: I would like to explain the problem to you, our proposition, and where we are at the moment. So here is one example. If you look at development cooperation and how it works today, you see one country here, maybe someone recognizes it: it's Ethiopia. We started to count how many government organizations and NGOs are actually working in this country and supporting it, and at some point we stopped counting, because we didn't have the time to put more logos on the slide. The problem here is that we don't trust the data and information exchange with the partner countries, and they basically don't trust us either. So in the end many NGOs, and we as well, end up doing the projects ourselves instead of just giving the money to the country so that the ministry there can actually build schools, hospitals and so on. That's the real problem, and already in the Paris Declaration of 2005 we decided that we should do all this on the systems of our partner countries. That never happened, because of lack of trust. So what's our proposition to solve this? We called it TruBudget. We figured out that the solution is not to install some kind of SharePoint or Google Workspace, another intermediary, because whoever owns the data is more powerful than the rest; data is the new oil, basically. So we don't want to own the data, and if the partners own the data, we potentially don't trust them either.
So the idea was to build this decentralized solution, which is truly decentralized, and to manage the data there. What I mean by data is actually the workflows: who is doing what, how are the projects implemented? For example, when we build schools: how are the tenders done, how was the money disbursed, who did what. So what is actually stored as decentralized data is the status of all the different workflows of all the different participants in this network of people and organizations. Technically, it's a front end based on Material UI and React, so the JavaScript stack; on the API side it's a Node.js server; and on the data side it's a very small blockchain solution. It feels like a key-value data store, basically, but it has a very nice property: it synchronizes across different nodes. There's a kind of consensus mechanism so that the data is synchronized across the different parties. And that is very important, also from a political view: you process this data at eye level, and there's not one party that owns the data while the other one doesn't. So if any one of these participants here (this is only a very simplified view) dropped out of the picture, it would still work, right? So where are we at the moment with this? We've been doing this for a couple of years, as I said, and we are registered as a digital public good. We have a couple of pilots, for example with the Brazilian Amazon Fund, a very important one, where Germany paid money if less Amazon forest was destroyed. We had the Vaccine Alliance, also very important. Burkina Faso is, I think, one of the oldest projects we did, or one of the first we started with: with the Ministry of Water there, we try to manage this data and get to a situation where we can actually give them the money, and they use the money to do good things, instead of us developing the projects, which we believe is not the most sustainable approach.
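The "very small blockchain" idea, an append-only log of workflow statuses in which every entry is linked to the previous one by a hash, can be sketched in a few lines. This is a conceptual toy with invented entries, not TruBudget's actual implementation:

```python
import hashlib
import json

def append_entry(chain, entry):
    """Append a workflow-status entry, linking it to the previous one by hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps({"prev": prev_hash, "entry": entry}, sort_keys=True)
    chain.append({"prev": prev_hash, "entry": entry,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify(chain):
    """Recompute every link; tampering with any entry breaks the chain."""
    prev_hash = "0" * 64
    for block in chain:
        payload = json.dumps({"prev": prev_hash, "entry": block["entry"]},
                             sort_keys=True)
        if block["prev"] != prev_hash:
            return False
        if block["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev_hash = block["hash"]
    return True

log = []
append_entry(log, {"project": "school", "step": "tender", "status": "open"})
append_entry(log, {"project": "school", "step": "tender", "status": "closed"})
```

Because every party holds a replica and can run the verification themselves, no single participant can quietly rewrite the history of who did what, which is exactly the trust property the talk describes.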
Yeah, of course, as I said, you are invited to contribute to this project. It has been running for a couple of years already, we have a couple of contributors, and it's also the first open source project we did as KfW, a German state-owned bank. Remember the talk we had before: for state-owned organizations it's tricky to do open source. We achieved it here, and I'm quite happy to be part of this project and would also be happy if you joined us. Thanks a lot. Thank you.
Moodle: Empowering educators to improve our world
Hello everyone, so my name is Noel and I'm here to talk about Moodle. Moodle is a learning management system that you can use for your online learning and teaching, and our mission is to empower educators to improve the world. We want to do this in an accessible way that can be used by everyone and that can be customized for every use case. We do this through open source, and actually Moodle started more than 20 years ago: the first commit was actually the same year as the first edition of FOSDEM. Preparing this talk I have been looking at the archives, and this is actually the first talk about Moodle. It had been mentioned here and there, but this is the first talk specifically about Moodle, so it may be the first time some of you hear about it, and I hope you find it useful. We are a certified B Corporation and Moodle is a registered digital public good. In case you don't know who is using Moodle at the moment: more than 400 million users, translated into more than 160 languages, the translations mostly contributed by the community. You can find these stats at stats.moodle.org, but I have to mention that since Moodle is open source and can be self-hosted, all this information covers only the sites we know about, so in reality there are probably more people using Moodle than this. In this slide maybe there are some logos you recognize. We know that Moodle is used by more than 60% of higher education institutions; it's also used by many education ministries, and by many governments and NGOs, so Moodle is used all around. And who is making Moodle? Well, an important part of the contribution comes from the open source community and other companies, but mostly it is done by Moodle HQ, which is the company I work for.
We are currently more than 200 team members distributed in more than 20 countries, and we speak more than 26 languages. And I didn't want to leave without mentioning the tech stack: Moodle is made with vanilla PHP and vanilla JavaScript with an SQL database, and the mobile application, which is the team I actually work for, is made with Ionic using Angular. So in case you want to learn more, you can look at the code in the repositories. Also, I mentioned that it's very customizable to different use cases, so you can build plugins for Moodle, and if there is something that it isn't doing already, there is likely a plugin already out there for that, and if there isn't, you can make a plugin yourself. You can read the developer documentation to see how to build one, both for the LMS and for the Moodle app. And finally, even though the Moodle LMS is at the core of everything that we do, there are also many other things. For example, I already mentioned the Moodle app, which is interesting for low-resource environments because you can use it offline: you can download the contents, complete the exercises and everything, and it synchronizes when you go back online. We also have Moodle Cloud: you can self-host Moodle, but if you want to get started quickly we have a software-as-a-service solution, which is Moodle Cloud. We also have MoodleNet and Moodle Academy to share and find learning resources, and if you want to integrate Moodle in your organization we have Moodle Workplace and Moodle Certified Partners and service providers. So there is a lot more that you can dig into if you want to learn more. So that's it: you can learn more at moodle.com, and if you need to contact me my mail is noel at moodle.com. Thank you.
What can digital open source projects do to reduce our environmental footprint
So how many people here are worried about, say, climate change? Yeah, a little bit of a concern. There are definitely issues: crazy weather, blah, blah, blah. What does that have to do with open source and my talk? Well, let me tell you. We live in a finite world. And as much as we want to believe that the cloud is green, it's not. Everything that's digital is tied to an atom. So as I say here, even if you don't measure it, it still matters. It still matters because we're looking at the environmental footprint of our digital lives, and it's significant: it's about the same size as the airline industry, and it's growing exponentially. So when we're thinking about our digital infrastructure, every piece, every bit is tied to an atom, whether that's electricity or hardware. It all comes from somewhere; it all has an impact. Think about the lifecycle of our products. It's not just the use of these devices that we have around; it's also the question of the creation and disposal of them as well. There are a lot of systems that are interrelated, and they have a huge impact. So when you're developing code, or if you're doing open hardware, think about the ecosystem you're working in. It's so interrelated. Whether you're a JavaScript person who's working on the web, a PHP person who's building content management systems, or a Python person who's involved in creating data processing tools, all of these things that have communities rely on networks of other pieces of code and a lot of people to maintain and organize them. So think about that ecosystem of people and code that our projects work on. So much of this is thinking about sustainability as a measure of quality. How do we make sure that good code is both accessible (because I have to say that I'm an accessibility person) and also sustainable, so that we're trying to minimize the impact that we have on the planet?
And that is baked into the definition of quality: we're thinking about it early in the process, we're not waiting until the very end to evaluate it. We're trying to build it into our CI/CD pipeline so that we're catching errors, and we're looking to minimize our website, or we're minimizing the impact of our code, on a sprint-by-sprint basis. And having a livable planet is not a feature request. We need to have that; it's the bare minimum that we need for our society, and we need to start working together around that. There's so much to learn and so much happening in this space right now. 20 years ago, this was not something that was generally thought of. People were like, well, just don't print out your web pages and your emails and you'll be fine. It's like, no, that's not good enough. The information is changing very quickly; there's a lot to learn in the space. And I think it's really important to try and learn that, but also to find ways to contribute. So where can you give back? Where are the experiences that you've had? How can you get involved in measuring your project's impact and moving ahead on that? How do we look at leveling up our expectations, encouraging more people to discuss and to learn about this? So it's really important to have these talks. There's a whole section of talks here at FOSDEM on energy, as there was last year, and that's wonderful. If you're going to the State of Open conference, last year they had a whole sustainability track as well. Making sure that there's some conversation about sustainability as part of your project is really important. Getting people engaged and doing something about sustainability is a good way to keep optimistic about it and keep the attitude that we can make a difference, we can make a change. This is something that is doable. So get people involved. And this is a huge problem.
But everything that we do is, in the end, going to be insignificant as an individual contribution; yet as Gandhi said, it is important that we do it. We need to find ways to contribute and to play our small part to move things ahead. There are lots of best practices and standards out there. There's one, the Web Sustainability Guidelines, that was just launched as a draft in September; there's still evaluation and development going on on that. That's from the Sustainable Web ("Susty Web") community group. There's also the Green Software Foundation, which has done some really good work building infrastructure around this. The Green Web Foundation is another one that has infrastructure and information about that. Also the IETF and the IEEE, I think, have sustainability projects as well. So there are lots of different ways, no matter how you're involved in the tech world, to look at best practices in sustainability that you can work with and extend. And that's all I have. Yeah, any thoughts? Okay, any questions? Does anyone here have a sustainability question? Go ahead. Any practical tips, practical steps? One practical step is to look at where there is processing time and data transfer. How do you try to minimize the effort, and make sure that you're counting the milliseconds used to process it? What are the process-heavy things that your code is using? Yes?
How to Use Private Data in Generative AI: End-to-End Solution for Retrieval Augmented Generation with CrateDB and LangChain
in the morning on Sunday. It's nice to see you all here, looking very bright and early. So we shall get straight into it. Let me welcome the first presenters of the day, Maria and Christian from CrateDB, who are going to be talking about privacy and generative AI. Thank you. Good morning from our side. A pleasure to open the dev room today, and thanks for being here that early on a Sunday morning. We're going to talk about a very interesting topic: generative AI, how to use your own data, and how we can build such applications based on open source software. I think everyone is used to OpenAI and ChatGPT, but you never know what happens with your data in those cases. So, a very brief overview: this is gen AI. I think everyone in the room has played around with it already, so just a very quick summary of the basics. You have your source data of any sort: it can be text, it can be code, images, audio, videos. Everything is transformed via encoders, with billions of parameters, a lot of text, a lot of input used to train the so-called foundational models. We as users formulate prompts against them: we ask the models questions, they do their job and generate the output, and a language model does nothing else than predicting the most likely next token it should generate. That's all the magic behind it. We see a very, very big potential. When I first tried ChatGPT more than a year ago, it was amazing. It started to write code for me, it started to generate articles. I even went to some tools out there, gave them 30 seconds of my video, and all of a sudden I can be a virtual speaker. Very, very impressive, super fast, but there's also a downside to it. Obviously, some quality issues: all of you have heard of hallucinations. Last week we had the example of: what color is water? Is it blue, or is it really transparent? Depending on your training data, if you use children's books, the water is obviously blue.
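The "predict the most likely next token" idea can be illustrated with a toy bigram model in plain Python. This is only a sketch of the principle, not how a real LLM works; the corpus and function names are made up for illustration:

```python
from collections import Counter, defaultdict

# Train a toy bigram "language model": count which token follows which.
corpus = "the water is blue . the water is transparent . the sky is blue .".split()
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def most_likely_next(token):
    # Greedy decoding: pick the single most frequent successor token.
    return following[token].most_common(1)[0][0]

print(most_likely_next("water"))  # -> "is": the only token seen after "water"
print(most_likely_next("is"))     # -> "blue": seen twice after "is", vs. once for "transparent"
```

Real models do the same thing in spirit, just over subword tokens with a learned probability distribution instead of raw counts, which is exactly why biased training data (children's books saying water is blue) biases the output.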
If you use real-world training data, water should be transparent. Same with snowflakes: they're not white, they are transparent, technically. Also a lot of ethical questions, a lot of governance questions. Official government people talking to deepfakes without realizing it: also a big threat that we have in the future. We also have to be aware of some environmental impact. The key thing we want to talk about today is quality and reliability, with the importance of current, accurate, and also private data that is not available publicly. Because all of these foundational models have been trained on public data: what's on GitHub, what's on the internet, what is in the documentation. Yesterday I watched a presentation with a clear message to everyone writing docs: we are responsible for what these models tell us. If you write bad documentation, we get bad results from ChatGPT or other models, because they have been trained on not-so-good training data. Here, for example, Maria found a promo code on OpenAI's website: if you register there and put in the code, 20% off. But unfortunately it was not working. So asking ChatGPT, hey, how can I apply the promo code? I'm sorry, I don't know about this promotion. That's something you don't want to happen if it's a company chatbot; you want to avoid this. So it's a perfect example of why we need this current and accurate data, up to the minute, maybe even up to the second. And obviously non-public, private data: internal documents, confidential documents, documentation that is not public. One customer we work with, for example, takes legal documents and technical documentation, vectorizes them, puts them into a language model, and then the maintenance workers have an application ready. But this is information that also must not leak.
And this brings us also into a little bit of a dilemma, because there are multiple options to bring this private data into the foundational models, or to enhance these foundational models. First option, and again I think everyone in the room has heard about it, is fine-tuning, where you give some input data and really change the parameters, the weights, in the foundational model, so that the knowledge gets incorporated into your fine-tuned LLM. Very good: you put the domain knowledge in there. But there are also challenges, right? You don't solve the freshness issue of the data; it's still static knowledge. There's research out there showing that one single wrong training data record can hurt the overall performance: one guy says the water is blue, and all of a sudden the chatbot's response is that all water is light blue, or something like this. And it doesn't solve the problem of hallucinations: you might still get a lot of hallucinations, not to mention the resources that you need. So, second option: retrieval augmented generation, which has developed into kind of a standard when you want to work with your own data. The first step is that you really need to make the existing data, whether it's videos or data from internal databases or documents, available to create the embeddings, to calculate the vectors, how this knowledge is internally represented. And then, as soon as your user asks a question in the knowledge assistant or the chatbot, a so-called retriever is asked: hey, please give me the relevant context. And this can be a similarity search in the vector database, or it can be a combination of various searches: a full-text search, a geospatial search, a regular SQL query to get information out of your databases. This context is returned back to the retriever.
It is put into a special prompt, as context, as additional information, and together with the question and this additional context, a large language model can now generate your answer. And you can put into the prompt, as we will see in the demo: please use only this contextual data; if you don't know the answer, please say you don't know. That limits the hallucinations a lot; it doesn't prevent them 100%. Good. I think I talked about the disadvantages and challenges already. One advantage I forgot to mention is access control: now that you really get this context from either the vector store or a different database, you can put fine-grained privileges there. In the example application that I mentioned before, some of the maintenance workers are not allowed to use the legal documents, for example, so they don't use the index or the embeddings of the legal documents, but they are obviously allowed to use the technical documentation. And someone from the legal department asks: oh, what is the support contract with XYZ? Are we now in liability? Et cetera. Obviously, they then need different search indexes. How to do this? How are the semantics represented? The key is the vectors, or embeddings. And a vector is nothing else than a series of decimal values, an array of decimal values, with a lot of different embedding models out there already. And every model has its strengths and weaknesses. Some are more optimal if you use, for example, German text, Chinese text, or Indian text, right? A very different way of coming up with the semantics and analyzing how the attention mechanisms internally work, because the sentences are built in very, very different ways. So you see different performance there, or highly specialized models: you do image recognition, oh, it's a sleeping cat, and this can then be vectorized as well, and you can search for this context in your vector store.
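Since an embedding is just an array of decimal values, similarity search boils down to comparing such arrays, commonly with cosine similarity. Here is a toy stdlib sketch with hand-made 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions, and a vector store does this at scale with indexes):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for two documents (values invented for illustration).
docs = {
    "sleeping cat": [0.9, 0.1, 0.0],
    "support contract": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # imagined embedding of "a cat taking a nap"

best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # -> "sleeping cat": its vector points in nearly the same direction
```

The retriever described above does essentially this, only against millions of stored vectors, and possibly combined with full-text or SQL filters.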
And now, if we think this one step further, what could an architecture look like for such a knowledge assistant or chatbot? A prototype is always easy to build, but you need to think about a lot of additional topics. First of all, it starts with the data, right? The data that you want to vectorize, that you want to make available for your search. So we've shown here a landing zone fed from different sources; these can be the original sources, or you might copy the data, depending on the architecture you want to build. And the important thing is the processing layer: how do you chunk your data, how do you create the vectors? And obviously, you need to store these chunks of information together with the vectors and provide proper data access control. The second part here is the LLM part, which I've talked about multiple times now: you need access to the embeddings, you need access to the large language models, and then there also needs to be some logging. What did you use as a query? How much cost does it incur? Is the performance okay? A lot of logging also happens here. And intentionally, an LLM gateway is put in front of it, because it needs to be changeable. Chatbots have a lot of functionality; I don't want to go into all the details, and obviously there's monitoring and reporting. And the beauty of it: you can build all of that with open source tools nowadays, and the embedding and language models can also be open source; there are a lot of alternatives out there. Now, why CrateDB and LangChain? You need robust data management. As we have seen, there are a lot of different data sources and data stores involved here, whether it's logging, whether it's semantics; your agents communicate in JSON. So you need to store all of this information, ideally in one store, not five or six different databases that you need to operate, where you need to learn the language, et cetera. And besides LangChain, other options are also out there; think of Haystack and others that you could use.
But all of these frameworks give you a very good set of building blocks you can just use. They are available in Python and JavaScript, there are also Java ports out there, and ports to other languages are now available. Everything you need is already in these libraries to come up with your overall architecture. And that's now the point to hand over to Maria. She will guide you through a demo where we try to simulate how you can use support tickets, internal data. Here we took some Twitter posts about Microsoft support; we will vectorize them, and we'll show how a support agent or a customer can then interact with this chatbot and ask certain questions. It will demonstrate that it's not such a big effort: you can get started right away. And the demo: we put the link here on the slide; you'll also find the link to the demo in the app, or on the website for the talk. Thank you. Do you hear me? Okay. Awesome. Thank you. So, you have heard a lot of theoretical aspects of RAG and how it works. I have a little bit more than 10 minutes to show you a practical example, but believe me, we could have an hours-long workshop on this topic. So essentially, the idea today is to show you how to augment some of the existing LLMs with private data and how to use it as context for some specific questions that this LLM has not seen so far. We actually use data that captures customer interactions on Twitter, and these customer interactions involve different questions from users about Microsoft, Amazon, all these different products, and how the support teams from these big companies actually answer these user questions. So this is not something that you usually find on the internet very easily. If you have some problem with some Microsoft product, yeah, very often you can actually find the solution out there. But some very specific questions are asked directly to customer support, and that's probably exactly why they're sent to customer support in the first place.
So you wouldn't find the answer to these out of the box. And we will use CrateDB as a vector store to support this example. I think Christian already gave you a good overview of what CrateDB is. What is LangChain? LangChain is an open source Python project that is used to facilitate the development of LLM applications. It's a pretty cool project that integrates a lot of large language models, a lot of models for calculating embeddings, and it actually helps you integrate some data source with some language model without having to figure out how the full engineering pipeline should look; you can just do this in a couple of lines of code. May I add one point here that I forgot to mention: although LangChain is a very good starting point, what we have also seen is that for very advanced purposes, you want to directly interact with your data, with your source data, with your vector store, and all of that is available in standard SQL, no matter which data model you're using. And CrateDB is open source; one of the easiest ways to run CrateDB is actually to use a Docker image. Vector support in CrateDB has been available since version 5.5, but if you always pull the latest image, you should not have to think about this. So once you run this docker run command, we actually run an instance of a CrateDB cluster, and then we can access the admin UI on localhost. Currently, I think because of the resolution of this screen, not everything is visible, but in this admin UI you have a couple of tabs that you can use to monitor your cluster, to run queries in the console, and also to have an overview of the tables and the views that are available in your database. So let's go back to the example, because the time is flying very fast.
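The docker run command Maria refers to looks roughly like the following; the image name (`crate`) and the admin-UI port (4200) match CrateDB's published defaults, but verify the exact flags against the current CrateDB installation docs before relying on them:

```shell
# Start a single-node CrateDB instance for local testing.
# Port 4200 serves the HTTP API and admin UI, 5432 the PostgreSQL wire protocol.
docker run --publish 4200:4200 --publish 5432:5432 crate

# The admin UI is then reachable at http://localhost:4200
```

For anything beyond a quick test you would pin a specific image version (5.5 or later for vector support) instead of pulling `latest`.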
So the first step is that we need a couple of import statements to make sure that LangChain and all the libraries that we use in this example are available. What is also important is that you import the CrateDB vector search interface that is available in recent LangChain versions, which is used to interact with CrateDB. As a next step, because we need to interact with the CrateDB instance, we need to specify how we connect. This is done by specifying a connection string. We are using the open source version running on localhost, but you also have the option, for example, to deploy a CrateDB Cloud cluster, and at this point we also give all users the option to deploy one cluster that is free forever, so you can just run it and use it for testing purposes. Finally, we need to specify the collection name that we are going to use in this notebook session. So if we run this piece of code, the connection string is now available, and then we can start interacting with CrateDB. For the purposes of this notebook, I rely on OpenAI models. Of course, LangChain supports so many different models, and you can integrate many of them, but if you choose to use OpenAI, make sure that you have the OpenAI key as part of your environment variables. So now let's take a look at what the dataset looks like. This dataset is also available in our CrateDB datasets repository, which is also open source, and it contains the customer interactions about Microsoft products. Essentially, we would like to narrow the scope of this notebook, for illustration reasons and time reasons. This dataset has some information like who is the author of this question, whether it's an inbound or outbound tweet, when it was created, what was the content of the question or the answer, and whether this text is a response tweet or was created in response to something else.
So essentially, all this information. And now the idea is to feed it to the large language model and to ask questions whose answers could be found in this dataset. So the first step, if you remember the big RAG diagram, is to create embeddings. Embeddings are actually the representation of your data that is suitable for machine learning and AI purposes. First we need to load the data from this dataset, and for this we use the CSV loader interface that is available in LangChain, and now, in just a few lines of code, we are already creating embeddings for all the entries in our dataset. If I go back to the admin UI, I can see two tables. The first table gives me the collection of entries; as we defined it, the first collection we created is called customer data. But what is interesting now is to see the embeddings created for all the entries in this collection. So for example, this is an instance of a document that we are actually using for context purposes, and you can see what the embeddings look like. If you use OpenAI embeddings, the length of your vector is going to be around 1,500 (1,536 for the default OpenAI embedding model). But you can also, for example, choose some other embedding algorithm, for example from Hugging Face, as suggested here, which is open source and can easily be used out of the box in just two lines of code. Now, once we have these embeddings, let's define our question, and our question today is: okay, I have an order in the Microsoft Store, but I want to update the shipping address; how do I do this? I also put alternative questions here, so when you play with this notebook you can also put in your own questions and see whether this dataset has enough information to answer them.
So once the question is defined, what we want to do is find the context that is relevant to this question, and this is done by doing a similarity search of the vector representation of our question against the vectors that we stored in the CrateDB instance, and this is done in just one line of code. As Christian suggested, vector search is one way to find the relevant context; of course, CrateDB supports other types of searches, like full-text search, geospatial search, or just keyword search, so you can use different types of searches combined together to find the relevant context for your question. Once we do this, we are ready to actually ask our LLM to answer our question. And how do we do this? First, we need to create a prompt that explains to the LLM what its purpose is. Its purpose today is to be an expert on Microsoft products and services; it should use the context that we are going to give it to answer relevant questions, but if the answer is not found in the context, it should reply with "I don't know". This is a very simple way to create a prompt that gives instructions to the LLM on how it should answer specific questions. And finally, we just need to create a small chatbot by using some of the available models that are integrated with LangChain, passing this context together with the user question. Once this is completed, we can access the answer, and in this case it says: to update the shipping address, you will need to cancel your current order and place a new one. Maybe that's something that is still up to date, maybe it's not relevant anymore, but it's actually something we learned only from the dataset we provided. So this is how you actually use your private data to teach the LLM what the context for any incoming question should be.
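The retrieve-then-prompt step Maria describes can be sketched without any framework. This is a toy stand-in, not the actual LangChain API; the function name and prompt wording are invented, modeled on the instructions she reads out:

```python
def build_rag_prompt(question, context_snippets):
    # Stuff the retrieved context into the prompt and instruct the model
    # to answer only from it, replying "I don't know" otherwise.
    context = "\n".join(f"- {snippet}" for snippet in context_snippets)
    return (
        "You are an expert on Microsoft products and services.\n"
        "Answer using ONLY the context below. "
        'If the answer is not in the context, say "I don\'t know".\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# The snippets would come from the similarity search against the vector store.
prompt = build_rag_prompt(
    "How do I update the shipping address of my order?",
    ["To change a shipping address, cancel the order and place a new one."],
)
print(prompt)
```

In the real demo, LangChain assembles a prompt like this from a template and sends it to the chosen chat model; the "say I don't know" instruction is what keeps hallucinations down.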
So I hope you liked this demo. You can play with this notebook; it's in our CrateDB examples repository, and you can also see there are other similar notebooks for different types of examples: different prompt-engineering examples, how to create other forms of chatbots, how to use other embedding algorithms. So please let us know what you think, give us feedback, open a new issue on this repository, and we are looking forward to working with you on these topics. So I think that is all from us; thank you for being part of this session. Maybe we have time for one question. Okay, awesome, do we have questions? Anyone? Thank you for the talk. I have a question about the embeddings model, because if you encode the prompt with the language model and use an external embeddings model, couldn't they be in different spaces? And if you do similarity search, have you tested it, and do you see the effect of different embeddings? I mean, it's a very important question. The way you create these embeddings is super important, and you're usually limited to one embedding algorithm, because they need to have the same length, and obviously they need to capture the same semantics, simplifying a bit. And this is also what I meant with the customers that we work with: they were able to create different indexes, right? And then the retriever gets more and more complex. As you've seen on this architecture slide, this is a simplified example: maybe you need to query different indexes created by different embedding algorithms, so that you can search your images, you can search your textual data, right? Obviously, you might use different things there, and then re-rank the results to come up with the really relevant context, maybe from different indexes. And maybe you also want to combine it with a full-text search, or limit it, trying to come up with a good example, to customer support tickets from Europe, or to customer support tickets from the US
with some geospatial condition. But it is then the re-ranking of the results that really identifies the particular context that is relevant for the question. Okay, thanks a lot. Any more questions? No? So thank you very much for the very nice talk. Thank you.
A murder party with Lea
Okay, so now we can start. Thank you very much for coming to the Python dev room and getting up early on Sunday morning with this cool weather outside. So now we are going to have a very, very nice talk by Pierre Denis, who is a long-time Python user. He's also the creator of Lea, and he's going to talk about Lea in this talk. Lea is a Python module for helping to calculate probabilities in situations presenting uncertainties. And what that means, I hope he's going to explain to us now. Thank you. So welcome everybody. We are here about something serious, a sad story. I'm not a good storyteller, I'm afraid, but okay: Dr. Black has been killed last night. Maybe you have heard about that. And okay, we have four suspects that have been identified, with given probabilities of being the killer. It seems that Colonel Mustard is most likely the killer, with 40%. Then we have Mrs. Peacock, 25%, Mrs. White, 10%, and Professor Plum, 25%. Okay, these are prior probabilities, but we have the help of a profiler. This guy is very smart. And he can tell, for example, that if Mrs. White is the killer, she'll be absent from the investigation with 95% probability. Otherwise, if she's innocent, she'll be absent with only a probability of 20%. And the profiler gives you several statements like this, with probabilities. So when you see this kind of situation, you think: okay, it's quite complex. How can I use this information? Because nothing is certain. Okay. So the investigator is Lea. Here, Lea is not a person, as you have understood: it's a module dedicated to probabilities. So okay, I have several statements here. In other presentations I elaborate on this, but this time I prefer to show you Lea in action so you can better understand what it is about. My claim is that Lea is something quite easy to use, quite intuitive. You probably know that there are several packages dedicated to probability or statistics.
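The kind of inference Lea automates here can be checked by hand with Bayes' rule. A plain-Python sketch with exact fractions (this is not the Lea API, just the underlying arithmetic, using the priors and the profiler's statements from the talk):

```python
from fractions import Fraction

# Prior probability that Mrs. White is the killer (10% in the talk).
p_white = Fraction(1, 10)
# Profiler: P(absent | White is the killer) and P(absent | White is innocent).
p_absent_if_guilty = Fraction(95, 100)
p_absent_if_innocent = Fraction(20, 100)

# Total probability that Mrs. White is absent from the investigation.
p_absent = p_white * p_absent_if_guilty + (1 - p_white) * p_absent_if_innocent

# Bayes' rule: P(White | absent) = P(absent | White) * P(White) / P(absent)
p_white_given_absent = p_white * p_absent_if_guilty / p_absent

print(p_white_given_absent)         # 19/55
print(float(p_white_given_absent))  # about 0.345
```

So observing Mrs. White's absence would raise her probability of guilt from 10% to roughly 35%; Lea performs exactly this kind of conditioning (via its `given` method) across all the suspects and statements at once.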
The core feature of Lea is to be easy to understand, and probably well suited for education. Okay, let's start. First, I import Lea, which is here in a beta of version 4. First of all, I want to define a fair coin with head and tail. I do that. Lea can work with any Python object: here I define probabilities on strings, but you can define probabilities on numbers, on any Python object. Here, for education, I prefer to switch to fractions. You know that Python has fractions included, so I've switched the display to show fractions. If I want to create a biased coin, I can define several values, and here it means that tails will be three times more likely to come up than heads. So I have a new probability distribution. What I'm doing here is a crash course on Lea, because we want to be acquainted with it before doing the investigation. I can also use a probability mass function to define probabilities as fractions. Matplotlib is integrated, so you can display a histogram of any probability distribution. Okay, now I want to make 100 throws. I use my b_coin variable, my probability distribution, to make 100 random coin throws. You see in these random throws that there are more tails than heads. But how can I be sure that it follows the probabilities that I have given? Simply, you can use the same Lea function as before, lea.vals: you provide the values, and this time it will use the random sample as a frequency counter, and you see that, more or less, it conforms to the probability distribution that I provided for the biased coin. What is interesting with this kind of object is that you can use much of what you usually do with Python objects. For example, you can index: if I ask for zero, it will take the first letter of head or tail, H or T. I can chain with the Python lower method and I have a lowercase h or t. I can map Python functions here.
This means that it counts the number of characters, which is four: head and tail have four characters each, so we have a certain four. And as you could expect, all the operators are overloaded. So if I concatenate my b_coin distribution with a fixed string, I have a new distribution that follows what has been defined. And here is something a bit funny: what happens if you multiply a die with a coin? You get that. Okay, let's now throw two coins. The new method allows you to define a new event with the same probabilities. Here I define two coins which are biased together. If I add them together, I have all the possible combinations with their associated probabilities. We will see that this is very important: we are able to calculate conditional probabilities with the given method. So here I try to see, assuming that I know that the first coin is tails, what is the combination of the two coins? Here we see that the previous result has been filtered to keep just the two remaining possibilities. A common feature of Lea is that when you define variables, there is a kind of lazy evaluation: they remain linked together in a network that defines the relationships, the dependencies, between the random variables. You can also define Boolean events, like the event "to be", with a given probability. And then I can use operators, like "to be or not to be". And the result is certainly true, because "to be" is either true or false, and "not to be" is the contrary, so together it's certainly true. There is also a dedicated function in Lea, which is P, so you can extract the probability of true; you get a real probability value like this. Okay, let's go on. Here is an excerpt of a book that is three centuries old, from Abraham de Moivre. It's probably one of the first problems solved by de Moivre. Let's find the probability of throwing an ace in three throws, given a fair die.
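As a quick cross-check of the problem just stated, here is a brute-force enumeration with the Python standard library (plain Python, not Lea code):

```python
from fractions import Fraction
from itertools import product

# All 6^3 = 216 equally likely outcomes of three throws of a fair die.
outcomes = list(product(range(1, 7), repeat=3))

# Probability that at least one throw is an ace (a 1).
favorable = sum(1 for o in outcomes if 1 in o)
p = Fraction(favorable, len(outcomes))
print(p)  # 91/216
```

This agrees with de Moivre's value: 1 minus (5/6) cubed is 91/216.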
This is how to calculate it in Lea. Here I define a die; I create three instances, which are independent, assigned to d1, d2, d3. And then I ask for the probability that any one of these dice is an ace. The result is 91/216, as calculated three centuries ago by de Moivre. So far so good. Now, I don't know if you like playing role-playing games; here is a small example where you can use Lea. Imagine that you have this dwarf, which fights against a troll. I first define a new kind of display, with percentages, because it's more convenient here, and I define two different kinds of dice. Imagine that your attack roll is a d20 plus 4. What is the probability of a hit? You see, it's easy to calculate with an inequality: you have to be greater than or equal to the troll's armor class. You get this probability. Then the damage of the magic axe is 2d6 plus 5; here is the result. But this damage is only applied if the dwarf can hit the troll. For that we have a special construction, lea.if_, with an underscore to avoid collision with the Python if. This means: if there is a hit, then I apply the magic axe damage; otherwise, the damage is zero. And here is the new histogram. This is the probability distribution of the actual damage that is done to the troll. From this data you can answer: assuming that the troll has 20 health points remaining, what is the probability of killing him in four rounds or less? You see, it's deadly simple to calculate with this formula; we find it's 40%, something like that. Okay, you follow? I have many, many examples, but for lack of time I will maybe drop some of them. The boys or girls paradox is something very funny that you can also find on Wikipedia. The chances to be a boy or a girl are even: boy, one half, girl, one half. Mr. Smith has two children; at least one of them is a boy. What is the probability that both children are boys?
Many people, including myself the first time I heard this, think: okay, the information gives me no clues, it's one half. But if you calculate it like this with Lea, where you define children as a joint of two children, count the number of boys, and calculate the conditional probability, the answer is actually one third. And what is interesting with Lea is that you can understand why this is the answer, by asking Lea to show you all the combinations. Here I show the genders of the children and the number of boys, given that the number of boys is greater than or equal to one. We see the answer here, and we understand better why it is one third. Okay, it's a bit fast, but you can redo it at your own pace later. What happens if you have a more elaborate problem? Like here: we have several children, the eldest is a boy, and he's got three brothers at least. What is the probability that all the children are boys? You can model it like this. Here I create seven children and I put the conditions; you see, when you read this expression, it's quite close to the initial problem. Of course you have to understand the elements of Lea to do that, but after that it's quite easy to model. The answer is 1/42. Again, it's possible to ask why it is so, and here, by joining, you see that seven children is this part and the others are that part, so you can better understand why it is so. Okay, I will skip the Monty Hall problem, which is well known; you can read it after the session, offline. Let's go back to the initial problem. First I change the display options, then we define the prior probabilities like that. Here I ask Lea to display the probabilities on one line, because it's more convenient in this case, and as percentages. So we have this, and we see that Colonel Mustard is a priori the most likely killer. Okay, let's now try to write down the different pieces of information we have. So if Mrs.
White is the killer, she'll be absent with probability ninety-five percent. So I define here a variable, "Mrs. White is absent", using the if_ as we've seen before. I put the condition: if the killer is Mrs. White, then she'll be absent with ninety-five percent, else twenty percent. Okay. This is the probability that Mrs. White is absent. But it's not very interesting on its own, because we are more interested in who the killer is; we will see what happens later. Then we can continue and define other rules, like this: if Mrs. Peacock is innocent, she knows who the killer is with probability seventy-five percent. You see, there is a missing piece of information here, which is the else part, but we assume that Mrs. Peacock is sane, and if she's the killer, then she knows who the killer is, hopefully. So I put the else part at one hundred percent. Then we can elaborate more complex information, like this one. I will not go into detail, but you see again that the translation into Lea is quite straightforward when you see the statement. And the last one is here. What we have done here is to define what we call a Bayesian network, which puts relations between the different random variables. What is interesting with this kind of network is that if you get evidence about something, you can go backwards and refine the probability of being the killer. For that, I define a list of evidence here. First of all it's empty, and the conditional probability is the same as before, because I have no new evidence. Imagine now that Mrs. White is absent: I can add it to the evidence and define a new conditional probability. You see it changes a bit. Evidence two, added to the previous one: Mrs. Peacock is drunk. I add this information and I get new probabilities, and so on. Professor Plum accuses Colonel Mustard. And finally, we know that the killer is a woman. For that, I use here the Python startswith with "Mrs."
because it's a handy way to say, given the suspects, that the killer is a woman. I add it to the evidence like that, and you see there is a new probability distribution: there are just two suspects remaining, two women, and Mrs. White is likely the killer. Okay. Maybe you can consider this as a game, but sometimes probability can play a very important role in some trials. A long time ago there was the Dreyfus Affair, where there was a big flaw in the reasoning of a so-called expert. And more recently, the Sally Clark case, where there was also bad reasoning about probability. I want to mention also that Lea is able to do symbolic calculation, by using the SymPy module that maybe you know. It's very easy: it's the same interface, but instead of numbers you put variable names between quotes, like this, and you have probabilities defined with formulas. So you can redo all the same exercises and you will get formulas for being the killer, etc. A small example here: I won't detail it, it's a binomial function with p, and here I calculate a conditional probability and it displays me a nice formula. You can check offline, if you want, that it is correct. Okay, I want just to finish with my bullshit generator, which was made 15 years ago. The goal is to produce sentences at random, based on a list of words and a list of grammar rules like this. You see that I put a probability on each grammar rule, so that the simplest rules are used preferentially, to avoid getting overly long sentences. So yeah... okay, I get... so it has produced... I don't know what happens here. Okay, I restart my kernel. Normally it's supposed to speak and to write down sentences, but... okay, anyway, you can play with that too. The Python code is really small, so you can try it yourself. Oh yeah, of course, I didn't import Lea. Okay. That's it.
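The first evidence step of the investigation (conditioning on Mrs. White being absent) can be reproduced by hand with a standard-library Bayes update; this is a sketch of the arithmetic behind the talk's demo, not of Lea's API:

```python
from fractions import Fraction

# Priors for who killed Dr. Black, as given in the talk.
prior = {'Colonel Mustard': Fraction(40, 100),
         'Mrs. Peacock':    Fraction(25, 100),
         'Mrs. White':      Fraction(10, 100),
         'Professor Plum':  Fraction(25, 100)}

# The profiler's statement: Mrs. White is absent with 95% probability
# if she is the killer, 20% otherwise.
def lik_white_absent(suspect):
    return Fraction(95, 100) if suspect == 'Mrs. White' else Fraction(20, 100)

# One Bayes step: multiply prior by likelihood, then renormalize.
unnorm = {s: prior[s] * lik_white_absent(s) for s in prior}
z = sum(unnorm.values())
posterior = {s: p / z for s, p in unnorm.items()}
print(posterior['Mrs. White'])  # 19/55, about 34.5%, up from the 10% prior
```

Chaining further evidence (Mrs. Peacock is drunk, and so on) is just repeating this multiply-and-renormalize step, which is what the Bayesian network does behind the scenes.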
But anyway, sorry for the small interruptions. I think we don't have time for questions or... maybe one question. Okay. Thank you for the presentation. I have indeed one question, which is about performance. Do you have information about the performance of your library compared to other libraries, or what are your insights on that? Yeah, it's a good question. It's not really the main concern. Here, as you have seen, the results are exact, and as you have also seen, it's quite fast. There are several optimizations. I have no figures, but, as you can expect, there are many problems which are very complex, and for those Lea provides several Monte Carlo algorithms that give approximate results in a fair time. But I have no figures. Okay, thank you. Thank you very much.
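For readers redoing the examples offline, as the speaker suggests, the two counting answers quoted in the talk (1/3 for the two-children question and 1/42 for the seven-children variant) can be verified with a standard-library enumeration:

```python
from fractions import Fraction
from itertools import product

def cond_prob(event, given, space):
    """P(event | given) on a uniform sample space, by counting."""
    kept = [o for o in space if given(o)]
    return Fraction(sum(1 for o in kept if event(o)), len(kept))

# Two children, at least one is a boy: P(both are boys) = 1/3.
two = list(product('BG', repeat=2))
print(cond_prob(lambda o: o.count('B') == 2,
                lambda o: o.count('B') >= 1, two))  # 1/3

# Seven children, the eldest is a boy with at least three brothers
# (so at least four boys in total): P(all boys) = 1/42.
seven = list(product('BG', repeat=7))
print(cond_prob(lambda o: o.count('B') == 7,
                lambda o: o[0] == 'B' and o.count('B') >= 4, seven))  # 1/42
```

Listing the `kept` outcomes is exactly the "ask Lea to show you all the combinations" trick from the talk: the conditioning simply shrinks the sample space.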
A slow migration from Django templates to Vue+GraphQL
Okay, so now we have the speakers here, and we can start the next talk. The talk is going to be about a slow migration from Django templates to Vue and GraphQL. Jonathan and Dominik George are going to talk about a system, AlekSIS, which is a school information system that was written using Python and Django templates, and they have now ported it to Vue and GraphQL. So give them a warm welcome, and thank you very much. Thank you. Can we get the microphone for the other speakers? Thank you very much. Hello, FOSDEM and Python Devroom. We are the AlekSIS project. That's the All Libre School Information System, and we want to tell you how we transitioned from a Django app with a templated web front end to an interactive web front end, as the need for one arose in our project, and how we did it incrementally. I'm Michael Bauer, and I'm a developer at AlekSIS; I work mostly on the new front end and the new features it enables. With that, let's introduce the rest of the team. Yeah, my name is Nik. I'm more or less one of the founders of the project. I started tinkering on the school management system when I was still at school; I don't think I can remember when that was. I don't know what my role on the project is right now, but someone might know. I have a microphone of my own, so I don't need that microphone. That's decent. So, I'm Jonathan. I'm the lead developer of the AlekSIS project, and I'm coordinating the dev process and everything connected to it. Okay, so let's get started with the talk. What is AlekSIS? It is a free and open source school information system, and it has a free software license, the European Union Public Licence. It's thought of as an alternative for schools, so that they have a free option to manage and organize themselves.
It's a modular system, so any school can take just what they need and doesn't have to use the whole system. It's also done in such a way that it complements existing solutions: we focus only on the parts that aren't there yet in a free-software way. It's developed by software developers, but also by students and teachers; we're working together with pilot schools and already have it in use there. The main AlekSIS features are divided into components. The main component is base data management, the basis for the schools: classes, pupils, teachers, and so on. Then we have a timetable system; it's like a calendar system just for schools, so you can create timetables and serve them to the students. Each student has their own personalized timetable, and the teachers have them as well. There's a digital class register to take all the notes and information for classes, and seating plans, so you can design and show seating plans for the classrooms. It also integrates with other services: we have a Matrix integration, an OAuth integration, LDAP, and CSV import/export. We also have a calendar system inside AlekSIS that produces a standard iCal calendar feed, so there's lots of choice in which end devices are used to hook up to AlekSIS, and it's a quite universal system. There are also provisions for student ID cards and inventory management in schools. With that, I would like to hand over to Nik. He will present the technology stack. Yeah, thank you. So, thanks for making this nice graphic to help me explain how this works, Jonathan. Well, our legacy code base was a traditional Django project, with all the modules as Django applications. When we started, basically everyone was doing server-side rendering, with all the nice templating features of the Django framework.
To introduce you to the rest of the tech stack: on top of Django, we use PostgreSQL quite heavily. There's Celery as a task broker, and Redis for caching and for synchronizing several nodes when running AlekSIS in a multi-node setup. For the front end parts, as I already said, we used the Django templating engine and some not very well integrated front end utilities, like the Materialize CSS framework, which at the time somewhat allowed for making modern interfaces following the Material Design standards, but it started to bit-rot quite quickly, and Jonathan will give you some idea about that later. Okay, so that was the legacy tech stack. Here you can see a page in the legacy tech stack, a little overview of how it looked in the past. And then the problems started. We encountered some very ugly bugs; for example, users described to us that if there was a select menu and they pressed an item in the select menu, the item above or below it was actually selected instead. That was not so good, because many users were using iPads. In addition to strange bugs like this, there was also a problem with the maintenance of Materialize, as you can see from these issues: there was a big discussion about whether Materialize would be developed any further. And on top of these problems, there were also requests for new features. As we spoke about timetable planning or seating plans, we needed some way to do these highly dynamic features in a better way, because the front end for timetable planning is a very complicated thing. Also customizable calendar views, and auto-saving views where you don't need to press a save button: it all wasn't possible anymore with our old front end. So we had an idea, which Nik will present to you.
Okay, so probably many of you know that it's now the new thing to separate front end and back end entirely and make a nice shiny mobile app or whatever. Jonathan, more seriously, already gave a few hints about why we would want to do that. I think there's one other challenge that we faced. Did you mention offline capabilities and caching? No. Because, you know, AlekSIS is used in schools, and things might be different in other parts of the world, but in Germany only two things are certain in the school system: namely, that your mobile network will not work at school, and that the wifi won't work at school. These two things are certain, and therefore teachers always complained that they could not use the server-side rendered views when they had no connection to the server. I think this was more or less one of the biggest challenges we tried to solve, so separating the front end actually makes sense here. Okay, so what we wanted to do: we wanted to replace Materialize, because Materialize was stuck somewhere in 2015 and wasn't really being developed anymore; it was abandoned. We had a few patches on top of it, I think some even upstream, but it didn't get better, and it lacked the dynamics that we needed for a really new, shiny, intuitive interface. So: reactive front end libraries, to make the interface not reload on every single interaction. And also a very important idea: AlekSIS provides a very good foundation for handling organizational data at schools, but we want to tailor it to the needs of different schools, of different types of schools. One of the most important claims that we share with schools, when we explain the benefits of free software, is that we can make the software work like the school works, and we can transform the software instead of transforming the school.
So, on top of the foundations for organizational data management, the idea was that if we could replace the front end for some parts, like making a different class register for an elementary school because they have very different needs, we would not have to replace the data structures, the models, and the APIs, but could make a front end that is more tailored to those needs. Okay, this is not my part anymore. So we then decided how we want to build our new AlekSIS tech stack. As we said, we just took the back end and said, okay, that's our back end, and then we decided we want to build an interactive front end with Vue.js, the front end library Vuetify, and some other Vue.js libraries, and we want both parts to communicate via a GraphQL API. So this was our plan, and there were some challenges with this plan. Okay, so, yeah, let's see. Thanks for helping me keep up with my tradition: I always give one very good talk before beer night and one very bad talk after beer night. So, as we already said, the platform is supposed to be very modular. Do we have some figure for how many Django apps we had at the point when we started the migration? Around 15, I think. Around 15 apps that could be loaded dynamically into the Django project. We actually had quite a bit of magic in there to discover the modules of the Django apps dynamically, so the administrators who deploy servers for schools could simply install the Python packages needed for the system they want to put together, and then everything falls into place in some kind of black-magic way. Now, this did not turn out so well for separating the front end, because normally, when you separate the front end, you want to have one JavaScript (or whatever) application that is delivered to the clients, nicely bundled with whatever JavaScript bundler is the current hype. And then it is one JavaScript application.
We could not do this, because we do not know which parts of the system are used, and in which versions; this can be very flexible for every school. So we need to bundle the JavaScript front end application on the machine where AlekSIS is deployed. Ten minutes left? Thank you. Okay, and do you need these ten minutes? Probably. So, the right way would be: you have one front end application and one back end application; they are more or less separated in development and could be developed independently. But we cannot do this, because... okay, I have to switch the display so you can see this. This is where we actually generate parts of the bundling configuration for Vite, because when we build the bundle, we know which applications are there. We have the JavaScript front end code bundled with the Python packages in the same repository, and at deployment time we need to extract the JavaScript front end code and let it all fall into place, like we did with the Python applications, which was sort of a major challenge. Yeah, the microphone is developing, that's good. And then we faced another challenge: we weren't able to migrate all these apps at once, so we had to find a way to integrate the old front end with the new front end. What you can see here on the beamer is how the new front end looks. There is no real optical difference from the old front end, but it is the new front end, and we had to find a way to put those old pages somewhere in this new front end. If I just say the word iframe, I'll probably get some scared faces here. So, yeah, we did it: we just put an iframe somewhere in there, and then we built some glue which takes the URL that is actually called and then calls a different URL, with a prefix where the old site lives, and integrates it within the front end. And that looks like this: what you see within this container is an old page.
And what you see around this container is the new front end. If you look at which URL is shown here, it has the prefix django, so it's within the iframe, and if I click the button, the iframe will navigate to this Django URL. I'll do this, and you can see that, magically, the actual URL in our new front end is also updated. So it's a kind of ugly magic. And this also goes one step further: this is an old view within the new front end, and now I click one of these links and it navigates to a new view in our new front end. This needed a large bunch of glue to put together, but now it's working, with some exceptions, which Nik will come to. Some exceptions, yeah. So, the iframe with the server-side rendered page and the new Vue.js front end are always communicating, using some sort of JavaScript message passing that I have not yet fully understood. Okay, so what are we looking at here? This is the dynamically generated bundler config, or something. Yes, it is. I don't think we have the time to go into detail about it. And, oh, there's a video. I hand over to Michael. Here you can see the new front end in action, and why we did this transition: because we wanted more interactivity. Here you see how you can design a timetable now, with the new Vue front end: someone is inserting lessons into the timetable, and it's highly dynamic and all just works. So, now we want to tell you about the new problems, and I think this last part will also be done by Nik. Oh yes, this problem. We already talked about iframes and how they communicate; sometimes, as we all know, communication fails, and then you have AlekSIS inside AlekSIS inside AlekSIS. I think this visualizes quite well what sort of trouble this slow migration caused for us, but we have not seen it too often recently, right? Not too often, I don't think so. Prove me wrong, okay, thank you. We called it mini-AlekSIS; now we call it the AlekSIS Matryoshka situation.
If you know what that means. It pops up every other month. All right, so now we have ugly front end bugs from the integration, and all of this will be sorted out once we get all applications and all views migrated to the new front end. The JavaScript ecosystem shares some of the same problems we had with the Materialize situation, because, you know, there's Vuetify 3, and it's pretty neat, and we needed to migrate to Vue 3; Vue 2 has been deprecated for two years or something. Pardon? This year? Okay, so not too far in the past, but it's deprecated. And Vuetify 3 is cool, and we would want to migrate to it, but it's still missing the calendar component and the date picker component. And basically the only thing AlekSIS ever does is handle dates, so this is somewhat of a showstopper. We hope that this will be sorted out; I think the release date for the date picker is moved every quarter to the next quarter, but we will see how this works out. Of course, there's an easy and obvious solution to the problem, because we could just do this, right? No tomatoes for me? And get some new problems. So we are always shifting from one set of problems to the next set of problems. Okay, thanks for bearing with us. I think I'm slowly getting awake. You can find us in the hallway track if you want to get more information and less chaos, maybe. All right, do you have any last words, Jonathan? I think we have about three minutes for questions, if I'm right. So maybe, if someone wants to ask a question; otherwise, we will also be available via email. Any question? Thank you. I have a question: why did you go for GraphQL instead of something like Django REST Framework, exposing APIs and using those, instead of adding a new layer between the front end and the back end? Yeah, well, I think we chose GraphQL.
Because, I think, the obvious alternative would have been REST or something like that, but we chose GraphQL because we are able to select what we deliver to the front end. We have very complex models, and we can say: okay, for this page we just take this set of information, and for another page we need a much larger set. But of course, this GraphQL integration is causing us problems with an unmaintained, or only slightly maintained, Django library, and things like that. So, as we said: another set of problems. I think the microphone is not working. I can just be loud. Yeah, just be loud. Okay, I'll just be loud. So, thanks for the presentation. I know your pain; I've had to do that job a lot. My question is: what I've been having success with now is the back-end-for-front-end pattern, because all these fancy new reactive libraries now have these meta-frameworks, which is an awful word, but they kind of work. Have you considered doing that? The way I like to do it is: you have the new back end for the front end, and when it doesn't know what to do, it just gets the server-rendered page back. So why did you try to keep a single-page application? Do you want to answer this? What was the question exactly? Have you taken a look at these back ends for front ends? Do you like them, do you not like them? What exactly do you mean? Like Next.js, for example, that's the one for React. Yes; it's a kind of thing we have never used. About two years after this migration started, we thought: oh, we could also have used this. But now the work is done, and we have to go on with it; our developer capacity is very limited. So yes, it's a kind of knowledge we didn't have. Okay, so thank you very much for the very nice talk. Interesting system. Thank you. Thank you.
Django migrations, friend or foe? Optimize them for testing
Hi everyone. How many Django users in here? Raise your hands. Keep your hands up if you are dealing with Django projects with a lot of migrations, and with limited continuous integration minutes. Okay, this talk is for you: you are in the right room. I am Denny, on the right side of the photo. I work in JavaScript, Python, Vue.js, Django, everything like that. So let's start with Django migrations: they are our way to propagate changes from your models to the database schema, and to keep track of them. Let's quickly recap the migration commands: you can use makemigrations, migrate, showmigrations, and sqlmigrate. The first one, makemigrations, creates new migrations based on your model changes. You can use different parameters there: for example, an empty migration that you can customize, you can give a migration a specific name, and you can restrict the creation of a migration to a specific application. The model, for example if you want to recreate Twitter (we all know the reason for that), is this one. You can create a class for a model, and then creating the migration with the command will create a new file in your project, in the migrations folder, with this content: initial = True if it's the first migration of the app; a list of dependencies, for example if you are using something like authentication, or, if you are on the second migration of the project, the first migration as a dependency; and a list of operations performed during the migration. Then you can apply your migration, of course, using the migrate command, specifying an application or not, or a migration name. So if you want to move to a specific point in the history of your migrations, you can specify it. On a new project, you can migrate everything using manage.py migrate, and everything is at the latest version of your database schema. And if you want to roll back every migration in a project, you can migrate to the zero migration, and everything is rolled back.
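The migration file contents just described (the initial flag, the dependency list, the operation list) look roughly like this. This is an illustrative sketch of the kind of file makemigrations writes; the app and field names are hypothetical, and it only actually runs inside a configured Django project, so treat it as a declarative fragment:

```python
# tweets/migrations/0001_initial.py -- hypothetical example of what
# `makemigrations` generates for a minimal tweet model.
from django.db import migrations, models

class Migration(migrations.Migration):
    # True only for the first migration of this app.
    initial = True

    # Empty here; a later migration would list e.g. ("tweets", "0001_initial").
    dependencies = []

    # The schema operations applied inside one transaction.
    operations = [
        migrations.CreateModel(
            name="Tweet",
            fields=[
                ("id", models.AutoField(primary_key=True)),
                ("text", models.CharField(max_length=280)),
                ("created_at", models.DateTimeField(auto_now_add=True)),
            ],
        ),
    ]
```

Running `python manage.py sqlmigrate tweets 0001` would then print the CREATE TABLE statements this file expands to.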
You can move to the second migration in your project with this, and without specifying a migration number, you can migrate everything to the latest version. Now, how this works under the hood: you have in your database a django_migrations table with content like this, so the application name, the name of the migration, and the date and time when the migration was applied to your database, so everything is in your database. There is a better way to show this: using showmigrations, you can have a view of the list of migrations in your database, in your schema, with a tick if the migration has already been applied in your database. And then with sqlmigrate, you can print the SQL statements for a specific migration. So with our example, we can display the SQL code for this. Let's take a look at it. A transaction will be opened, every command will be applied on your database, and then the transaction will be committed if there are no errors. Now, if you need to make further changes in your model, you can apply those changes and then create another migration. The migration will depend on the first one, and then the code will be another transaction, the SQL commands, and a commit. And again and again, you can apply migrations on your database in production using this. What if you need to make further changes, for example adding tweet likes and a lot of other stuff? Then you can make changes in your models and create a single migration for each one, because of course I like to be well organized and structured, so every single change for me means a single migration. Then you end up having a lot of migrations like this one. But even worse, if you need to create, for example, a shop app for a customer, then you need to create a model, and then during the lifetime of your application you need to make a lot of changes to your model structures.
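As a toy illustration of that showmigrations view (not part of the talk), here is a pure-Python helper that parses the kind of listing `manage.py showmigrations --list` prints, with `[X]` ticks for applied migrations, and reports what is still pending; the sample output lines are made up:

```python
def unapplied(showmigrations_output: str) -> list[str]:
    """Return the migration names that carry no [X] tick, i.e. that
    have not yet been recorded in the django_migrations table."""
    pending = []
    for line in showmigrations_output.splitlines():
        line = line.strip()
        if line.startswith("[ ]"):  # unticked: not yet applied
            pending.append(line[3:].strip())
    return pending

sample = """\
shop
 [X] 0001_initial
 [X] 0002_tweet_likes
 [ ] 0003_shop_details
"""
print(unapplied(sample))  # ['0003_shop_details']
```

The same idea, reversed, tells you which migrations a rollback to a given point would undo.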
Okay, we won't list all of this, but we had to make a lot of changes, for example adding tables, moving data from one table to another, from a main table to a detail table, and a lot of other stuff, changing data during the workflow. So changes can be a lot of pain, and when migrations become numerous, your performance during tests can decrease a lot. During deploys it's fine, you can move forward and backward with simplicity, but in tests it's not that simple, because you need to wait for every migration to apply before running tests. And if you are paying for your testing time on GitHub workflows or other platforms, that can be painful. As a disclaimer, the timings for this talk may change from laptop to laptop, so keep this in mind, but on my old laptop (this one is brand new, so it's faster, hopefully), these were the timings. So, running tests on 20 apps like Shop (I just copy-pasted it 20 times in the example repository), the tests took just a single second, less than a second, to run, and that was perfect, so there's no need for this talk. Well, not exactly, because creating the test database took 20 seconds. So one second of tests for this project, and 20 seconds for database creation. And that was not optimal, because we were on the verge between the team license and the enterprise license for the timing of workflow runs, around the 3,000 minutes monthly, and that wasn't optimal. We wanted to remain on the team license, because it was cheap, so we wanted to optimize that time. The first possible workaround is to use --keepdb when running tests. This parameter preserves the test database between runs, and that's perfect, because the first run applies the migrations, and then the database will be kept in your cache somewhere, on your local machine, for example.
If the database does not exist, it will first be created and migrated, and when there are further changes in other pull requests, for example, migrations will also be applied, so everything is okay, hopefully. So this approach saved 20 seconds for us after the first test run. The problem was configuring your CI/CD, because a solution could be using cache or artifacts in GitHub workflows, but it takes time to create and store artifacts on GitHub, or, for example, using an external test database from inside the GitHub workflow, and that wasn't optimal. A friend of mine, if I'm not mistaken, suggested this package, django-migrations-ci, that allows you to simply configure an external test database, so you can consider this and save 20 seconds if you have an external database. Another possible workaround, a one-line workaround, is to set "MIGRATE": False in your test database settings. If you use this, migrations won't run during tests, and it's similar to setting None as the value in MIGRATION_MODULES, but for every app in your project, so it's better this way: a single-line change. This has pros and cons. Pros: of course, it's a single-line change, and it doesn't run migrations during tests. The problem is that it's like running makemigrations plus migrate before every test run, so in our example repository this added five seconds, which was the opposite of what I wanted to obtain. So, diving into the Django documentation, I discovered this great, great command, squashmigrations, which squashes an existing set of migrations into a single one. You specify a migration name, and optionally a start migration name, and it will squash every migration into a single one. This was pretty good. I tried it on the shop application, and I decided to squash every migration into a single one. It was good; not perfect for us, but it was good.
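A sketch of the two settings-level workarounds mentioned here, as they might appear in a test settings file; the app labels and database name are invented, and `"TEST": {"MIGRATE": False}` is the Django 3.1+ spelling of the one-line switch:

```python
# Hypothetical test-settings fragment (app and database names invented).
# Option 1: map each app to None in MIGRATION_MODULES, so Django builds
# test tables straight from the models instead of replaying migrations.
LOCAL_APPS = ["shop", "tweets"]  # made-up app labels
MIGRATION_MODULES = {app: None for app in LOCAL_APPS}

# Option 2 (Django 3.1+): the one-line "MIGRATE": False switch, which
# skips migrations for every app when the test database is created.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "example",
        "TEST": {"MIGRATE": False},
    }
}
```

Both options trade migration replay for table creation directly from the current model state, which is why, as the speaker notes, they don't always save time.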
The problem is that we needed to do some manual porting, because, for example, we used a lot of manual functions from one migration to another, from one version to another, and those weren't automatically squashed, so we had to copy-paste the function code into the squashed migration and make some adjustments. If we inspect the squashed migration file, we can see at the top of the class definition a list of tuples in the replaces attribute. The first item is shop, the application name, and the second one is the migration name, for every one of the 26 migrations. And the recommended process is: first squash, keep the old files, commit and release to production (through staging and demo, then production), then wait until all systems are upgraded with the new release, then remove the old migration files, commit, and do a second release. Then, last but not least, you need to transition your squashed migration to a normal migration: delete all the old migration files that have been replaced, update all migrations that depend on the deleted ones to point at the new squashed migration, and after all that you can remove the replaces attribute in the squashed migration, and everything is fine. Then, if you want to clean up your database, you can prune references, so in your database there won't be references to old migrations. Let's test performance after squashing, after spending a week on my work project doing that, and... oh no, no change. So I lost a week doing that without results, and don't tell my chief. So what's the point? Well, the point of squashmigrations is to move back from having several hundred migrations to just a few. For example, if you create a separate branch where you are working alone, you can squash migrations and propose just a single migration file in your pull request. I know, I know, you wanted to speed up tests, so let's do it. Are you ready?
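As a small illustration (migration names invented, standing in for the 26 real ones), the replaces attribute is just a list of (app_label, migration_name) tuples, so it can be generated rather than typed out by hand:

```python
# Toy example: build the `replaces` list a squashed migration carries.
app = "shop"
migration_names = ["0001_initial", "0002_add_price", "0003_move_details"]

# What the squashed migration's `replaces` attribute would contain:
replaces = [(app, name) for name in migration_names]
print(replaces[0])  # ('shop', '0001_initial')
```

Django uses this list to mark the squashed migration as applied on databases that already ran the originals.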
It's not that easy. First you need to recreate migrations. So list the migrations for a single specific application with showmigrations, and copy-paste all the names of your migration files. Then you need to manually create a replaces list (you remember this one from a moment ago) with the application name and migration file names, and store it somewhere on your computer. Then move your migrations to a temporary directory, out of the way, and make sure that showmigrations doesn't show anything. Now it's time to recreate the migrations using your application name and a specific name, for example init_squashed, so that you remember this is the squashed migration. That will create the first migration at your latest model version. Then open your migration file and copy-paste the replaces list you created a moment ago inside your class. Then you can restore your old migration files to their original directories, check for missing or overwritten files, and remove the temporary directory. Now, with showmigrations, you need to check that everything is there, so in this case all 26 migrations are there, and the squashed migration is there but has not been applied. Then apply your squashed migration and check again with showmigrations that everything has been squashed and you have just a single migration. Then you can go back to your post-squash tasks: commit and release to production, upgrade all the systems (staging, demo, production, everything else), update all migrations that depend on the deleted migrations, remove the replaces attribute, and, if you want, prune references to the deleted migrations. And everything is perfect, right?
Well, not exactly. If you have migrations providing initial data, you need to create a new migration for that, because recreating migrations from scratch doesn't recreate those data insertions. Or, even better, you can use fixtures, and in the docs you can see how to use fixtures both in data migrations and in testing, and that's perfect. Then you need to be aware of circular dependencies, because if your project is big and grows over time, you could have circular dependencies from one app to another and back. This problem requires you to remove all the foreign keys causing the circular dependency, create the first migration, restore the foreign keys, and create a second migration, and this way you will hopefully solve it. Now, let's test performance after all of this, after another week spent on the project trying to tell your chief, oh, I'm working on something useful, I promise. And yeah, of course, after recreating everything from scratch, our database creation task took five seconds instead of 20. That was perfect. Yeah, it was perfect, but does this apply to everyone?
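A sketch of the fixture idea for initial data (fixture and model names invented); this is a Django project-file fragment, not runnable on its own:

```
# shop/migrations/0002_load_initial_data.py  (illustrative sketch only)
from django.core.management import call_command
from django.db import migrations

def load_fixture(apps, schema_editor):
    # Loads shop/fixtures/initial_shop_data.json (made-up fixture name).
    call_command("loaddata", "initial_shop_data")

def unload_fixture(apps, schema_editor):
    # Reverse step: delete the rows the fixture created.
    Shop = apps.get_model("shop", "Shop")
    Shop.objects.all().delete()

class Migration(migrations.Migration):
    dependencies = [("shop", "0001_initial")]
    operations = [migrations.RunPython(load_fixture, unload_fixture)]
```

Because the data lives in the fixture file rather than in migration code, squashing or recreating migrations no longer loses it, and the same fixture can be reused in tests.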
It depends, because if you have really big, big projects and you are paying by the minute for your CI/CD workflows, and you are on the verge of going from paying $3-4 per user per month to 20-something dollars per user per month, then maybe you want to stay on the cheaper tier, so that could be a solution. But if you just want to put your migration files in order, then just use squashmigrations without everything else. Or if you want to speed up tests on your localhost, you just need to use --keepdb, and everything is fine, without having to spend, in my case, two weeks working on this just to save maybe a couple of seconds on your project. So it depends on your use case. And we are done, so if you want to see the example repository, it's there, with three different branches if you want to compare them on your local machine, and I uploaded the slides to the FOSDEM website, so they are there if you want to take a look at them. Thank you very much. Okay, we have time for quite a few questions, I see one up there. Given your salary and these two weeks of work you've done, how many years of enterprise licenses did you avoid? That's a nice question. Hopefully my chief won't ask me that, but I think we could have paid maybe a year of it, I don't know. But yeah, it was fun to play with this, and for me, at least, spending two weeks trying new stuff, or trying to discover hidden stuff in Django, was worth it. More questions? Good question. Yeah, thanks for the great talk. I was wondering if you looked into using seed databases for CI, so that... Sorry. You don't hear it? No, I didn't hear you, sorry. Whether you looked into seed databases for CI, so that you run your migrations locally, then dump the database, and then use that database during CI to start off with a pre-migrated database. No, I didn't think about that. It's a good idea, so you just upload your database dump, and then on your...
Yeah, so you just set up your CI script to use that database when it initializes. That could be a good idea, I need to try that, thank you. So you restore the database and just apply your last migrations, without having to apply everything. Yeah, exactly. Yeah, that's a good idea, thank you. Thank you. I was also wondering, if you're using Postgres, for example, you can disable fsync, which will just keep the database in memory, so that could probably be a solution for the big time cost. So locally we kept the database in memory; the problem was on our CI/CD, so we created a service in the workflow files, and that was creating a database from scratch. So it was just a configuration you can add on your Postgres side in the CI... We had to consider the time for storing and restoring the database with that configuration from the cache. So there was a little bit of time for that, but yeah, that was an option I tried. More questions? Very cool talk, I like your method. I basically came up with the same method about five years ago for this approach. Do you think there's an opportunity to create a tool to automate some of this process? Well, that's a good question. Maybe implementing it in squashmigrations in some way, I don't know. We can try to do it, just to save two weeks of salary for other people. Okay, I think we're done with questions, so we're going to have another five-minute break and then continue with the next talk. Thank you.
How can we trust 3rd party code? Using Python to understand the trust relationships within the Python ecosystem
Hi everyone, good afternoon. You're here in the Python Dev Room. We're going to have Nigel Brown speaking about how we can trust third-party code in Python. Many times we don't realize all the dependencies that come along when we install a package from PyPI, and Nigel will be talking about how we can avoid getting some dodgy packages and things. Thank you very much, Nigel. Thank you. Thank you. Right. Okay. My name is Nigel Brown. I've been programming since 1981, as a kid. I got a job about 12 years later. I've done mobile devices, security, data, lots of different languages. I currently work at a company called Stacklok, where I'm doing some data science and some engineering. If you're interested in the supply chain, and frankly, who isn't these days, you'll love Stacklok. You should check them out. This talk covers some of the ideas that we've been grappling with there for the last nine months or so. Okay. Here are some supply chain attacks, recent examples. I don't know much about these attacks; I'm not a security researcher. But every time I read about one, I feel vaguely uncomfortable. These are things that could apply to me, on the whole. And this is why we're looking at these things, and the flames show that they're scary things. Okay. So, recent lawmaking and legislation: we've got Executive Order 14028 in the States and the cyber resilience proposals in the EU. The EO pushes SBOMs. What's an SBOM? A software bill of materials. It's probably a bit too much detail to go into right now; look it up. There are tracks over in the other building about this. SBOMs are more of a first step than a solution; they're a step in the right direction. Creating them sounds simple, but the practicalities get in the way, and doing something with them is still more of an art than a science. They are progressing. The key point is that responsibility for the security of your code is shifting towards vendors.
That means it's shifting towards you, on the whole. There are some more scary flames there, because that's quite scary. Okay. Supply chain attacks: it all boils down to who and what you trust. The key point, really, is that insecurity most often comes from behavior rather than from the technology. Why are supply chain attacks becoming more fashionable? Maybe it's because they're easier than they used to be, or maybe everything else got harder. Or perhaps they were always there and we just didn't notice. I don't know the answer, but there is a lot more focus on them these days. So, a word on trust. Basically, we want to trust some third-party code. That circle represents us. We're victims, scapegoats, developers. The supply chain is how this code actually gets to us. We generally get code delivered as some form of package. And that package, the source and the package, have to live somewhere. Sometimes they live in the same place, as in Go, which is a very good example. Sometimes they're in different places: some other package repository. These can be private, but we're talking mostly about open source. Important point: we have to download it. These are all potential failure points for the software supply chain. Of course, we have multiple versions. They're changing all the time; they're a moving target. There are normally tags in a source repository that point to the different versions. And these are delivered as a bag of files to us, on our laptops or our servers. At this point, we can scan them. We can do vulnerability scanning and we can do static code analysis. We should do that. Definitely should do that. And the code has owners. The point here is that you can't really trust code; it just is what it is. It's the owners you're trusting. And the question we're faced with a lot of the time is: do we trust the right people? And it's not just the code owners. There are multiple other people: contributors.
And we trust those people because of their reputation. Reputation comes from several sources. It comes from various media. Personal knowledge: you might know some of the developers. And quite often we trust in a community of one sort or another. Companies have reputations too, sometimes good, sometimes bad. How do you trust a company? If you've got closed source, that's the only trust you've got, actually. The web of trust here is building up. Now, "turtles all the way down" is an expression of infinite regress. I heard it once and thought it would be a good metaphor for this stuff. It turns out, when I was looking for an image, that Cole Kennedy thought the same thing, so I nicked his image, because it displays this quite well. The average medium-size project has about 1,500 transitive dependencies. So you depend on something, and it depends on other things. You can investigate one package at a time. You can look at its origins, you can look at the people, you can perhaps do a code audit. But doing thousands of them is hard work. It would just take too long. So we probably want automation to help with this, and that's one of the things we're working on: trying to give this thing some oil to keep it going. So, this web of trust: the supply chain can be attacked at any point here, and it can break at any point. It doesn't have to be attacked, necessarily. And, the main point, there are thousands of ways you can draw this diagram. It doesn't have to be like this. But there is complexity there, and it's messy. So what do we do about this mess? Okay. So, what we currently do: CVEs. We really like to see these, and that's because we can count them, we can fix them, and we can show improvements. But they've been guilty of a little bit of misdirection, actually. In reality, only about 2% of these are exploitable. So if you're not careful, you end up doing a lot of work that you don't actually have to. This comes from a Red Hat report.
I've seen other estimates of this 2% value, and they are of similar size. Okay. Another thing you can do: static code analysis. Currently it's mostly signature-based; it finds things that we've found before. We think there may be more legs in grabbing features from the source code and running them through a neural net. This may or may not be more effective; there's lots of research out there, but there's still lots to do. We think we're going to be doing some of this work ourselves at Stacklok, but that's more for the future. Criticisms aside, we should definitely do CVE monitoring and static code analysis; don't take anything I say here as an excuse for not doing these things. Okay. So, another idea is to look at metadata rather than the code itself. That would be descriptions of the package, links to the source repositories, activity around it, et cetera. This is a bit like classic security traffic analysis, or perhaps bank fraud detection: we're looking at behavior around the package rather than the actual code itself. Okay. So, this is a graph. Basically, malicious packages look different from non-malicious packages, on the whole. The ones on the left, these little blue dots, are malicious packages. The ones on the right are non-malicious packages; they're surrounded by a nice bunch of purple users and orange source repositories. You probably can't see that in any detail from where you're sitting. The point is that they look different most of the time. Sometimes you get good packages over here that are sort of isolated, and you get malicious packages over there that are well connected. So there is a caveat there: some of the malicious packages look fine. But most malicious packages don't make any effort to hide the fact that they're malicious. If you look at their metadata, it's quite obvious: there's no description.
There's no effort put in at all. Unfortunately, a lot of legitimate packages look like that as well, which makes it a little bit harder. We started off by putting a neural net on this: we tried a classifier and classified into malicious and non-malicious packages. It worked beautifully. But, so what? You don't really need a neural net to tell the difference between those two things. You just need to look: has it got any data associated with it? So, not necessarily very fruitful. We don't need a neural net. Instead, we did a simple score. It looks at malicious packages, mostly Python, though we've just started with some Rust and npm as well. We looked at the activity and the provenance (I'll come on to that a bit later) and normalized it against a whole set of packages that we ingested. You can see here that most of the malicious packages (these are just the malicious packages) scored really low. So, hey, it looks like we can spot malicious packages using the metadata. Not so fast. Unfortunately, the base rate let us down. As I mentioned, we do get low scores for malicious packages, but we've got at least 10 times as many good packages that score zero as well, which isn't great. So if we get a low score, it means we've got maybe a one-in-10 chance of having found a malicious package. We don't know for sure one way or the other, so you've got to go on to your code analysis then. And also, I should point out, this isn't a representative sample. We don't have a labeled data set of all the malicious packages in the PyPI repos, because we haven't found them all yet. So we sample as best we can, but we don't know. Does that handicap matter? Probably not, because most of the packages we actually want to use are probably on the far side of the scale: they do have good descriptions, they do have good information, and they are linked up. There are some exceptions.
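As a toy illustration only (the features and weights below are invented and bear no relation to Stacklok's real scorer), a metadata score of this kind boils down to combining normalized activity and provenance signals into a single number:

```python
def metadata_score(has_description: bool, linked_repo: bool,
                   releases_last_year: int) -> float:
    """Toy score in [0, 1]: invented weights, purely illustrative."""
    activity = min(releases_last_year, 10) / 10        # cap and normalize
    provenance = 0.5 * has_description + 0.5 * linked_repo
    return round(0.5 * activity + 0.5 * provenance, 2)

# A bare, unlinked package with no releases sits at the bottom...
print(metadata_score(False, False, 0))   # 0.0
# ...while an active, described, linked package sits at the top.
print(metadata_score(True, True, 12))    # 1.0
```

The base-rate problem in the talk is visible even in this toy: plenty of perfectly legitimate packages also have no description, no linked repo, and no recent releases, and land at 0.0.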
All right. Okay. We currently act like this: vulnerabilities are all there is, and they're all deadly. This creates a lot of work for everyone, as I mentioned earlier. We're only really worried about things that can hurt us, right? And the reality is that it's more like this: most vulnerabilities can't hurt us. We should use things like OpenVEX, an emerging standard, to describe the vulnerabilities that are actually exploitable in place, and then we only have to deal with the shaded bit between the two circles there. Obviously, you want to fix all vulnerabilities, but there's a prioritization system we can employ here. Another thing to note is that malicious code doesn't always use CVEs, and there are other things that can hurt us that aren't CVEs. Malicious code leverages bad habits, like leaking keys and manual processes. We've got abandoned code, which gets taken over or isn't updated. But bugs and bad habits and abandonware can also hurt us accidentally, without being malicious. Malice isn't everything. So, we want to avoid all these bad things. But most of the things we actually want to know about are hidden from us. The malicious code is hidden by stealth. Buggy code is hidden by incompetence or apathy. And since we started patching CVEs, bad actors have moved increasingly to zero-day exploits. And let's remember, most code isn't malicious. When we look at the metadata, buggy, poorly maintained, abandoned, and malicious code all look similar. And you have to ask yourself: if we can't tell them apart, do you really want to use any of them? So, given that this is a hard problem, why not do something simpler, which is to invert the question? Look for the good, not the bad. It's like looking after your health instead of focusing on disease.
So, the good bits are everything outside the circle; we want all the rest of the code. And for the Rust developers who insist that code can be finished, it's this bit as well, the abandoned bit. Right. So, what does this look like? We want things that probably won't hurt us. This is the inverse of what we just had: good coding and hygiene habits, active development, regular releases, developers we trust, and so on. Code that is clear. And the key point is that looking for good things is easier, because good isn't hidden. Okay. Right. So, I mentioned provenance. The first challenge is provenance. If you're going to do anything with any of this code, if you're going to scan it, do whatever you like, you need this. Provenance means origin: we need to find out where the code came from. Starjacking is when a package lies about its origin and pretends to be a better package than it is. You'll find that lots of different packages share the same source repository in the package systems; it's very common. How do we find provenance? Remember, the executive order earlier mentioned SBOMs. SBOMs are basically a shopping list of everything in your piece of code, whatever it is: operating system, game, package. It's a document of provenance, is what it is. What you put in an SBOM isn't quite standard yet, but it's becoming more standard; there's lots of work going on with standardization, OpenSSF, and there's a track over in the other building that covers this. It's where we probably want to go: we want to be able to record these things strongly. Now, if you've got an SBOM, you want to put it somewhere safe. You don't want people tampering with your SBOMs. So a thing that's becoming more common is Sigstore. This is artifact signing: it's storing signatures in a transparency log, a distributed ledger.
It gives us cryptographically strong provenance, and it circumvents most of the problems with delivery that we've got. There's a sort of convergence on this; it's being used more and more in the community. I think it's where we're going to end up, and it does solve a lot of problems. But the fact is that at the moment most code isn't signed, and I think it will be a few years before it is. And now, historical provenance. That's a Stacklok thing. Basically, we take a bunch of tags from the source repo, we take the versions, and we see if we can match the dates. And if the dates match up, then we say it's got some provenance. It's a statistical process, quite hard to fake. There's a whole video on that on our website, if you're interested, along with blogs and things like that, so I won't go into it any further here. Right. So, just because you've got some code with rock-solid provenance, and you know where it came from, there's actually no real shortcut for saying whether it's any good. The old-fashioned ways are the only ways: you test it, you measure it, SCA again, code review. That requires the provenance, of course, because you don't want to be reviewing some other bit of code that doesn't apply to your package. And you become intimate with it. And with all those turtles and packages, intimacy takes a lot of work. Right. We've got a community of people, so to make this viable at any scale, you want to share the work with the community. And we also want to automate this, because you don't want to have to be on email talking to people all the time. All right. Okay. I mentioned reputation a couple of times. So, the reputation of the people and the companies that we're talking about: what do we know about someone? Perhaps we know them. We mostly know the size of a company; we don't know much about them internally.
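A toy sketch of that date-matching idea (tolerance, versions, and dates all invented): compare each released version's upload date with the date of the corresponding tag in the source repo, and count how many line up:

```python
from datetime import date, timedelta

def provenance_fraction(release_dates, tag_dates, tolerance_days=2):
    """Fraction of versions whose release date falls within
    `tolerance_days` of the matching source tag's date (toy model)."""
    tolerance = timedelta(days=tolerance_days)
    matched = 0
    for version, released in release_dates.items():
        tagged = tag_dates.get(version)
        if tagged is not None and abs(released - tagged) <= tolerance:
            matched += 1
    return matched / len(release_dates)

releases = {"1.0": date(2024, 1, 10), "1.1": date(2024, 2, 1)}
tags = {"1.0": date(2024, 1, 9), "1.1": date(2024, 3, 15)}  # 1.1 off by weeks
print(provenance_fraction(releases, tags))  # 0.5
```

A high fraction suggests the package really is built from the repository it claims, which is exactly what starjacked packages cannot easily fake across a whole release history.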
We guess and we hope, and, you know, do we even care? The executive order says that we do, apparently. So that's where our reputation currently comes from, I think. Where should it come from? It should come from prior art, participation, recommendations. Generally, we want some proof, and generally, we want to automate this. Okay, so, the key points. Once again: look for good things, they're easier to spot. You don't trust code, you trust people. Trust is complex; it can break in many places. Reputation is important. Communities can share work. And automation makes this possible at scale. Shameless plug: that's the kind of stuff we're working on at Stacklok. We're open to ideas. Try our tools; they're on the website. Join the conversation on Discord. The source is open, and where it isn't yet, it will be. And that's the end of the presentation. So, any questions, please? Great presentation, Nigel. Thank you very much. We have time for one question now. There. I'm coming to you. One second. Thank you for the talk. Maybe I missed it, but what does your product at Stacklok do exactly to apply all you said? Do you enter your packages and it says where the vulnerabilities are, or how does it work in practice? If you go to trustypkg.dev, you'll get to a web portal, and you can type in the name of a package and it'll give you a score. What we're doing is increasing the number of facets the score is based on. We've got provenance measures in there, and we're going to be doing a reputation engine for it as well. So, there's a website, and you can go straight there. To bring this to the developer, there's a VS Code plugin, and as you type along and import a file, it'll put a squiggly line underneath it and say, yeah, this has got a low score. Obviously, some of the low scores are absolutely fine, but it just gives you an indication that you've got to do some more investigation.
There are ways around most of this stuff, but it just gives you flags. But yeah, go to the website. It's fairly intuitive; you don't need instructions for it. Cool. Thanks for the question. Thank you very much, Nigel. Please feel free to reach out to our speaker after. Thank you.
Match all things Python: Parsing structured content with Python's new match statement
Good afternoon. We have now Marc-André Lemburg. He's the CEO and founder of eGenix. Not only that, he's also a CPython core developer. He's one of the organizers of EuroPython, a EuroPython Society Fellow, and he's been making many contributions to Python. So, yes, we have this pop star here. Now, he's going to talk about "Match all things Python": parsing structured content with Python's new match statement. Thank you very much, Marc. Thank you. And thank you all for coming. The reason why I'm doing a talk about the match statement is that I'm getting a feeling that it doesn't receive enough traction. So, I wanted to know from you: how many of you know the match statement? How many of you have actually used the match statement? A lot less. Yeah, that's what I thought. So, maybe a short introduction. Tatiana already mentioned a couple of things. I did a lot of stuff in Python. I've been working with Python since 1994, so a very long time. I did lots of things in core development: Unicode, the DB-API, the platform module. I'm based in Germany. If you have a need for, I don't know, a senior software architect, then please contact me. But that's not the point of this talk. The point of this talk is to show you this. So, this is the match statement that you have in Python. And it's actually a very, very useful thing, especially if you want to parse structured data. Now, the match statement itself is actually quite complex if you look at all the details, and I'm going through all the details in this talk. There are so many details that I have to rush a bit, unfortunately. And I'm not going to be able to show you live demos or anything, because I simply don't have the time for that. So, let's just head right in. So, what's the motivation behind the match statement? People wanted to have something like a switch statement, as you probably know from C or other languages, for a very, very long time.
I wrote a PEP a very long time ago, which basically suggested adding something like that to Python. It was rejected at the time, so it took another 20-something years for something like this to actually make it into Python. What we now have with the match statement is a lot more powerful than a switch statement. So, you can do not only matching on literals, for example, but you can also do matching on types. You can do matching on all kinds of things, including conditions that you apply to these things. You can combine all of these things. You can also do parsing and matching at the same time, which is quite useful, so you don't have to have two passes: first, to figure out whether something is actually valid, and then a second pass to figure out how to actually use the data that you have there. It all started in Python 3.10. That's more than two years ago. But, like I said, it hasn't received that much traction yet. So, what you see here, or maybe you cannot see it, is a graph from py-code.org, which is a very nice site. If you don't know that one, you should go there and have a look. It basically scans all the PyPI code and then does analysis on that. The maintainer did an analysis in July last year and looked at various features of the language, whether they were being used in the packages on PyPI or not. As you can see, in July, there were only 2,600-something packages on PyPI using the match statement. That's two years after the release, and it's only 0.55% of all the packages, so it's next to nothing. So, I guess one of the reasons for that is that the documentation for this match statement is not all that great. I'm talking about the official Python documentation. There are many blog posts about it, and many other resources and overviews that you can tap into. But the Python documentation for the match statement is not ideal.
What you have is these three PEPs, and this is basically the best that you have in the official documentation for Python. If you want to get into these things, then I would suggest to start with PEP 636, which is a very nice tutorial-style introduction to the match statement, and then you can go to the other PEPs for more detail. So, how does it all work? We're going to have a look at this example, and I'm going to go through the various different parts of it. So, the first part is the match subject itself. This is what you want to match, this is what you want to analyze. The next thing is what you have behind the case statements in there. Those are called match patterns, and there are quite a few of those. I'm going to go through a list of the many patterns that exist. Then, of course, you have the match code. This gets executed in case one of those case statements, the case patterns, actually does match. And then you have something called capturing variables. I'm not going to explain what that is now, because I have a few slides on those. This is basically a way to store the data that's being matched in a variable. Plus, you have something that's a bit strange, which is just the underscore. These are non-capturing wildcards. So, it's basically like an else in an if-else statement. So, if the matching goes down, and as the last case you have one of these wildcard things, then this will always match. So, this is a way to do the else in the match statement. Matching itself is always tried from top to bottom, and the first match wins. So, the order in which you list these case statements is actually very important. There's no fall-through, like in C. How many of you know C? Well, quite a lot. That's good. So, you don't have that, because in C you can often make a mistake.
If you forget a break, for example, behind the code that comes after a case, then it just falls through, and then you execute code that you probably don't want to execute. So, let's have a look at these pattern types that we have. Like I said, there are quite a few. I'm going to go through them rather quickly. So, the first one is the literal. So, you can just write a literal string or a literal number, an integer, a float. It can also handle a couple of special singletons, like True, False, or None. Not many more. If you have something else that you want to match, and you don't want to write it down as a literal, you can use a variable kind of notation for that. So, if you have some other value, you put that into a variable that's accessible to the match statement. And what's very important is that you have a dot in that reference. The reason for that is a bit strange: the match statement also works on types, and in order to differentiate between type names and variable names, the match statement and the parser need to have some kind of hint, so that they know what they're dealing with. And the dot is that hint. Now, the next two types are sequences and mappings. They look very natural to a Python programmer. For sequences, you just use the square brackets or the round brackets, and then you match a sequence. What's not necessarily intuitive about this is that this actually matches sequences, not just lists or tuples. So, if you write something like, for example, the tuple notation, and then you pass in a list as the object that gets matched, the tuple case will still match in your match statement. So, that's a bit of a gotcha. You have to watch out for that. And it's similar for mappings. For mappings, you write them like a dict kind of notation. It actually matches all kinds of mappings, not just dictionaries. There are ways to, you know, just match dictionaries.
I'm going to show those in a bit. You can also match, like I said, different types. The very simple ones are all the built-in types that you have there. You also have support for user-defined classes. You have to pay some attention, with user-defined classes, to the order of the arguments that you have in there. I'm going to talk about that in a bit. What's very important are these parentheses. If you don't have parentheses behind this, then the match statement is going to basically treat the name that you have there as a variable, and very often as a capturing variable. So that's another gotcha you need to be careful with. Of course, you can nest all these things. You can combine all these things that I just mentioned in various ways. There's an OR combination with a pipe character. And to make things even more complex, you can add guards to these match patterns that you have. So you can say, OK, for example, down here, if you can see it, there's a sequence a, b, and this should only match if the value a in that sequence is above 10. So you can write very complex things in those match statements. And then finally, you have these wildcard patterns. I mentioned those already. There are two types of these wildcard patterns. One is the anonymous, non-binding one, which is the underscore. And the second one is one where you basically put something at the bottom of your match statement, and you just assign a variable to that. I often use unknown for this because it just makes sense. If you read the code, you can easily understand that this is actually something that matches anything, a bit unlike the underscore. I'm not too much of a fan of this underscore thing. Right, so now let's have a look at the capturing variables. Like I mentioned in the beginning, the nice thing about the match statement is that you can actually combine the matching and the parsing.
So whenever something matches, Python will put the matched value into a variable that you define, which is very much like, for example, the `as` notation that you have with context managers. There are two forms for this. One is an explicit form. I put an example here. What happens is it matches a list, and then, if the list type matches, it will put the value into the variable sublist. And then you can use that variable in your other matching code, or in the actual code that you want executed for that particular case. Very easy to understand. It's a bit more verbose, but it always works, which is nice. And then there's an implicit form. This can cause some problems, because it introduces some of these gotchas. The way that this works is that instead of putting literals into these, for example, sequence notations or mapping notations, you put variables in there. And what happens is that implicitly, for example, in the first example up there, the first entry in that sequence will go into a, and the second entry will go into b. And then you can immediately use a and b, for example, in guards or in the code that comes afterwards. And these things are actually bound variables in your code. This works very well if you have well-defined variable names. If you don't, you can get into lots of trouble. So using short names is probably not a good idea. They should be very explicit. This does also work with some of the built-in types, not all of them. There is, I think this is actually a full list of all of the ones that support this. It does work with classes that you define, but you need to have a look at this PEP for the details. There are some special attributes that you have to define in order for the parser to know in which order these variables should be assigned. Unfortunately, it doesn't work with ABCs, but there are workarounds for that.
So if you work with ABCs, for example, if you want to test whether something is a float or an int, and you want to put that kind of logic into an ABC, then there are ways to still make that happen. There are some things that don't work with the match statement. Some are a bit unfortunate, because, for example, if you use a scripting or shell language like bash, a very, very common use case for matching is regular expressions. So basically, you have a case, and then you put a regular expression there, to match how the string should look. This is not supported directly. There are ways to work around this. I'm going to show you a reference later on where you can basically find how to do this. Something else that doesn't work well is set membership matching. There are ways, again, to work around this. You can use a guard to kind of do this set matching. The guard works by having the wildcard, so it always matches, and then it uses the guard to do the actual check whether something is in a value set. Or you can use the OR pattern. But the OR pattern is sequential, so it's not really efficient. Optimizations haven't been done yet, which is a very common theme that you always have in Python. First, something gets implemented so there is something to work with, and then, in the next couple of releases, people worry about performance and add better performance. That has happened a lot in Python's history. It's probably going to happen for this as well. So I talked a bit about the gotchas. I just want to reiterate some of them. This one I already mentioned: if you use the tuple notation or the list notation, and you think that this is just going to match a tuple or just a list, you can easily get this wrong. So if you want to do this explicitly, then you actually have to use the type notation for this. So you have to write list or tuple, and then the sequence that you want to match.
The same issue you have with the mapping types. So you have to pay attention to that as well. Another gotcha is the wildcard pattern. You can only use the wildcard pattern at the very end of the list. If you put something like it up at the top of the list, for example, if you start with case wrong_values, then, because wrong_values is a capturing variable, it's regarded as a wildcard case, and so it will match anything. And the parser will actually complain about this. So this is not valid Python. However, if you put a guard with it, then you can use it, which is probably in order to make certain workarounds possible. I don't really know what the reason is why this works. It's a bit strange. And then the parentheses. If you look at this code, if I hadn't put an arrow there, you probably wouldn't have seen it. What I did there is I put dict there, meaning that I want properties to have a dict, like a dictionary value, and I want to match that. But I forgot the parentheses. So what's going to happen is the parser is going to regard this as a binding, sorry, capturing variable. So it's going to put the value into dict. And then it's not only going to not parse correctly, because it will just put any kind of value that you have there into this dict capturing variable, but it will also bind dict to the value that you have in there, possibly breaking code that comes afterwards, because you can no longer access the built-in dict. So this is something to watch out for. And finally, this is the talk that I wanted to mention, by Raymond Hettinger. Who knows Raymond Hettinger? Not that many people. That's strange. You should definitely look him up. I mean, he has done so many good talks. It's just incredible. If you want to learn something deep about how Python works, he has a whole stack of talks. So definitely have a look at that. He did a great talk at PyCon Italia 2022, also on pattern matching.
And he shows a lot of tricks on how to work around some of the deficiencies that you currently have in the match statement. So, I was actually faster than I thought, so I'm done. Yeah, this is always my last slide: never stop learning. Always learn new things, always try out the new stuff that comes out in Python. And I hope this talk will make you have a look at the match statement and maybe use it more, because it's actually quite useful. Thank you. Thank you, Marc. So now it's time for questions. I can see a few people with their hands raised. I will start here, and we will go up. So we have four people, at least. In one of your first examples, you first had to check whether this is a list, like with the list in the parentheses, and then two cases later, you try to match against the sequence. That means that this will only match if it's a sequence, but it's not a list, I guess. Like on your first slide, literally. The first one, like this one? Yes, this one. So in the third case, it will match if the thing is a sequence with three elements, but that sequence is not a list, because otherwise it would have gone into the first case. Is that correct? Given this one, yeah? Yes. Since you have a case list, oh, yeah. Yeah, so you're right. What happens here is that this will always match for lists. So if you put in a real, like a true Python list, then you will always go in here. If you have defined your own kind of sequence that's not a Python list, only then will it not match at the top. Then it will drop down here, and parse here. And as a follow-up: what happens if you put a generator in there? Can you match against generators? Because then you would kind of mutate the element while checking the cases. Would that work? This is a good question. I think if you put a generator in there, it will actually match the generator type and nothing much else. It won't actually call the generator to give back any values.
But it's a good question. I'm not really sure. It probably works like that. Hi. Thanks for the great talk. I had a question regarding the caveat you gave at the end regarding dict. Is there a proper way to do it, like putting parentheses, or is it not possible to match a type inside of a hash map like that? Let me just find the slide. This one, right? Yeah, that one. So what was the question? So here you put dict, and you said that, of course, it will overwrite, let's say, the Python dict. Would it be possible in that case to put parentheses to match the type here? Yes, of course. And the code is actually written in a way where this would have been the intention, right? So the intention was that properties, well, it's matching a mapping, right? So if you put in a mapping that has properties as one of the keys, and as a value has a dictionary, then this will match. Without the parentheses, it will match any mapping that has a key properties, but not actually look at the value, and simply just put the literal value into the variable dict. That's what happens. OK, I think I see you up there, right? Yes, hello. I was wondering, with this capturing variable, it can sometimes lead to ambiguity. So I was wondering how well this would work with the existing typing system, where you would, for example, have an object like dict that represents the type. So that is something that I did not really cover in here, but perhaps you noticed the syntax that's being used here is actually somewhat different from the type annotations that you have in Python. So those are two distinct systems working here. These types that you have here are actual Python type objects that you work with, whereas the type annotations are being used by, for example, MyPy or other static code analysis tools to figure out whether something is correct or not. So this actually happens at runtime.
I don't know if that answers your question. Well, sort of, I guess. So you can't really put the typing types in here, let's say, because there are generics in there, of course, that would be highly convenient for matching. Right, right. I think that, I mean, in typing, you do have some actual Python type objects. Those you can use in here. But you cannot use the type annotation kind of syntax, for example, for matching an integer or something, yeah? No, it doesn't make sense, of course. That doesn't work. Thank you. Do we have any more questions? We have time for one last one. Yes, we do. Oh, my God, we have two. I'm going to the right side, because we haven't had many questions from there. I'm coming. Let's go. Thank you. So, yeah, maybe this is wishful thinking, but how difficult would it be to implement or to provide, like, a match that will match not in order, but will give me the best match? Would that be possible? Because, for example, I'm working on code generators for wrapping C and C++ into Python, and from C++ you get function overloads. So I can think, OK, I can take a function overload and translate that to a single Python function with a match over the different signatures. However, I need to know which is the best match for each case in order to order the match statement. Would it be possible to have that kind of logic embedded in Python, or is that too much wishful thinking? You can try to do this by ordering the cases from, you know, the longest match to the shortest match. But apart from that, I think this is actually a hard problem that you're describing there. Because if you want to figure out what's the best match that you have, then you actually have to go through all the different cases that you have in here, and that's going to have different semantics than what you have now in the match statement.
Usually the problem that I have the most is to know which is the most concrete type versus the base type, so that it matches the most concrete one instead of the base one, because it can match both. In C++, it will always match the most concrete one, and if that's not there, it will fall back to the base. So, for now, in Python I have no idea how I will solve that when I'm wrapping APIs. You can do that by ordering, like I said. You can order the case statements from the most abstract one to the most concrete one, sorry, the other way around, from the most concrete one to the most abstract one. Then, like in the example I just gave where you have a list: if you pass in a Python list object, it will match the first one. If you pass in, in this other example that I had here, let's say a user-defined sequence, then it will drop down and match that one, because that's more abstract, right? Thank you very much, Marc. Another round of applause for Marc. Thank you. Thank you.
Python 3.12's new monitoring and debugging API
It's time. So thank you very much to Johannes Bechberger. He will be speaking about Python 3.12's new monitoring and debugging API. For those who were in the previous talk, there was a brief look at the profiling features. Johannes is a JVM developer working on profilers at SAP. He also writes blog posts about profiling and debugging topics. Thank you very much. Thank you for introducing me. Before we start, I want to introduce you to the concept of debugging, because I'm sure none of you has ever debugged. So the first bug, like that, was found in the 1940s, when they found a moth that was in between the relays, and it went zip, and the whole system crashed, because in the olden days it was relays. As Edsger Dijkstra once wrote: if debugging is the process of removing software bugs, then programming must be the process of putting them in. As I'm sure all of you are doing lots of programming, I'm sure you're also doing lots of debugging. So that's why we're here. So consider this example program. It's a counting program: it just counts the lines, in this example in itself, and it returns zero. And we're like, why? That's a problem, because the file actually has 26 lines. So let's look at the code. I'm only going to glance over the code and keep this short, so you don't see yet what it's about. But the idea is that we work through this using a debugger, because a debugger is great for understanding our system. And the cool thing is, with the new APIs that we get in Python 3.12, writing a debugger is far easier and far faster, as I'll show you in the following. I'm Johannes Bechberger, as you already heard. I work on SapMachine at SAP, which is the third biggest contributor to the OpenJDK, which is like the major Java runtime. And I started talking to people about Python because I also like Python; it's a bit easier a VM than the JVM. The question is now: why do we need a monitoring and debugging API? Because I'm from the Java world, and in Java, we have a built-in debugging API.
So we have the ability to set breakpoints, to ask for values, to walk the stack, everything. But in Python, does the Python interpreter know about the concept of breakpoints? So, a question for you here: who of you thinks that the Python interpreter knows about the concept of breakpoints? Please raise your hand. And two of you think it doesn't know about the concept of breakpoints. Okay. It's a trick question, of course. No, because otherwise I wouldn't be asking this question. So it doesn't know anything about breakpoints, which is not a bad thing. So, any ideas how we could implement it? The first idea that came to my mind was: we have this code, this is actually the code that was part of the program. So the idea was, we just place a debug statement in front of every line, like a dbg method that we define somewhere. The idea is simple: in the dbg method, we check, are we currently at a breakpoint in this file on this line, and if yes, we open some kind of debug shell. If you've ever used PDB before, that's essentially the kind of shell we could be opening. But the question is, how do we get this file and line? And the easy answer is, we have this _getframe method. It has an underscore, and that's important, because it's kind of a CPython implementation detail, which is great. Because of that, it's pretty slow in PyPy. But we have to live with it, because that's the only way we can walk the stack. We've seen before that we can do some eBPF stuff, which is nice, but usually most debuggers don't do it. So the idea is, here we have our stack: at the bottom is the main function, then count_lines, then is_code_line, and then our dbg method. And essentially what we can do is ask _getframe for frame zero, that's the top frame, but we're currently in the dbg method itself, and so we ask it for frame one.
And we can also get the other frames. Essentially, what we can get is information like: what are the local variables, what is the file name, what is the line number, and such. And that's quite nice, because this allows us to easily implement the debugging shell, because we can just open a shell that contains these locals. And so that's how we implement our first dbg method. And it's nice, and it works, and we can even write some basic debuggers with it. The problem is, we want to automate this, because we don't want to put this dbg statement in front of everything. So how do we do this? First, I'm going to tell you about the pre-3.12 way, so you know the pain points of debugger developers. The pre-3.12 way was the way of sys.settrace, which was an arcane way to do it. The idea is, essentially, we pass it a handler, and this handler is called multiple times. This handler gets passed the frame and an event type, which could be call, line, return, exception or opcode. So when we register a specific handler, this handler is then called at every call. So every time the method count_lines is called, this handler here is called. And that's nice, but we want to know more. We also want to get a handler called on every line. So what we can do is return from this handler an inner handler that's called for every line, and this has the same signature. So the idea is, we essentially implement our debugger here not by manually writing the dbg call, but just by setting an inner handler that is called at every line. And that's quite nice, because it works, but as you might expect, and as I'll show you later, it's quite slow. But it's okay. We can even go down to the opcode level, to the bytecode level, here. But the problem is, and the question here is: do we need a line event for every function?
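As a rough sketch of the pre-3.12 sys.settrace mechanism just described, assuming a made-up demo function: a global handler is invoked on every call event and returns an inner handler, which then fires on every executed line:

```python
# A minimal sys.settrace sketch: the global handler sees "call" events
# and returns a local handler, which then records every "line" event.
import sys

hits = []   # (function name, line number) for every line event seen

def local_handler(frame, event, arg):
    if event == "line":
        hits.append((frame.f_code.co_name, frame.f_lineno))
    return local_handler        # keep tracing lines in this frame

def global_handler(frame, event, arg):
    if event == "call":
        return local_handler    # enable line tracing for the new frame
    return None

def demo():
    x = 1
    y = x + 1
    return y

sys.settrace(global_handler)
result = demo()
sys.settrace(None)              # always switch tracing off again

print(result)                              # 2
print(sorted({name for name, _ in hits}))  # ['demo']
```

This is exactly why the approach is slow: every call of every function gets traced, whether a breakpoint lives there or not.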
Because we know, when we set a breakpoint somewhere, we only need to check the lines there. But, for example, consider that we have here this count_lines, and our user decides to add a breakpoint while we're in is_code_line. So there's a breakpoint in is_code_line, and in is_code_line the user decides: hey, I want to add a breakpoint in count_lines. The problem is, if we haven't returned an inner handler for count_lines at its call event, we can't enable line tracing for count_lines anymore. So we have to enable it for every line of our whole application, which is kind of a mess. So this is slow. So there are multiple ideas how to improve this, and one of the best ideas, if you're a Python core developer, is to add a new API. This API is called sys.monitoring, and it's defined in PEP 669, and it's really, really cool. The cool thing is also that this PEP is written in a style that you can easily digest. I come from the Java world, and this is not always the case with a Java JEP, so I'm quite happy that Python does things a little bit better. The PEP is called "Low Impact Monitoring for CPython", and hopefully other runtimes will support it in the future. And it's here since October. So the idea here is that we have more fine-grained support, that we learn from the lesson that having to enable the line handlers for every line is probably not the best thing. Typically, when we use this API, we define some shortcuts at the top, so we don't have to write sys.monitoring all the time, because that's where the monitoring functions live; we call it mon, and sys.monitoring.events is also a bit long, so we shorten that too. Then we have to acquire one of the tool ids. The idea is that you can have multiple tools registered here, and for each tool we register some callbacks.
So what we do here in our example: we register a callback for our tool. Our tool is a debugger. There are like six possible tool ids, and one of them is a debugger, another one is a profiler. And we register here a callback for the line event, because we still want to have line callbacks sometimes. And we also register a callback for the start event, when a method is called. Then we have these handlers, and the start handler is just passed the code object, that's what you get from a frame via f_code, and the offset where it is located in the bytecode. And for the line handler, we get the line we're in. And the cool thing is, we can return from this. As you see at the bottom, the line handler can return either DISABLE or anything else. And the cool thing is, when we return DISABLE, it's disabled all the time for this specific line, and that's cool for coverage. So we can also make coverage testing easier. So, yes, we then enable the start events, and that's fine. And then we run our program, we get the start event for every function that we call, and every time we ask: hey, do we have a breakpoint in this function? If yes, we enable the line events here, but only specifically for this function. And then, for every line, we check it. The cool thing is that these line events are emitted per thread. The idea with sys.settrace was that it was emitted like in the main thread, per interpreter. But here it's emitted for every line in the thread that the function is currently executing in. And this is really cool. And Łukasz Langa wrote in a discussion: the biggest opportunity of PEP 669 isn't even the speed. It's the fact that debuggers built on top of it will automatically support all threads and support threads properly. And with the incoming changes of PEP 703, with making the global interpreter lock optional in CPython, this will get far more important.
Because then we will probably see multi-threaded Python applications, and then the old approach is just not usable anymore. So the idea is that we can enable events globally and locally, and the union of both gives the enabled events per function. The cool thing here is that the power is in the fine-grained configuration: you can set events for a function f while this function is running. So consider this example here. We have a simple line handler, registered as a callback for each line. Then in f you decide at some point: hey, I want to set the local events, I want to enable line events. And later you disable them again. And it works: here we emit hello, and then we emit, hey, we're at line 18, which is the line that prints n, and then we print n. That's really cool; that wasn't possible before. That's really great because it enables us to only enable line events for the functions that will need them. And the question is, of course, what's fast? There are several methods in this PEP and this API. What's really fast is registering callbacks: we can easily switch out the callbacks, and also get_tool, so we can say, oh, please give me the tool with this tool ID. What's kind of fast is setting local events, because what it does is modify the bytecode the VM is executing. And what's pretty slow is using use_tool_id to start the debugger and setting the global events, because this potentially recompiles or modifies the bytecode of every function. So do those only at startup; then it's fine. So back to the debugger. We had our start handler and our line handler, and they look essentially the same as before; the only difference is that we enable the line breakpoints only when they're needed. So there are different event kinds, because we've seen that line events are pretty powerful for implementing basic debuggers. One of them we've seen already: the start events.
There are also resume, return and yield events for everything that you do in your Python application. And there are also ancillary events: these are events that you can't enable or disable on their own, because they are controlled by other events. For example, you have the call event that is triggered whenever you call a function, and then we have the C-related ones: C_RAISE, whenever an exception is thrown in C code, and C_RETURN, whenever a C function returns. And there are of course also other events that cannot be enabled locally but only globally; essentially the idea is that we cannot localize them properly. The cool thing here, maybe you've seen it, is that we have a new event called STOP_ITERATION, because, I think it was in this Python version, iterator returns changed: when returning from an iterator, previously an exception was thrown, but that's pretty slow. So an exception is no longer thrown internally, but to still be able to observe this when debugging, we have a new STOP_ITERATION event. Of course, what you'll all be waiting for is performance, because the performance is, besides the threading support, the thing that's pretty neat. So what I did: I looked around and found some people doing performance measurements, but they were using Fibonacci functions, and I'm like, that's a bit small, that's not representative. So I started looking into Python benchmarking suites, and there's this pyperformance benchmark suite, and I just hacked it; I just wrote random code into it, because you can do some kind of monkey patching in Python, and it's great. In Java we have private functions and everything; in Python you don't have to care, and that's why you like using Python: you can do things that you're not supposed to do to get some results. If you want: I'm using Python all the time when fixing bugs in the OpenJDK to write test suites, because it's faster in Python than doing it in Java. So in the OpenJDK, some bugs were fixed because I wrote some weird Python scripts.
But essentially what I then wanted to test is the minimal implementation of a debugger. The minimal implementation with sys.settrace, a debugger that doesn't have any breakpoints, is just a call handler, without an inner handler called for every line. The minimal implementation for the monitoring API wouldn't enable any line events, because when we don't set a breakpoint we don't need any line events. That's how we'd implement this, but I thought it's a bit sneaky, because we're comparing something that triggers an event per line with something that only triggers an event per function call; so I also made a third comparison with all the line events enabled, and it turns out that's still faster, which is quite nice. So I used this pyperformance benchmark suite, which is quite representative, and what I found is that it's really, really fast. With sys.settrace on, when you run all the benchmarks, you have a 3.5 times larger runtime. That's pretty slow. When you're using monitoring, you only have a runtime increase of a factor of about 1.2, so 20% slower, which is pretty, pretty awesome, because this means you can debug all your tight loops, you can debug your whole application, without worrying about the debugger slowing everything down. And when you enable all line events, it's still about 30% faster than with sys.settrace. And you here probably like charts. These are essentially all the benchmarks that are in pyperformance, and the orange bars are the bars for the monitoring solution. Everything is normalized to one, so if a bar is not visible it means you don't have any overhead in this benchmark. But you see that the blue bars, for sys.settrace, can get high, up to like 10 or 12, so the improvement is really good, at least in my opinion. And when we switch over and use sys.monitoring with all line events enabled, it gets worse, but we still see that it's significantly faster.
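As a rough illustration of the kind of overhead being measured, not a reproduction of the pyperformance numbers, a toy micro-benchmark comparing a tight loop with and without sys.settrace line tracing might look like this (the workload and iteration count are arbitrary):

```python
import sys
import time

def workload(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def trace(frame, event, arg):
    return trace  # keep line tracing enabled in every traced frame

N = 200_000

start = time.perf_counter()
expected = workload(N)
plain = time.perf_counter() - start

sys.settrace(trace)
start = time.perf_counter()
traced_result = workload(N)
traced = time.perf_counter() - start
sys.settrace(None)

print(f"plain={plain:.4f}s traced={traced:.4f}s ratio={traced / plain:.1f}x")
```

Even with an empty trace function, every line of the loop body goes through a Python-level callback, so the traced run is noticeably slower; the exact ratio depends on the machine and interpreter version.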
Another question, of course, is whether this whole thing is used, because it's implemented, and I work on the OpenJDK, so I know that when you implement a cool feature, chances are nobody will use it for like a year. But here in CPython people started using it, especially vendors like PyCharm. It's not yet used in pdb, but IDEs like PyCharm, with their version 2023.3, use it, and they've seen significant performance improvements. And there's currently a pending pull request on GitHub, so if you want to help pdb implement it, go to this pull request and join the discussion there. I would really recommend it: CPython is an open source project and you can make pdb better, so what's not to like? Here's a quote from the pull request by Tian Gao, who wrote it: after this change we will have the chance to build a much faster debugger; for breakpoints we don't need to trigger trace functions all the time and check the line number. The bad news is that it's almost impossible to do a completely backwards compatible transition, because the mechanism is quite different. So there's an ongoing discussion on how to do this. You could take part there: scan this QR code, be part of the community, give something back and not just use CPython. Because I have a tiny, tiny amount of time left, I want to just show you shortly how single stepping works, because single stepping is just breakpoints. Essentially the idea is: to step out of the current function, we just wait for the next line event where the frame below is the current one, where the current frame has changed. Stepping over is also pretty simple: we just check that only the line number changed. And for stepping into, we wait for the next line event where a new frame has been put on top.
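The stepping rules can be made concrete with a small trace log keyed on call depth; this is a hedged sketch with invented function names, not the talk's actual debugger. Step over means waiting for the next line event at the same depth, step into for one at a greater depth, and step out for one at a smaller depth.

```python
import sys

log = []
depth = 0

def tracer(frame, event, arg):
    # Record (event, function, depth, line) so stepping conditions can be
    # expressed as comparisons on the depth column.
    global depth
    if event == "call":
        depth += 1
    elif event == "return":
        depth -= 1
    log.append((event, frame.f_code.co_name, depth, frame.f_lineno))
    return tracer

def inner():
    return 1

def outer():
    x = inner()  # "step into" stops at the next line event one level deeper
    y = x + 1    # "step over" stops at the next line event at the same depth
    return y

sys.settrace(tracer)
r = outer()
sys.settrace(None)

# "step out" of inner() would stop at the next line event one level up,
# i.e. back in outer() after inner() returned.
print(r)
```

In the log, line events inside `inner` sit one depth level above those in `outer`, which is exactly the condition a stepper checks.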
So that's all from me. You can find me on Twitter, and you can find my team at sapmachine.io, so if you want to use a JVM, use SapMachine; it's the best JVM. I'm contractually obliged to say this. We work at SAP; SapMachine is one of the many cool open source projects at SAP. You can follow me on my blog, where I write on debugging, eBPF stuff and everything else. So thank you for being here. Thank you very much, Johannes. We have time for probably two or three questions. Does anyone... we have one there? Thank you very much for this talk and for this PEP, because it actually solves a lot of problems I had when I started, back in the day, developing a tool for performance analysis for Python. However, at some point I chose to use the C interface of settrace and profiling; does your proposal, as it is implemented already, also support the C interface? So I have to correct you: I have nothing to do with the nice people who implemented this, sorry, so please ask them, they're probably in some Discord somewhere. I'm just telling you the good news, because programmers usually don't want to go to conferences and speak in front of people, so that's why I'm giving talks on this. So sadly, I don't know. Thank you. Do we have any more questions for Johannes? You can raise your hand. No questions, apparently. I just want to use this opportunity to thank Marc-André and David, also known as flypig, for organizing this devroom. You guys did amazing work. Thank you very, very much. And thanks Johannes again.
Opening Railways and Open Transport devroom
So, hello everyone. Hello everyone, thank you for being here and for making this room so full, even in the early morning. Don't be confused that I'm speaking into this microphone: it's just for the stream, so we have to talk a little bit louder here so that people online can hear us. I hope that's all right for you, and for the speakers too. Okay, so we, the organizers, are very thrilled that you are all here and that we collected so many interesting talks. We will shortly give an introduction into what we thought about this schedule and why the talks were selected, but in general we had so many good contributions and submissions to this devroom that it was really hard, and we hope that you will enjoy the program. We, that is people from different railway companies in Europe: like last year, when we first organized this room, we learned a lot and we also got to know each other, and we can say we officially founded the Open Rail Association just a few days ago. So this is one of the forums where we bring together people from the free and open source software community, but also from railway operators and from the transportation community, to work together, and it's great that we can do this for the second time in a row. And yeah, thank you. We, that is here: Louis Hamelon from SNCF, who will also give a short introduction in a moment; me, Max Mehl from Deutsche Bahn; Cornelius Schumacher from Deutsche Bahn; Peter Keller from SBB; Mahalia Steffan from SBB as well; and Simon Clavier from SNCF. So you see, we are quite international here, and we are the organizers. So Louis, do you want to give us a short intro into the day, what can we expect today? Yes, thank you Max.
So we tried to tell a story this year, and not just put some talks one after the other. We start first with something about data: about traffic forecasting and modelling the demand data. Then, what happens with the data: we try to simulate passenger behaviour with the MATSim tool, which is quite a fancy tool. And once we have this passenger transport model, we can use it for simulations and for building timetables, so we have three talks about a fancy tool called OSRD. So we will have three talks about this tool: one about the map, one about the running time algorithm, and the other one about the signalling system. And then how it is used, and how the community works with all of this, for railways and for transportation in general. Cool. That's it for me, so maybe we can start now.
Open standards, open data, open-source tools: their governance and future
So, hello everyone, and thank you for your patience. I am Tu-Tho, a project manager and expert contributor with ITxPT. I'm based in Paris, and why I'm here today is because I do believe that the tools and the data that we generate should be used to build communities and inclusion. You have below a couple of languages that you can ask me questions in; try not Japanese, because then no one in the room will understand. Thank you. So, ITxPT: who are we? We are a non-profit association, and we originally come from onboard units, where we created a standard for open architecture, data accessibility and interoperability, which means that basically the buses, the trains, the trams, the rolling stock would all be standardized to talk together, from the actual wiring of the vehicle: making sure, for example, that everyone uses the same internet connection and taps into the same feedback loop to the back office. We are a membership-based association with over 160 members in 28 countries: railway operators, public transport agencies and other associations. As I said, what we really do is build this architecture for interoperability, first and foremost. We also gather together a community of open source developers, aficionados and passionate people, which is why we're here. And we also have a label for compliance, making sure that when people use standards they're not alone, and that they can actually check that all their different units are compliant with the standard; so from the buyer's perspective you know that it fits the norms that exist. And I'm happy to have Breder with me. Yeah, officially I'm a product owner for a small team of 10 people in Norway, representing Entur, a company owned by the Ministry of Transport. We are a non-profit. We work building open source tools.
We are using publicly funded money in our development, and we want to give back as much as possible to society, both with open data and open source, collaborating with stakeholders in Europe, in Norway and internationally. What we say we do is build an open infrastructure platform: the road authority builds roads, someone builds the harbours, the airports, electricity, water supplies; we build an infrastructure platform for mobility data. Open source all the way for my part of Entur, and I advocate for that for the rest as well. So on this slide we wanted to show you a little bit of what exists today when we talk about data that relates to transport, public transport and also railways, knowing that there are different types of standards and specifications. If you take the European context, you have this gigantic European norm called Transmodel, which is really to be viewed as a data dictionary and a grammar. So it's not an exchange standard; it's really a reference for you to cross-check concepts: how they integrate with one another, how they are articulated, how they are actually defined. And because it's a European norm, it is also translated into most European languages, which makes it easier to implement. Obviously a data dictionary, a data model, is nothing if a data exchange format is not created on top of it. So there were two, and many more, open standards created based on Transmodel. One is NeTEx, which is timetable information: everything that is known in advance to describe the transport network, the schedules, the fares and so on.
You have SIRI for real-time information, or anything that is not known in advance where you will have real-time updates: vehicle monitoring, situation exchange, for example if you need to close railway or public transport services. And one that is upcoming, which we will start working on defining very soon, is OpRa, which is more about run statistics and performance, so public transport agencies and authorities can compare one operator with another. You also have, shown on the screen, GTFS Schedule and GTFS Realtime, which are probably the most used today across the world to describe timetables and real-time information based on those schedules; so if you use any trip planning app, a good chance is that it's actually based on GTFS data. Because we're here in the room, I would also like to thank a couple of colleagues, including Stefan, because right now what we're doing is actually bridging what was first created for urban public transport with the rail domain, through a European project.
So as I said, I wanted to place a little bit everything that is open standards and open source within railway and open transport, and basically it's just to show you that everything is linked. As a customer you usually just see the trip planning part, which is at the top right, where you actually want to go from A to B: you get your train schedule, your timetable and so on, and also real-time information: the train is cancelled, the service is disrupted, one is late, or just the tram arriving at the station. All of that is thanks to data issued from the back office, which, a lot of the time, especially for real-time data, is actually based on vehicle data, where the ITxPT specifications exist. So as you see, we really try to map out all the different standards and specifications that exist to build all of that; that's really the backbone. It was more for me to give you an indication that all the data you work on has been standardized, and the standards are open for you to participate in. Let's go back: in my next minutes I will focus on the upper right corner. The ITxPT standards, from the vehicle to the back office systems that produce real-time data, are used more and more in Norway, but I will not cover that in my presentation, because we are in the upper right corner, on the inside, and we have demand for finalized data that we can use; the ITxPT part is done by the operators. This is an overview of all the components my team is responsible for. You can split it into two: the input side and the output side. All of this is open source, open on GitHub; someone in the room, Hannes, is part of the team, so we are here to answer more technical questions too, if needed. And one French company has downloaded this from GitHub, won a tender, and serves a region in Paris. We seek collaboration with everyone: don't reinvent the wheel. Again, in this audience, how many produce data or want to produce data for the mobility sector? Ten?
Stefan, raise your hand, you want the right way to produce data and make it open, yeah? And the rest of you want to use data, yeah? In the middle of this week I was in another meeting where the EU countries talked and showed what they have done with open data regarding NeTEx and SIRI over the last year; nine countries showed up, and all of them have a lot of work to do. So let's go briefly into what we have done at Entur. We have focused on high quality data; we need that to produce good information for the travellers. The operators and authorities in Norway are responsible for three things. We have a national stop place registry, so they manually update that. They produce planned timetable data; in Norway we say no conversion of data, so they produce native NeTEx from the start. In this context NeTEx is similar to GTFS, but it supports a lot more data, and more operational data can be transferred. And we use the SIRI protocols for real-time updates, which is very similar to GTFS Realtime. We develop the OpenTripPlanner open source component; we do that in collaboration with that project, and it's a successful collaboration. It supports both NeTEx and GTFS, and it supports high quality data. At Entur we reached one billion requests in a month in January, in a country with five and a half million people, so you want to travel a lot. The API from OpenTripPlanner is openly available; you don't need to register, but if you do, you can have more access to it. Most of the main travel apps in Norway use that API, so we are getting the same information everywhere. The API should be relevant for different users, so it's flexible: the client can decide what is important to show to its users. The Ministry of Transport wants Entur to be a neutral one; the biggest national railway operator wants to show their offering first, so they show their offerings and not the competing ones, and in the same way other region-based apps show only their local area. And all of them are using the same API, getting the same correct information everywhere, and we have one place to correct it if it's wrong: on the left side, at the source. We also share data on the national access point, which is a requirement in Europe. There we share the API I talked about, we share NeTEx and SIRI raw data files, and we share GTFS and GTFS Realtime as well. All of this is openly available. And what we say to the Norwegian data producers: you have three responsibilities, the stop places, the NeTEx data and the real-time data, and we at Entur can take care of the data being correct in all the apps, including all the international apps. So we say to them: deliver data to us and we can ensure that it is correct, also in Google, which is important for them. We also see that the data producers want to use the data they have delivered to us, which we have merged together with other data, and we have quality validation tools. To do that, they need more data than we need for public information: they need operational data. So we have added that into our data pipeline; it is supported in NeTEx, and it's not supported in GTFS, so that's the benefit of NeTEx compared to GTFS. And starting to use the data opens up the possibility to get out of lock-in situations, which is something usual in the public transport sector: they have a big, important software provider, they've had it for many years, and it's hard to shift and get out of it. By going to NeTEx and doing that correctly, it's possible to break that circle. To handle the extra data, we do that with one validation, as in the previous slide, but in the open data we remove all the data that is sensitive, the operational part, and we give them access to that in a different data set; this works pretty nicely today. Open source tools: I can take OpenTripPlanner, I'm one of the people leading that. OpenTripPlanner is an open source tool started back in the US, 13 or 14 years ago.
It was a successful trip planner from the beginning, with increasing usage worldwide and a lot of functionality added. After 10 years, when we started to use it, we saw that it was built for big cities: when we loaded a graph with all the data from Norway, the latency wasn't usable. From Oslo to Bergen: 10 seconds until the search gave one answer. We decided, together with the community, to build a new version, and Entur took the lead on that development. The first two years we did the development alone, but we had meetings with the collaboration so we made sure we were on the same path. Today around 10 to 15 companies actively develop on it. We do that together in the same master branch, we have regular product owner meetings to discuss the direction, and we share resources. OpenTripPlanner is a multimodal trip planner and supports all kinds of modes; we are still not finished with it, but it works, and we can collaborate even more. And it supports the standards we talked about today: NeTEx and SIRI, and GTFS and GTFS Realtime, all in the same instance, so you can use those standards together. And then we're getting to the part that might interest you the most: we wanted to present to you today all the open source tools that exist, and others that need to be built, and hopefully some of you will raise their hands and help us build them, in the sense that it is good for the ecosystem. What exists is mostly thanks to the amazing work done at Entur, because everything is open source. They have, as Breder said, an open stop place registry, so all the stop points. In Europe, pushed by the European delegated regulation, we have national access points, a kind of open data platform for every single one of the 27 plus three European countries, where you can find a lot of data sets, and not only public transport: some of them, for example in France, have the registry of all the places where you have carpooling. You can also have descriptions of
bicycle lanes and so on. If some of them require you to create a login, a user and a password, it's mostly to try and keep up the KPIs of how many people actually use the data. And you have a lot of other open source data libraries: you used to have TransitFeeds, which is now called the Mobility Database; you have the mobility hub for GBFS, and so on. At Entur you also have a data creation tool called Nplan, to really create your schedule and your data in NeTEx. And for NeTEx we have validation tools that are fully open source: two developed by Entur, and Greenlight, developed through the European project DATA4PT, which is basically there to check whether your NeTEx feed is correct against the XSD schema. Then you have a lot of other smaller open source tools, such as the ones created by DG transit, Arcadis, ABI and so on. But what we wanted to show is: those are tools that people created, a lot of them within their companies, within European projects, within their initiatives, because they answered specific needs. However, now that we get more and more open data, we need to create more tools. So, ideas we had while discussing with a lot of people, but we're happy to hear your thoughts: graphical representations of NeTEx and SIRI feeds, for example; conversion tools from NeTEx to MERITS, which is more on the railway side; bridging the different open source validation tools that exist; or analytical tools. So that's it for our presentation, and mostly we want to hear from you: if you have questions on some tools we could develop, or how to actually extend GTFS or GBFS or NeTEx or SIRI, how we work with the railway industry, and OSDM, which is also one we did not have time to present today. So the floor is yours. Actually, I have a question about NeTEx, because it was kind of small on the slide: it's the Nordic profile that's there. What are the differences between profiles, and what does that mean for compatibility with other countries? So your question is on the compatibility of
the different profiles; I was asked to repeat the question for the live stream, so perfect. NeTEx is a huge standard, made of more or less theoretical parts, so almost every use case you can think of in public transport is supported in the standard. Within the standard it's allowed to model a specific use case in different ways, and it's still valid. So to make data interoperable and usable for third parties, we need to have profiles. The regulation came at the same time that NeTEx became valid: yes, in Europe this is a valid standard, and the regulation came, but the profiles didn't exist yet. So the UK started with a small profile first, then France built a bigger one, and then in 2015 Entur came, with aims that were extremely high: we wanted to support what I showed in the slides, both the information part and also the operational part, for the rail operators, which is complex, and also for bus operators and more. So the French profile was based on the UK one, we based our profile on the French one and added stuff, and a couple of years ago the EU profile came, which is almost the same as the French one with small differences. The Nordic profile supports all of the most important parts of the European profile, but the operational part is not supported and not needed in the EU profile; those are the main differences between the Nordic profile and the EU profile. What we see now in Europe is that too many countries build their own profiles. We also started this way, with a Norwegian profile, but in collaboration with the Nordic countries, Sweden, Denmark and Finland, we asked our neighbouring countries: does our profile support your use cases? If yes, thumbs up, we collaborate; if no, come to us, ask us, and we'll see what the difference is. And they came back; they had some small things that weren't supported, so we added that to the profile, and they had some ideas, this is a better way of solving this use case, and we changed it.
And that was while we were in live production development, with different stakeholders producing the data, and we changed the profile, and it went back to the left side of the previous presentation: you have to change your export. Oh, you're going to have to give them money? No, you're not getting any money; you have to change it. We stopped the validation: you are not allowed to produce data into our production if you don't change this. And then they changed it. The difference there is: today we see that it's very hard to use NeTEx data from a different country in the same system if you don't do something, so that's what I spent the last week highlighting. Yeah, we need to solve that as well. Was that answering your question? Yes. Thank you.
Rust-transit: libraries to manage transit data in rust
So, we can start with the next presentation. Right pronunciation? That's right. We will talk about the Rust libraries for public transport, so the stage is yours. Yeah, hello. Thank you. And I want to thank the people before, because Transmodel helped a lot to make a nice model of how transit should be named: what's a stop point, a stop area, a trip. So, if you ever work with public transit, read the Transmodel model; it helps a lot just to make things clear. And also, OpenTripPlanner works quite well. We've been experimenting recently, using it on the whole of France, so it's a bit bigger than Norway, and it seems to be working, so they do some nice things; we have to talk afterwards. So, I'm talking about the very other extreme end of public transit data: some very small tools and libraries to manipulate this data. And it's in Rust, because, well, if you're handling a few gigabytes of data and want real-time data, Rust might be an option, and that's what we've done. So, we are a very open and informal organization: on GitHub, it's rust-transit, and we want to make a lot of small bricks just to get you started using public transit data, and then you can do whatever you want to do. So, it's not very formal, we have no statutes, nothing; it's just focused on implementing things: gradually adding more implementations and getting more things working. And this presentation is kind of a call, saying: we're looking for some projects to add to it, and for maintainers, and maybe some people will say, okay, I have this very specific need, come and see us and let's talk. Right now, it's mostly, but not only, maintained by volunteers in the cooperative where I work, which is called Codeurs en Liberté, and there's a colleague over there who can also answer your questions. So, the first crate is gtfs-structures. It's the most important part, the biggest one, so I will be spending a lot of time on it, and the other bricks are a bit smaller.
So, GTFS, as was said before, is the de facto standard used to publish static transit data: what time will the bus run next week, at what stops will it stop, and so on. We started, initially for our own project, by defining the types in Rust, the structs and so on. For those who work with Rust, it's basically just types with serde serialization and deserialization annotations. As time went on, we added some sugar, like reading directly from a URL, which was a common need because, let's say, Norway publishes it on their website and we just download it and have the data immediately. We started adding some integrity checks, because it's just plain CSV files, so identifiers might reference data that doesn't exist; we added those checks. And we tried to make it easier to access the data from one object to another. I want to mention one alternative, which is transit_model, made by a French company now called Hove, which used to be called Kisio Digital, and which used to be called CanalTP, depending on how old you are in the transit world. It's a library under AGPL, so that might be a problem. It does many more things, like file conversion: it's able to convert GTFS to NeTEx and the other way around. It has very nice query functions, like: tell me all the lines that go through this point. But it's a bit more complex to use. It's mostly made for their own tools, so it's not always very documented, and you have to read the code to know how it works. In a perfect world, it would be based on gtfs-structures. We started discussing with them, but it would break too many little things on their end and they didn't want to bother, saying: it works for us, don't bother with it. So, some user examples: transport-validator, which is made for the French national access point they were talking about before. So transport.data.gouv.fr has a validator that checks that every GTFS file is valid.
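As a toy illustration of the kind of integrity check being described (written in Python here, not the Rust crate itself), this sketches a referential check that stop_times only references known stops; the data, file fragments, and helper name are invented:

```python
import csv
import io

# Hypothetical in-memory GTFS fragments; in a real feed these are
# stops.txt and stop_times.txt inside the zip archive.
stops_txt = """stop_id,stop_name
S1,Central
S2,Harbour
"""

stop_times_txt = """trip_id,stop_id,stop_sequence
T1,S1,1
T1,S2,2
T1,S9,3
"""

def check_stop_references(stops_csv, stop_times_csv):
    """Return stop_ids referenced by stop_times but missing from stops."""
    known = {row["stop_id"] for row in csv.DictReader(io.StringIO(stops_csv))}
    missing = []
    for row in csv.DictReader(io.StringIO(stop_times_csv)):
        if row["stop_id"] not in known:
            missing.append(row["stop_id"])
    return missing

print(check_stop_references(stops_txt, stop_times_txt))  # → ['S9']
```

Plain CSV gives no such guarantee by itself, which is exactly why a GTFS-reading library has to bolt these checks on.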
It checks, for instance, that it doesn't have buses that go faster than the speed of light, and things like that. So it's based on gtfs-structures. There is also a small tool of our own, gtfs-to-geojson: some people just don't care about the timetable, so they take a file of timetables and just extract the topology of the network. And another project is Catenary Maps, which is kind of a big student project from a university in California, I think; they are trying to make a whole system and they're contributing a lot. A small vanity metric: we have about 15 contributors, which is at once a lot and not much for a project that isn't really publicized. And we regularly have people who just happen to use it and make a contribution. So it's living at a small pace, but it works. What I also wanted to say is that it's quite performant. We tried to find the biggest GTFS that's out there in the wild, and apparently it's the German one. There are 600,000 stop points in Germany, at least in the GTFS file. There are one million trips; a trip is a bus doing its route, and if the bus runs 10 times a day, that's 10 trips. And around 32 million stop times. So it's quite a big file, and just to get everything read from the GTFS file into memory takes about 16 seconds on this laptop, whatever that means, and about 5 gigabytes of RAM. It means you can handle the whole data of Germany on your laptop or on a reasonable, affordable server. It's also quite robust. As I said, every file on transport.data.gouv.fr, which is the national access point, is parsed using gtfs-structures. It has data that comes from a lot of different editors and vendors, which are present all around the world, so we have worked through all the quirks, all the weird things that people do. Like, in GTFS you're allowed not to put the trailing commas: if you have 10 columns in the CSV file and have data only in the first two columns, you can just put one comma and leave everything empty at the end. That's the kind of thing we went through.
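The trailing-comma quirk just mentioned can be sketched concretely. This is an assumption about how a lenient reader might handle it (not the actual gtfs-structures code): rows shorter than the header are padded with empty strings before being turned into records.

```python
import csv
import io

def read_lenient_csv(text):
    """Parse CSV where rows may omit trailing empty columns."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    records = []
    for row in body:
        # pad the missing tail so every record has every column
        row = row + [""] * (len(header) - len(row))
        records.append(dict(zip(header, row)))
    return records

# A stops.txt with 4 declared columns but only 2 filled in:
data = "stop_id,stop_name,stop_lat,stop_lon\nS1,Main St\n"
records = read_lenient_csv(data)
```

A strict reader would reject such a file; a robust one, like the talk describes, has to accept it because real feeds contain it.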
And I'm using this as a side note, as I have an audience that might be interested. The GTFS format was created as a dump of a database: just dump all the tables to CSV files and bundle them in a zip file, which is a horrible thing to do. It's nice to exchange with a colleague as a one-off, but not to make a standard. So in the future, if you ever work on this kind of thing, don't make zips of CSV files, please. Use, for example, a SQLite database. You can have a schema, so you'll be sure the data respects it: an integer will be an integer, a Boolean will be a Boolean, and so on. You have foreign keys, so you won't have wrong references. You have typed columns. You have indexes, so you open the file and you can make fast queries immediately, for free. You have a query language integrated: you just download the file, open it, and you can already compute some statistics. And you have some fun things: you can put everything on an S3 server and make HTTP range requests, so you don't even have to download the whole gigabyte of file, just 10 megs, and you have all the data you need. So, yeah, think about how you serialize your data at the end, because people will use it, and it will bring a lot of pain if you don't think about it. Okay, that's it for gtfs-structures, which was clearly the biggest part, and now some smaller projects. One crate (a crate is a package in the Rust world) is siri-lite. As you were told before, SIRI is a standard, a norm in Europe for real-time data, and SIRI Lite is a simpler version of it; I think it's open to heated debate whether it should exist or not. SIRI was mostly used as a SOAP interface, over XML, and SIRI Lite is the same data, but serialized as JSON and served over REST.
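The SQLite suggestion above can be made concrete with the standard library. This is an illustrative schema (the table and column names mirror GTFS, but the schema itself is my own sketch, not a proposed standard), showing the two benefits the speaker names: typed columns and foreign keys that reject dangling references at write time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK checks are off by default
conn.executescript("""
CREATE TABLE stops (
    stop_id   TEXT PRIMARY KEY,
    stop_name TEXT NOT NULL,
    stop_lat  REAL,
    stop_lon  REAL
);
CREATE TABLE stop_times (
    trip_id        TEXT NOT NULL,
    stop_id        TEXT NOT NULL REFERENCES stops(stop_id),
    departure_time TEXT
);
-- indexes make queries fast as soon as the file is opened
CREATE INDEX idx_stop_times_stop ON stop_times(stop_id);
""")
conn.execute("INSERT INTO stops VALUES ('S1', 'Main St', 48.85, 2.35)")

# A dangling reference is rejected up front instead of surfacing
# later in every downstream consumer:
try:
    conn.execute("INSERT INTO stop_times VALUES ('T1', 'NOPE', '08:00:00')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The resulting single `.sqlite` file can then be served over HTTP, where clients can fetch only the pages they need via range requests, as mentioned in the talk.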
And we used it initially to convert from GTFS-RT, the real-time GTFS data, to SIRI, for the French National Access Point, to be able to expose the data in the European standard. And I also used it in a small toy project where I read all the data of, sorry, not French, the greater Paris area, Île-de-France, to have some dashboards about the real-time status of a stop or a metro line. So it works quite well even with some big data; I mean, Île-de-France is twice as big as Norway, and it works. The standards are well done, and we got things working. Another one, which started really as a toy project, is osm4routing. When people see OpenStreetMap, they say: oh, nice, a road network, let's implement some Dijkstra algorithm on it, because I want to play around with it. And if you go into the OpenStreetMap format, you see that it was meant for mapping and not for routing. The simplest example is a way: it can be a road that runs for 100 kilometers, and it doesn't stop at every intersection. So, if you want to do some routing with that, that's a very bad graph. So the idea of this small tool is to cut the ways into a graph topology, as we learned as students, to run routing algorithms on it. Initially I made it just for a toy project: a spanning tree of all the roads from Tokyo to every corner of Japan, making this kind of tree-like structure, so nothing useful. So it's meant for toying around, like the project I told you about. If you want to just try some algorithm because you're a student and you want to run it on real-world data, it's very nice. It's also used by OSRD, the open-source railway designer; I think there will be a presentation about it afterwards. Sorry, I'm bad with acronyms. They want to do the same with railways, and for railways there was no tool to do it. But be aware: don't use it if you want real-life routing on roads, for pedestrians or for cars.
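The graph-cutting idea behind osm4routing can be sketched in a few lines (this is my own toy version of the concept, not its real API): an OSM way can run through many intersections, so for routing you split it at every node it shares with another way, producing proper graph edges.

```python
from collections import Counter

def split_ways_at_intersections(ways):
    """ways: {way_id: [node, node, ...]} -> list of (way_id, segment)."""
    # a node used by more than one way is an intersection
    use_count = Counter(n for nodes in ways.values() for n in nodes)
    edges = []
    for way_id, nodes in ways.items():
        start = 0
        for i in range(1, len(nodes)):
            # cut at shared nodes, and close the last segment at the way's end
            if use_count[nodes[i]] > 1 or i == len(nodes) - 1:
                edges.append((way_id, nodes[start:i + 1]))
                start = i
    return edges

# Way A runs straight through node 3, where way B joins it:
ways = {"A": [1, 2, 3, 4], "B": [5, 3, 6]}
edges = split_ways_at_intersections(ways)
```

After the cut, node 3 is a shared endpoint of four short edges, so Dijkstra or any textbook algorithm can actually turn at the intersection.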
There are much better tools, and there are tons of constraints it's not able to handle, like a left turn being slower than a right turn, and things like that. So, use OSM for routing for very specific needs or for learning, but don't use it for real-life routing algorithms, and use those very nice open-source tools that exist. And that's pretty much it. So, thank you. If you want to work with Rust and transit data, we're quite open, and I hope we're friendly. So, don't hesitate to contact us, and let's slowly grow this toolbox for your needs. I saw a screenshot of a departure board in Chinese. Are you planning to also integrate data from outside of Europe? Well, that was just a Creative Commons picture from Wikimedia. But seriously, we are not specific to any region; I mean, GTFS and SIRI might be used anywhere. GTFS is quite used all over the world, and NeTEx is more European; I think SIRI has gained some traction around the world because it's more usable than the GTFS-RT part, which is very focused on very big infrastructures and not always used. So, maybe, it's possible. More of a piece of information: you might be happy to hear that SIRI Lite has been officially approved. So it's no longer just a kind of French version of SIRI on the side, because Île-de-France Mobilités asked for it. Okay, thank you. That's a nice thing about all this transport work: it's like building a shared vocabulary and agreeing on what each word means. It makes it easier to cross boundaries or formats and things like that. It's always a bit tricky, but nice to hear. How widespread is Rust in the transit industry? Like, there's another project, Transit Model; is that also Rust-based? Yeah, Transit Model is Rust-based, yes. And it's used for... Hove makes a routing engine called Navitia, which is heavily used in France and which is written mostly in Rust nowadays. It started as C++; well, it started as Delphi, but that's a long time ago.
So, it's actually used, yes. Okay, thank you very much. Thank you very much. Thank you.
Counting on openness: Privacy-safe passenger counting
How many of you are still awake? Let's have a show of hands. Okay, 90%. Very good. Very good. How many of you work in mobility in your day job? All right, that's about 25%. How many of you would like to work in mobility as your day job? These are the superheroes of the next generation. How many of you have worked with passenger counting before? All right, five. Excellent. All right. So these are a few of the things that I want to highlight from the development of the Finnish national automatic passenger counting system. There are bits for everyone, and I'm going to be fairly speedy with these things; if you have questions, please ask them at the end. Let's get started. So I've been working in public transit for a bit over 10 years now, and in software development for a bit over 15. And I started my own consulting company five years ago, when I wanted to help more organizations as well. I just wanted to give you a bit of background: I come from the public transport side and not so much from the railway side. So, the basics. What is automatic passenger counting? It answers the simple question: how many people are there in the vehicle? There are two different kinds of messages that these vendors send. For example, they send how many people went in or out of a particular door. And then there's the option of telling how many people there are in the vehicle right now. And some vendors send both of these data, and then you have to decide... speak louder? Yeah. All right, thanks. So some vendors send both of these data, and then you have to decide which one you trust, the diff or the total. So why do we collect this? For the passenger, the benefit is quite obvious. You want to travel with the less crowded vehicles, mostly. You get this information from the passenger information systems, such as signage.
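The two message kinds described above can be sketched concretely (the field names here are illustrative, not any vendor's actual format): some devices report per-door diffs, others a running total, and when both arrive the backend has to reconcile them and decide which to trust.

```python
def occupancy_from_diffs(events, initial=0):
    """Accumulate per-door in/out events into an onboard count."""
    count = initial
    for e in events:
        count += e["in"] - e["out"]
    return max(count, 0)  # never report a negative occupancy

# Hypothetical stream of door events on one trip:
events = [
    {"door": 1, "in": 5, "out": 0},
    {"door": 2, "in": 2, "out": 1},
    {"door": 1, "in": 0, "out": 3},
]
from_diffs = occupancy_from_diffs(events)

# What a separate "total" message from the same device might claim:
reported_total = 4
disagreement = abs(from_diffs - reported_total)
```

When `disagreement` is nonzero, the backend has to pick a side; diffs drift over a day as single miscounts accumulate, while totals depend on the vendor's own internal reconciliation.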
But also, more and more, I expect that there will be automatic decisions made for you without you knowing about it. So trip planners already will suggest trips that are less crowded. And in the future, I think the general technology should already be there, but it hasn't reached public transit yet, so we can't yet recognize prams and bikes and such when they come in. But when we can, then we can tell you if your pram fits in the bus that you're aiming for. Now for the authorities, the public transport planners, for example, one of the most important things is to be able to understand where the masses are moving, so you can allocate the vehicle capacity where it's needed. For example, if there's a route and the last three stops in one direction are often very empty, then it might make sense to cut the route short and just increase frequency. Also, some of the trip planners show how many grams of CO2 you have released when traveling, and that depends on how many other passengers there are in the bus. Also, pandemic precautions were an important driver for the funding of these projects, to finally get these things funded and running. And also, when the passengers choose to even out the load, when they choose to go to vehicles that are less crowded, it means that transit becomes smoother, because there's less congestion in particular spots. Now, the situation in Finland before COVID, and before mobile tickets were very popular... aha, I'm hearing myself. Something cut me off. Right. The situation was such that there was not much incentive for developing these systems, because we got most of the information from the ticketing systems; at least we got the information on when people got on. But in 2020, six municipalities and the government put money together and pooled it into Waltti Solutions, a service development company for public transit purposes owned by the major municipalities of Finland.
And they pooled money together, and in 2021 they chose Futurice as the contractor. Futurice is an excellent service company in Finland, and I was privileged to take part in that team as a technical architect and lead developer. And in the next two years, more companies and organizations joined this project in many phases. I'm just giving you a bit of the background on it. I think, in hindsight, our main task was to reduce vendor lock-in and to reduce the costs of APC, because currently the high-quality sensors are costly: a typical stereo camera costs 1,000 euros per door, and 2 to 3 doors per bus means that it's quite expensive. And also, I mean, sensibly, many of the vendors want to offer an end-to-end service of providing data and analysis. But then it's hard to get rid of that vendor if you want to move on to the next system. So we interviewed stakeholders, held workshops, sketched out some architecture ideas, and we came up with a three-pronged approach. And the first one is that we created an API spec between the onboard counting devices and the backend, and we tried to make it easy to understand for companies that don't work with public transport in general. As a starting point, we took a format from HSL, the Helsinki Regional Transport Authority; the capital-area PTAs (public transport authorities) often have the most resources, so they were a bit ahead. They had this data format that was modified from an earlier data format of theirs, and we wanted to be compatible with HSL so we don't fragment the Finnish market. But it also had a lot of cruft for our needs. So I've split the JSON message into two columns here. The first one is more about the APC itself, and the right one is about the general public transport metadata, such as routes and operating dates and directions and such. And all of the data on the right side is available in the backend anyway, from some other source.
So, by reducing some of the fields and also trying to get rid of some ambiguities, we just added a schema version, a counting system ID to do matching in the backend, and a message ID to detect duplicates. And then we dropped everything that wasn't about the APC. These JSON messages are sent over MQTT, which is very commonly used in public transport, both on board and between the backend and vehicles. And I think this format allows any company that understands how to count people or objects to participate in this market. So it lowers the barrier to entry, and we're hoping that there will be more companies that offer counting devices. Okay, the second prong was to prototype new counting technologies. We asked two companies to develop new things, and one company to provide a reference device of something that already exists in the market. DILCOMP created object detection from security cameras, and AmpliCa used a millimeter-wavelength radar, presumably for object detection. And there are a couple of pictures on the upper right, maybe a bit small. This is a picture of the prototype millimeter-wavelength device, 3D-printed stuff hidden behind the ceiling panel. Now, unfortunately, we learned that 20,000 euros per new technology was not enough for us to create breakthrough technology. We managed to create the right format of data, but the values were not yet usable. But maybe some of you can figure this out. I hope you can. Okay, the third prong was that we created an open-source backend for this whole system, so there's also no great vendor lock-in to our team either. And here's the simplified architecture of it. Let's forget the left side for now, but in the middle, data comes from the onboard counting systems. It goes to the MQTT broker, and then we push it into Apache Pulsar.
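The deduplication step implied by that message ID can be sketched as follows. The field names (`schemaVersion`, `countingSystemId`, `messageId`) follow the ones described for the spec above, but the logic is my own minimal version: MQTT with at-least-once delivery can replay a message, so the backend drops anything whose ID it has already seen from that device.

```python
def deduplicate(messages, seen=None):
    """Keep only the first occurrence of each (device, message) pair."""
    seen = set() if seen is None else seen
    unique = []
    for msg in messages:
        key = (msg["countingSystemId"], msg["messageId"])
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique

incoming = [
    {"schemaVersion": "1-1-0", "countingSystemId": "dev-1", "messageId": "a"},
    {"schemaVersion": "1-1-0", "countingSystemId": "dev-1", "messageId": "a"},  # replay
    {"schemaVersion": "1-1-0", "countingSystemId": "dev-2", "messageId": "a"},  # other device
]
unique = deduplicate(incoming)
```

Note the key is scoped per device: two devices may legitimately reuse the same message ID, so only the (system, message) pair is treated as a duplicate.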
Apache Pulsar is a distributed append-only log system, a competitor of Apache Kafka, and it has been in use at HSL for six years now. And we also wanted to have synergy there, so that Waltti and HSL would have similar technology backends. The messages from the MQTT broker are deduplicated, and then they're brought into the journey matcher. The journey matcher also takes as input GTFS Realtime vehicle positions, which tell where the vehicles go and when they leave the stops. So the logic in the journey matcher is very simple in principle: you just accumulate the in and out values until the vehicle leaves the stop, and then you trigger an APC message with all of the public transport metadata that you need in the analytics. So routes and stops and directions and so on. The journey matcher pushes it over MQTT back to the provider of the GTFS Realtime API. So that serves the authorities. That's the raw data, or not raw exactly, but the accurate data as such. But it doesn't serve the public, because this is private data. This is mobility data of people moving about. Now you might think: okay, how many people moved through the door doesn't really match any individual. But that is not so. And on the left side we describe how we also need information from the vehicle registry, to pick up data about the vehicle models that we have, the seat configuration and standing places, and how we create an anonymization profile out of it. But for this part we needed help from experts. So we asked university researchers from the Finnish Center for Artificial Intelligence and the University of Helsinki. There's a professor, Antti Honkela, whose group focuses on differential privacy, and especially Joonas Jälkö, and also Raja, sorry... and oh dear. Okay, I'll get back to that. Joonas was especially working hard on this with me, and they created a method for the anonymization.
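The journey-matcher logic described above can be sketched directly (my own simplification, not the project's actual code): accumulate in/out counts between stops, and emit one APC record each time the vehicle-position feed says the vehicle has left a stop.

```python
def match_journey(door_events, stop_departures):
    """door_events: [(time, in, out)]; stop_departures: [(time, stop_id)].

    Returns one record per stop with the counts accumulated since the
    previous departure. Both inputs are assumed sorted by time.
    """
    records, i = [], 0
    acc_in = acc_out = 0
    for dep_time, stop_id in stop_departures:
        # fold in every door event up to this departure
        while i < len(door_events) and door_events[i][0] <= dep_time:
            acc_in += door_events[i][1]
            acc_out += door_events[i][2]
            i += 1
        records.append({"stop": stop_id, "in": acc_in, "out": acc_out})
        acc_in = acc_out = 0  # reset for the next stop
    return records

door_events = [(10, 3, 0), (12, 1, 0), (40, 0, 2), (41, 2, 0)]
departures = [(15, "stopA"), (45, "stopB")]
records = match_journey(door_events, departures)
```

In the real system the emitted record would also carry the route, direction, and operating-day metadata resolved from the GTFS Realtime feed.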
Now, the reasoning why we need this: consider someone who lives maybe not in the city center but a bit further away, and who travels in a bit of a peculiar manner. Let's say they have shift work and travel at noon. At the stop they use, no one else ever gets on that particular route in that particular direction at that particular time except them. No one else gets off the bus at that time. So if you learn that pattern, if that accurate information was public, you could stalk them and figure out: okay, now their house is empty, and so on. To combat that, I've understood that people often just bin the values. So, for example, if there are 5 to 20 people in the bus, then it's "many seats available". In the GTFS Realtime standard, the occupancy status field has these ordered categories from empty to full. And the thing is that that's not really anonymization, because when you switch from one category to the other, you're still leaking information. So the method that they created is based on differential privacy; we believe it is the first differential privacy method for automatic passenger counting in public transport. And I'm really glad that these researchers made this effort for all of us, and it's all open source. I think it deserves a round of applause. It's also very simple. The upper case here is the one where you have no anonymization except the binning: once you go from four people to five people, you switch from "empty" to "many seats available". Now, how their method works is that they take the vehicle model, the seats and the standing places, as input (this upper CSV file; I'll just give you a moment), and they adjust these boundaries so that they match the differential privacy condition. So I'm not an expert on differential privacy, but I'll hedge anyway at how it roughly works: we're actually using epsilon-delta differential privacy.
But in epsilon differential privacy, you have a small value, epsilon, that you can choose, and it controls how private versus how usable and accurate your output data is. Epsilon bounds the probability of figuring out an individual from the data set, or rather, of telling whether a result was produced by a data set with a particular individual in it or not. That probability difference is very small and controlled by epsilon, and the delta parameter relaxes the condition a bit. So the black areas here have a probability of exactly zero; that's the delta in action. Otherwise you would have these violet-purple bars stretching quite far along. For example, here's how you interpret this (it's the CSV file visualized): if you have seven people, you have a small chance of publishing "empty", a large chance of publishing "many seats available", and no chance of publishing any of the other categories. So we want a system where, if the accurate value would be "many seats available", we don't accidentally publish "full". Computing these profiles is quite intensive; it takes many hours. There may be various optimization possibilities in the algorithm, but it only needs to be done once per vehicle model. And then you have this small CSV file, a table of probabilities that you just sample from every time you need to publish a result at a stop. So it's very, very fast in use, and you can just plug it in if you already have another system like the one above. All right, this has been a trip through these highlights. Check out our API spec, especially if you're interested in creating these kinds of counting devices. Please try your hand at it. The buses are dirty and they are dusty and they're shaky, but otherwise you can use whatever methods you have available. Also, if you haven't yet got your own APC system, check out our backend code, or maybe our architecture and this idea of having only minimal data from the APC vendors.
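The "small CSV file of probabilities" described above can be sketched as follows. The numbers in the toy table are made up for illustration, not the researchers' actual output; the real table is fitted per vehicle model so that the sampling satisfies the (epsilon, delta) differential privacy condition. What matters structurally is that each true passenger count maps to a probability distribution over the published categories, and publishing is just one cheap draw from that row.

```python
import random

CATEGORIES = ["EMPTY", "MANY_SEATS_AVAILABLE", "FEW_SEATS_AVAILABLE", "FULL"]

# One row per true passenger count: probability of publishing each category.
# Toy numbers only; note FULL has probability exactly zero for 7 people,
# mirroring the "never accidentally publish full" property in the talk.
TOY_TABLE = {
    0:  [0.90, 0.10, 0.00, 0.00],
    7:  [0.05, 0.85, 0.10, 0.00],
    30: [0.00, 0.10, 0.60, 0.30],
}

def publish(true_count, rng):
    """Draw the occupancy category to publish for the true count."""
    probs = TOY_TABLE[true_count]
    return rng.choices(CATEGORIES, weights=probs, k=1)[0]

rng = random.Random(0)
sample = publish(7, rng)
never_full = all(publish(7, rng) != "FULL" for _ in range(1000))
```

This is why the scheme is fast in use: the expensive optimization happens once per vehicle model, and each stop-level publication is a single table lookup plus a weighted sample.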
See if it's attractive for you. And if you already have an APC system, please do use the anonymization method created by the researchers. If you have further questions after this, you can contact me at that email. There are a few of the links; they're also on the talk page. And I'm not yet sure what else will be behind transitprivacy.org, but right now it's just a link to the tool that the researchers created. That's enough of the monologue. Let's start the dialogue. So, for public transport, I guess it's very important to easily detect when a route is being inefficient, maybe just moving air: for example, if at some end of the route the bus or tram or whatever is mostly empty. How does this anonymization algorithm make it harder to detect when some public transportation is being underused? So the question was whether or how this anonymization will affect the public-transport-planning use case of figuring out whether the vehicle capacity should be reallocated. In our architecture, the public transport planners get the accurate data into their analytics. So the anonymization happens after, and it's only for the open data part. Can you speak about your experience with the microwave-based sensors? Oh, the millimeter-wavelength radar. I have no clue. We gave these companies a lot of leeway, and they produced their pilots, and we don't have insight into how exactly that works. Any insights about the results of their pilot, or not? The insight was that, thus far, the results were not good enough to be shown. You haven't actually mentioned a great deal about how the counts are actually achieved. Sorry, I can't hear you. You haven't explained much about how counts are achieved. The technology has evolved enormously over decades. In the past you simply used to weigh the carriage. Now, obviously that could be distorted by adults, children, people with luggage, whatever. And then you don't even know where they are on the train or in which carriage.
So you don't know the individuals, though. Then they looked at things like light sensors as you enter and exit. Then they looked at things like whether you're connected to the internet, and counting the number of people who did that. I suggest that they actually work on facial recognition, not against a database, but simply counting the number of unique faces in a carriage at a time. Then you can track as they move around and work out what the behavior is. What's your thought on that? All right, thanks. That was a brief history of the different kinds of technologies used for detecting passengers and objects, and then a question of whether facial recognition would work. I'm not sure; it sounds like it could effectively work. But it's very tricky to communicate it to the public in a way that is understood correctly. Like, for example, that we don't send anything other than plus one and minus one from these vehicles onwards. Yes, it's back. About that facial recognition: you don't have to go directly to that. There is so much more that you can do with your counting mechanisms; even open-source models can really do much better without having to get any facial information. If possible, just actually rule that out; you don't really have to go through that. I'm working on modal share counting, and we're doing that for cycling and also for passenger counting and such. You don't need to actually mine all this information about the characteristics of the people to do the tracking algorithm yourself. It's not really necessary to get there, and that will also ease the communication. Thank you. The comment was about how the open-source object detection and object tracking algorithms are already quite fine without facial recognition. Yeah. What about calculation of CO2 in the carriage? Sorry, I can't hear you.
Calculation of CO2 in the carriage: when people are breathing, they affect the air, so you can measure the CO2 transfer and calculate how many people are in the carriage. Studies of this have been done for COVID, for example, and you could reuse those COVID studies. My next question is regarding the use of security cameras for counting people. Do you have any experience with the producers of those systems? The moment you use the camera for a different purpose, the warranty is gone; we have this problem. They make you say: okay, we will never use it for purposes other than checking security. This is a big legal constraint that you have in procurement. Yeah, a good comment on security camera warranties. I remember hearing discussions about that, but I don't have any proper answers about what the security service providers think about using their camera feeds for something else. Okay. Thank you. Thank you.
MATSim at SBB: Using and contributing to the open-source transport simulation for advanced passenger demand modeling.
Thank you, Peter. Yeah. So today I want to talk a little bit about MATSim. MATSim is a transport simulation software that is being used at SBB, the Swiss Federal Railways, but it is actually an open-source tool that has been around for quite some time. So obviously there will also be a little bit of talk about MATSim itself, the software, what it does, and how you could use it if you are ever interested in that. That's on the agenda, so I'll briefly explain what MATSim does and why we find it useful at SBB. And we also contribute actively to the MATSim code, and I'll give some examples of that. And yeah, since you might wonder why on earth we are even bothering, I'll give you some examples of our work with the software. So what is MATSim and why is it useful? If you have that elevator-speech moment where you have to explain your work to your CEO, and they ask you what you're doing, then I tend to say I'm playing SimCity, but with complex econometric data behind it, so you have all these weird formulas somewhere in there. Then the elevator ride is over, and I have more or less explained what I'm doing, and the CEO knows that we have some guy playing SimCity all day. Well, there's a bit more behind it, but in brief that's what we're doing. So we are simulating transport, and we simulate people's behavior using transport during the day. MATSim stands for Multi-Agent Transport Simulation, and it has been around for roughly 20 years. It started as a purely academic project between ETH Zurich and TU Berlin. On a side note, that also explains why a Berlin guy is now living in Switzerland.
So you can kind of imagine my background, but it has evolved over the years, and there are many models around the world, and quite a few of them are actually fully built on open data and are publicly available, not ours, for some reasons. But, for example, there's quite a nice scenario of Berlin that you can download; you can see the data and where it comes from, and you can start playing with the model. Whether this is useful for anyone, I can't say, but I think it's useful for some, mostly PhD students, to be fair. There are commercial users around the globe as well. Among them, besides SBB, there's Volkswagen, who have quite a strong development effort, also in the MATSim core, but they're not as open to talking about that as we probably are. Then there are models in Melbourne, and there's one at the Berlin transit agency, so it has some standing right there. There's a book, there's code, there's a license, and for the last couple of weeks there's also an association that kind of brings the whole thing together. Now, how does it work? So imagine you have a lot of data. You have census data, for example register data, so you know where people live in a city, or you just make that up and place people somewhere. You have econometric data, that is, values of time: you know what a person's time is worth. If they travel by train, then the value of time is maybe six euros per hour, and if they go by car, then it's maybe 10 euros per hour, or the other way around. You have a road network that can come, for example, from OpenStreetMap; that is a very typical use case. You have a timetable for public transport, typically GTFS. You have count data. Many of the topics discussed in the previous talks are actually input data for us, and that is a lot of input data.
What we do then is add some generic algorithms that basically randomly tell people during the day: change your route, change your transport mode when you go from one activity to another, or change your departure time. And then we let the same day run again, a bit like Groundhog Day, 200 times, 500 times, and mix people up and let them try out new things. This is what we call the MATSim loop; it's also somewhere on my T-shirt. What comes out of it is output data, even more of it. You have individual daily plans: you know what your synthetic population is doing during the day, where they go shopping, what transport modes they use. You have the mode choice for each trip, whether people tend to take the car to get from A to B or public transport, depending on the offer. You have time-resolved traffic loads, so a lot of data to analyze for your policy planning. You have distances, you have all kinds of aggregate data that you can then use and play with. Obviously, the calibration process, so that the model really depicts the real world in its initial stage, is the long story behind model building. What can you use the whole thing for? Of course, transport policy evaluation: what happens if there's a new road, what happens if there's a new railway line, what happens if there's a new price? You can do it person-specific: you know who's affected by a transport policy, because you have this agent-based paradigm behind it. You can also calculate, for example, accessibility. A lot is also happening where MATSim is being used for on-demand transport modes. You can really do your fleet scheduling, your fleet planning. You can say: okay, what happens if we have a lot of automated vehicles that replace passenger cars, and what's the advantage of that? All these kinds of future scenarios you can use MATSim for; well, basically playing SimCity.
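The loop just described can be sketched as a toy co-evolutionary simulation (this is a drastic simplification with made-up utilities, not MATSim's actual Charypar-Nagel scoring or queue simulation): each iteration, a fraction of agents considers a random mutation of its mode choice, scores it against the current congestion level, and keeps the better plan; repeating the "same day" many times drives the population toward an equilibrium.

```python
import random

MODES = ["car", "pt", "bike"]

def score(mode, congestion):
    """Toy utility: car suffers from congestion, pt and bike are fixed."""
    base = {"car": -10 * congestion, "pt": -6, "bike": -8}
    return base[mode]

def run_loop(n_agents=100, iterations=200, replan_share=0.1, seed=1):
    rng = random.Random(seed)
    plans = ["car"] * n_agents  # everyone starts by driving
    for _ in range(iterations):
        congestion = plans.count("car") / n_agents  # share of cars on the road
        for i in range(n_agents):
            if rng.random() < replan_share:  # this agent replans today
                candidate = rng.choice(MODES)
                if score(candidate, congestion) > score(plans[i], congestion):
                    plans[i] = candidate  # keep the better plan
    return plans

plans = run_loop()
car_share = plans.count("car") / len(plans)
```

Even in this toy, the Groundhog Day repetition is visible: starting from everyone driving, agents leak away from the car until its congestion-dependent utility balances the alternatives, which is the intuition behind letting agents "try out new things" over hundreds of identical days.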
The MATSim project has been around for almost 20 years, and historically it has been administered by the universities, so ETH Zurich and TU Berlin. Professors grow older, and at one point they retire, and the person who comes next is maybe not as interested in transport simulation anymore. So since last year, the whole MATSim project sits on an association level, so that there's also some funding from other users to maintain build servers and all that stuff. The association also organizes things like the user meeting that is held annually, keeps track of all kinds of developments, publishes a newsletter, all that kind of stuff. Now, at the Swiss Federal Railways, how did we start with that? It's a very brief timeline on one slide, but I think it's kind of interesting. In 2016, our CEO saw a presentation about MATSim and decided: we need this at SBB, please buy MATSim. Well, as it happens with open-source software, buying the whole thing wasn't as easy, and the whole procurement process didn't quite work out, so the task was delegated to the department that deals with classical transport models, somewhere in the passenger division; this is also where I'm working. That came with some challenges: for example, you needed someone who knows Java programming, and those people didn't exist in that department, but such things can be overcome. And you actually need proper computers, because if you want to run proper big models, then having a nice tiny laptop isn't sufficient. That was also something to overcome, also thanks to the IT in Peter's department in the end. At least we didn't kill it. At least they didn't.
Building a model for Switzerland in MATSim took three years, from 2017 to 2020, and along the way we noticed that several additions to the code base were needed to make this a useful project for us. At one point you have to decide: do we commit this back into the MATSim core, or do we keep it in our secret chamber? Luckily people chose wisely, and it's all open source; this was actually a management-backed decision, so I'm very happy with that. In 2018 the first release of the model I'll present in a moment came out, and since 2020 we have an annual release cycle for a transport model of Switzerland that is multimodal and MATSim-based and can be used for all kinds of policy studies. Of our contributions to MATSim I just want to showcase two. The first is called, oddly, the SwissRailRaptor: a RAPTOR-based public transit router that works really fast, because if you want to route millions of people within a reasonable time frame, this is what you need. Compared to what we had before it was many, many times faster, and the whole simulation was sped up by a factor of three. What is also important to say: the SwissRailRaptor is a Java package that uses MATSim data structures, but you don't actually have to use it for MATSim problems. In a way it can be used instead of OpenTripPlanner; OpenTripPlanner has other advantages, but if you really need to compute a lot of routes at the same time, it's something to look at.
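To give an idea of the round-based RAPTOR family of algorithms behind the SwissRailRaptor, here is a heavily simplified toy version in Python (the real contribution is Java and far more capable): no footpaths, and each stop time doubles as arrival and departure time. Round k finds the earliest arrival reachable with at most k vehicle trips.

```python
from math import inf

def raptor(timetable, source, target, dep_time, max_rounds=3):
    """Toy RAPTOR-style router. `timetable` maps a route id to
    (stop_sequence, trips); each trip lists the time the vehicle serves
    each stop, and trips must be sorted by time."""
    best = {source: dep_time}   # earliest known time per stop
    marked = {source}           # stops improved in the previous round
    for _ in range(max_rounds):
        new_marked = set()
        for stops, trips in timetable.values():
            boardable = [i for i, s in enumerate(stops) if s in marked]
            if not boardable:
                continue
            i0 = min(boardable)          # earliest boarding position
            for trip in trips:
                if trip[i0] >= best.get(stops[i0], inf):
                    for i in range(i0 + 1, len(stops)):  # ride onwards
                        if trip[i] < best.get(stops[i], inf):
                            best[stops[i]] = trip[i]
                            new_marked.add(stops[i])
                    break                # earliest catchable trip found
        marked = new_marked
        if not marked:
            break
    return best.get(target)
```

The round structure is what makes RAPTOR fast in practice: it scans each route at most once per round instead of maintaining a priority queue like Dijkstra-based routers.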
We now have a fairly fancy routing algorithm: it knows range queries and intermodal access and egress. That last point is very important, because if you want to model stations you really need an idea of how many people arrive by foot, how many arrive by bike and so on, and that is not an easy question to answer. Apart from the routing problem, which is already quite complex, there's also the fact that you don't have much empirical data about it. But it's really one of the most useful features of our model by now, so we're happy to have it, and as I said, you can use it independently of the rest of MATSim, so it's well worth checking out. Then there's another contribution where I was more deeply involved. The traffic flow simulation in MATSim is typically queue-based; I won't go into detail because I don't have enough time for that. We replaced it with something that makes the whole simulation process roughly two times faster. It's called Hermes, because, well, it can fly now, but it has less pluggability, so depending on the use case of your simulation you can use one or the other; they're kind of interoperable. Together, the router and Hermes bring simulation run times for the Switzerland scenario down to something like 24 hours, and since these are typically AWS instances, it actually saves a lot of money to have models that run reasonably fast; for a calibration process you maybe need 50 of those runs, so it adds up. Now, what do we use MATSim for? First of all there's a model called SIMBA MOBi. This is where I'm the product owner, so I know a lot about it. It literally depicts the everyday mobility of eight and a half million people in Switzerland, so basically the whole population. It includes all major transport modes, walking, cycling, taking the bus, taking the car, and it has a representation of the transit schedule
obviously, and also the whole road network. Since it uses MATSim, the behavior of the agents is microscopic, including first- and last-mile decisions. I hope the video works now. Yes, it does. Right now, imagine it's 8 a.m. in Switzerland: you see people as blue dots being at home, and those light blue dots are people starting their work time. Now we zoom into a region somewhere around Zurich and see what people do there. To get from one place to another they need to travel, obviously, and they can travel by car, then they're in those little grey boxes, or take the train or public transport, then they're in those little red boxes, and they get from one place to another. Obviously you can run your analysis on that: some public transport vehicles are more crowded, some less, and you can sum that up over the day and see what's going on. Now we zoom in again, and you can see who's alighting at certain stations and what kind of passenger groups they are: do they have a regional subscription, a half-fare card, or ordinary tickets, for example. On the highway you can see, over the day, how many people are currently on their way to work, how many are on their way home, and which vehicles are trucks or doing other things. This is all in the model, and you can analyze a lot of things. We are a bit more tied to public-transport-related analysis: station access and egress differ from place to place, so in Dietikon, for example, more people arrive at the station by bus than in Aarau, and if you take the city of Baden here, you can see that people who reach the station from nearby typically walk, and people who come from farther afield take the bus. These kinds of analyses are really useful, for example, for station design and planning. Typical use cases for the model are then also the
development of rail lines, the design of stop locations, and the effects of timetables. It's nice that you created a nice timetable, but who's going to use it, and how many people will be on the trains? That's an answer we can give. And we can analyze what's happening around the stations; that was the video I showed. We can also see the effect of certain land-use policies. We don't only have a model that depicts today but also ones for 2030, 2040, 2050, so we know how, according to today's assumptions, Switzerland is going to evolve, and then we can do policy planning with these future scenarios. Just one example: over the next 20 years there will be roughly 20 to 30 new railway stations opening up, mostly along existing lines, and very often these stations are being built because something is happening around them: a new development, new housing, or a new commercial area. Just like in SimCity, we can add those little houses into the model, add people there, and give them daily plans. This example is in the city of Sankt Gallen, where not a new station is planned but the moving of a station to another place: the station goes from the left to the right, and then lots of houses are built there. With the tool we can say: before, at both those stations there were roughly 4,000 passengers a day, and now it's roughly 6,000. That would be the effect of the things happening there, and these numbers help you dimension those stops properly. Another application, which doesn't come from my department, so please don't ask me questions about it, but which I think is interesting enough to present here, is that we also want to know in depth what's happening along the railway corridors. MATSim has a mobility simulation, I talked
about that earlier, and my colleagues decided: we can replace this with something we call RailSim, which actually has tracks, signals and blocks in it, and we can start playing around with that and do rough capacity planning on a much simpler level than it is usually done. You still don't need to know every signal and every switch on the tracks; you roughly need to know, for example, whether a track is single track or double track. The outcome of this is also a little video. You have two trains, one coming from down here: it currently has a speed of six meters per second and is accelerating to 11 meters per second. And there you have a train that is at six meters per second and wants to accelerate to 14 meters per second. This train wants to go this way, that train wants to go that way, there's a station, and they interfere at one point; obviously we don't want them to crash. As you can see, the red lines are the blocks that are being reserved in front of the train. The faster the train, the longer the braking distance, so the more blocks are reserved. Then you can see that the train coming from the right has the right of way; the left train gets a red signal and is braking. Now the right train has passed, the switch goes to green, and the other train can enter the station. Obviously you can connect that with the rest of MATSim, so you know how many passengers are on the train, and then you can do your policy planning again: if there's a heavily delayed freight train that would generate such and such an amount of money, maybe you want to accelerate it, but then you see, oh no, it interferes with all our daily commuters, and they would be very angry. So you can do your policy planning around
this. It's still at an early stage, that microscopic railway simulation, but ultimately this is where we want to go. It's also released as a MATSim contribution called RailSim, so it's part of the MATSim code and everyone can use it. I think it's the way to go in that direction as well, but please don't ask me too many questions about it; I can connect you to the people who know. So, to wrap up: MATSim has helped SBB massively to understand customer behavior, and committing to open source has, from our point of view, really paid off and is the way to go for us. These models are very, very complex, but that is what they are, no matter whether you use commercial or open source software. Oh yes, that one has to come too: if you want to know more about MATSim, there's an annual user meeting; this year it will be part of the hEART transport conference, on the 17th of June at Aalto University. Thank you, and I'm happy to take questions, even though I only have five minutes. Thanks a lot for the presentation. Quite a bit of the transport systems have a historical background; for example, some are based on industrial needs from maybe 20 years ago, some are related to new developments in the city, and some of these things are also represented in OpenStreetMap and those kinds of resources. Are you able to extract amenities, maybe, or a historical sense? Or are the distribution and the sociodemographics something that you get as a matrix? How do you get this kind of distribution? And second, do you import GTFS and that kind of stuff? Yeah, both good questions. For the first one: we have the census data from the Federal Office of Statistics in Switzerland, and they also have an idea of how it will look in the future, so we are in a very lucky situation; there's a lot of data available, and publicly. In fact, there's also a transport model
available for Switzerland publicly, but unfortunately with a closed-source license where you need software that costs roughly 10,000 euros a year, so that doesn't help a lot. So we'd rather build our own models. And the other question: yes, typically one would use GTFS as the main data source for public transport data. Since we are the railway operator and have timetables in all kinds of formats, we use a different one, but if you were to build a model for MATSim, typically you would use OpenStreetMap and GTFS. For the Swiss transport model, how do you do that, there are three million agents being simulated? Eight and a half million, yes. So how much can it scale, how many agents can you have in one simulation? The models do scale, but there's an upper limit to what is useful. Switzerland is still a useful regional scope because you have many long-distance commuters, but if you were using MATSim for really long-distance choices, then you would simply remove everyday commuters. In a previous life I created a model for Sweden that also worked, with roughly the same number of people, and there are simulations for cities in China with 20 million people. What you can also do is cut down the number of agents that you simulate and scale network capacities accordingly, so there are ways to deal with that. You mentioned that you can feed OpenStreetMap data into MATSim, but does MATSim also provide tools to add new assets or population models? There are tools that allow the addition of new people if you don't want to hard-code it. Some of them are commercial; there are spin-off companies around MATSim who provide this as a service, but you can also do it on your own; it's just
basic Java or Python code that you can use for that. And if you do transport modeling for public transport, you would probably rather edit the GTFS than edit the MATSim schedule. I think we have time for one more short question. How do you determine the accuracy of your model? Ah, that is another talk of an hour. Getting models right and calibrating them properly mostly requires count data, and in Switzerland we have something called the Mikrozensus, where we ask people every five years about their mobility behavior. That is very accurate and has a lot of data that is useful to calibrate models. But it's always a fair question, if someone presents you a transport model, to ask: how is it calibrated? Okay.
Bending geographic maps for enhanced railway space-time diagrams
Hello everyone. My name is Alexis, and I develop data visualization web applications at OuestWare. We do a lot of open source things, and I'm totally not a train person initially; I'm still not a train person, actually. But since early 2021 we have been working for SNCF Réseau, the firm in charge of the French train infrastructure, and we started to contribute to OSRD, which I guess has been presented here today already. Not yet? Okay. So OSRD is an open source railway designer: an open source application to simulate trains on real or edited infrastructures. It's kind of amazing. The interface is web-based, the project is huge, and a good part of the team must be in the room, I guess. You can check it out. At some point we at OuestWare were tasked with enhancing the space-time diagrams. What are space-time diagrams? First of all, not everybody agrees on what they should be named: circulation diagrams, graphical timetables, or train graphs, which is actually a nice name, but I'll stay with space-time diagrams. They were probably invented by a French engineer, Charles Ibry, in the early 1840s. This engineer was in charge of scheduling the trains between Paris and Rouen, and he used this very smart chart I'll describe right after. Some people think it was actually a Russian military man; there's another lead, and it's not clear who invented it. Let's stay on this track. Horizontally you see the time, in hours of the day, and vertically you see the list of the stations from Paris to Rouen. Can I zoom in? Okay, nice. Each train is a line on this diagram, and you can read a lot of information just from those lines at this scale. Basically, the more vertical the line, the faster the train goes.
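The reading rule just stated is pure slope arithmetic: with time on one axis and distance along the line on the other, the slope of a train's line is its speed. A tiny sketch (toy numbers, not the historical chart's values):

```python
def diagram_speed(p1, p2):
    """Speed read off an Ibry-style space-time diagram: given two points
    (time_h, distance_km) on a train's line, the slope distance/time is
    the speed, so the steeper the line, the faster the train."""
    (t1, d1), (t2, d2) = p1, p2
    if t2 == t1:
        return float("inf")  # a perfectly vertical line: infinitely fast
    return (d2 - d1) / (t2 - t1)  # km/h
```

A horizontal segment gives speed zero, which is exactly the "the train doesn't move" case described next.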
When the line is horizontal, it means that at different times the train is at the same position, so it doesn't move. When two lines cross, it means there are two trains at the same position on the line at the same time, which is only possible if they are on separate tracks. So if I read this map, for instance, I can tell that there are probably two different tracks, one for each direction, and probably no more. I know this because trains that don't go in the same direction can cross kind of anywhere, here or here or here, but when they are in the same direction, one train has to stop in a station, like here, or somewhere around here, et cetera. It's kind of crazy, all the information displayed in such a simple diagram. The thing is, I'm not a train person, but I have known this diagram for a long time, because it's actually on the cover of one of the data-vis reference books, The Visual Display of Quantitative Information by Edward Tufte. And it's still used today. There's a reason why this is a screenshot from OpenTrack and not from OSRD; I'll come back to it later. But OpenTrack is another software package to handle trains, and it still uses this kind of diagram. And it becomes even better once we introduce blocks. When people started running trains on tracks, it was kind of easy, because basically there were not enough trains to worry about collisions. But at some point, a train goes fast enough and is heavy enough that when the driver sees a danger on the track and starts braking, the train won't stop before the collision. So people had to find solutions for this. I'm going to oversimplify how it works, since I'm not a train person. Basically, the track is split into blocks, only one train can be in one block at any given time, and there's a signal at the entrance of each block. If there's a train inside the block, the signal is red, so you cannot enter it.
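The physical reason blocks exist, as described above, is that braking distance grows with the square of speed and eventually exceeds what the driver can see. A sketch of that check, using the standard v²/(2a) stopping distance (the numbers in the test are illustrative, not real rolling-stock figures):

```python
def can_stop_in_time(speed_ms, decel_ms2, sight_distance_m):
    """Why fixed blocks exist: braking distance is v^2 / (2*a), so above
    some speed it exceeds the driver's sight distance and line-side
    signalling must warn earlier. True if sight braking still suffices."""
    return speed_ms ** 2 / (2.0 * decel_ms2) <= sight_distance_m
```

At 10 m/s a modest 1 m/s² deceleration stops the train in 50 m; at 50 m/s the same braking takes 1,250 m, far beyond typical sight distance, hence signals and blocks.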
If the signal is orange, it means you must be prepared to stop, because there is a train in the next block, basically. The thing is, when a block is occupied by a train, it means that during some time span and over some distance, there cannot be other trains in this block. So basically, the occupancy of a block by a train is a rectangle in the space-time diagram, and when two rectangles collide, that's bad. Here is what it looks like in OSRD: the red rectangles are the blocks occupied by a train, and here I started a simulation and dragged this train so that there was a collision. It's really easy, graphically, to see that there will be two trains in the same block at some point. As a data-vis person, I think that's kind of amazing. But how can we make this even more informative? People from OSRD asked us: vertically we just have the list of stations or points of interest, but we would like to bring more information into this. And we thought, let's start digging: who else does this kind of thing? So we started looking into other transportation systems where people have to see how they travel along a line. Here is what it looks like when you are inside the RER D. This is a train that goes from the northern Paris suburbs to the southern Paris suburbs, through Paris, and when you are inside this train, you have this synthesized diagram. It's nice because it brings only the information you need, the list of stations, but also some interesting things, like where you can switch to other transportation systems, et cetera. This is nice, but what Loïc wanted us to do was to show the exact infrastructure, to see exactly what the tracks on the line are at any given point. That would have required us to know the whole infrastructure and to do heavy computations, and at this point we planned to do this as a front-end-only feature. So we kept digging, sorry, yeah.
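The block-occupancy conflict described at the start of this passage reduces to axis-aligned rectangle intersection in the (time, distance) plane. A minimal sketch (not OSRD's actual conflict-detection code):

```python
def occupancies_conflict(a, b):
    """Block occupancy as a rectangle in (time, distance): each occupancy
    is (t0, t1, s0, s1). Two trains conflict when their rectangles for
    the same block overlap; touching edges do not count as overlap."""
    (at0, at1, as0, as1), (bt0, bt1, bs0, bs1) = a, b
    overlap_time = at0 < bt1 and bt0 < at1
    overlap_space = as0 < bs1 and bs0 < as1
    return overlap_time and overlap_space
```

This is why the collision is so easy to spot graphically: two red rectangles overlapping on screen is literally the condition being tested.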
So we could render anything we want there, but we'd need to know the exact topology and do heavy computations, so we kept digging to find something else. Okay, sorry. On top here is the actual geographic map of the bus in Paris. When you take bus 58 in Paris, you have this map below. The thing is, as you can see on the top map, this line is anything but straight, yet it appears absolutely straight here, you see. And this is kind of amazing, because you normally cannot bend things in cartography, but that's what they did, probably by hand. They obtained this nice map with very identifiable areas; you can see all the streets, you can see a lot of information, but you still know that you are going basically from left to right or from right to left. And it works. There is a trade-off: we have to show everything a map would show, so we cannot keep only exactly what we would like to display, as we did with the schematic, because we have to take everything. But the good point is that showing everything a map would show means we have all the context around: for a train, the cities, the buildings, the places that are near the line but not exactly on it, et cetera. This is actually called a strip map, and it has existed for quite a long time; we've seen some very old examples like this one. And it has already been used within space-time diagrams. This one comes from the Russian military: it's trains between St. Petersburg and Moscow, and on top of it, not vertically aligned but on the left, you can see the whole itinerary with a lot of information around it, like the sea next to St. Petersburg and other identifiable points, et cetera. It brings a lot of context. So, let's bend geographic maps. The strategy we used was to generate a grid made of triangles along the path, and then to generate another grid, which is totally flat.
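One building block of the grid construction described next is drawing, at each step along the simplified path, a segment perpendicular to the local direction. A small sketch of that step (my own illustrative geometry, not OSRD's code):

```python
from math import hypot

def crossline(p, q, half_width):
    """At point p of the path, return the two endpoints of a segment
    perpendicular to the direction p -> q, extending half_width to each
    side. Triangulating between consecutive crosslines yields the bent
    grid that follows the path."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    n = hypot(dx, dy)
    nx, ny = -dy / n, dx / n            # unit normal to the path direction
    return ((p[0] - nx * half_width, p[1] - ny * half_width),
            (p[0] + nx * half_width, p[1] + ny * half_width))
```

For a path heading due east, the crossline is a vertical segment centered on the path point, which matches the intuition of the grid hugging the route.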
When we want to translate a coordinate from the normal geographic system to the bent system, we just find which triangle it is in, and then translate it from one triangle to the other, which is easy to do. So let's take a path, from Nantes to Angers in France, and generate a grid around it. Basically, I simplify the path a bit, take regular steps, draw a line crossing it perpendicularly at each step, and then draw triangles from that. But I have two problems here. First, there are points that lie in multiple triangles, and this is bad. Another issue is that I have large triangles touching really small triangles, which means that in the final map this kind of distortion wouldn't be very smooth. So we smooth the grid by running a few relaxation steps: I move each point to the barycenter of its neighbors, something like this. Then we index all the triangles in a quadtree, so it's really fast, given a point, to find the nearest triangles, and among those the one that contains the point, et cetera. And I build the regular grid on the right, so each triangle exists in both grids. At this point, yay, we have a projection. So that's what I said: if I have a point P, I find the quad that contains P, I look for all the triangles that collide with this quad, I find the one that contains my point; there's a triangle with the same ID in the straight grid, so I just find that triangle and use the barycentric coordinate system to translate from one triangle to the other. Also, I had to actually develop something. We use react-map-gl and MapLibre, because they are already used inside OSRD. For this prototype, basically, we render a hidden map that contains the whole grid, but we don't show it on the screen. We just load every feature we can.
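The triangle-to-triangle mapping at the heart of this projection is standard barycentric interpolation. A self-contained sketch in Python (the actual OSRD UI code is TypeScript; quadtree lookup is omitted here, this is just the per-triangle transform):

```python
def barycentric(p, tri):
    """Barycentric coordinates of point p in triangle tri ((x, y) tuples)."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    px, py = p
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (px - x3) + (x3 - x2) * (py - y3)) / det
    l2 = ((y3 - y1) * (px - x3) + (x1 - x3) * (py - y3)) / det
    return l1, l2, 1.0 - l1 - l2

def project(p, src_tri, dst_tri):
    """Map p from the bent (geographic) triangle to the matching straight
    one: express p in barycentric coordinates of src_tri, then rebuild it
    with the same weights from dst_tri's vertices."""
    l1, l2, l3 = barycentric(p, src_tri)
    (x1, y1), (x2, y2), (x3, y3) = dst_tri
    return (l1 * x1 + l2 * x2 + l3 * x3, l1 * y1 + l2 * y2 + l3 * y3)
```

Because the weights sum to one and vary continuously, points inside a triangle map inside its counterpart, so the bent map stays continuous across triangle edges.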
We use layers from OpenStreetMap for the context and from OSRD for the actual infrastructure, signals, et cetera. Then we wait for the idle event that says: okay, I have loaded everything, I'm ready. So I take all the features and transpose them, or rather project them, since it's a projection. I also have to clip them if they go through the grid or come from outside and enter it, et cetera. Then I can render a new map with the projected features, which looks like this with the grid, and like this without the grid. And we can look at the two maps side by side. That's it: we have what we wanted, a map that shows the full itinerary from Nantes to Angers, and we can still identify things. What I really like with strip maps is that local information remains true: if I'm going from Nantes, at some point here I know that I have the Loire on my right and the Scarke-Fou on my left, and in the bent map that's still the case: the Scarke-Fou on my left, the Loire on my right, et cetera. You preserve local context at the price of having bent lines around. In OSRD, this is how it looks; this is a screenshot, and I hope to show you something that works better in a minute. It brings a lot of context, and when you zoom in close to the train in OSRD, you can see the exhaustive infrastructure, all the tracks. We don't have signals yet, but that will come soon. And it works for almost any path, as long as there are no loops, right? And it does bring context. With the current implementation, we lose the tiling, which means that we have to load everything at once and render the map at once; if I zoom in, I won't get more detail at a better definition. That might come later, and it's a bit slow at the moment, because we have to load and translate everything at once. The demo is going to be really quick. There's just a Storybook; it's in the OSRD UI project.
If you want, you can just ask the OSRD people. This part has been moved out of OSRD, which means that you can actually use it without OSRD data; it's just a React component that embeds some dependencies. This is from Nantes to Marseille, quite a long path: on your right you will first have the ocean, and then later there's Toulouse, there are the Pyrenees, and then the Mediterranean Sea. So it works as we wanted, with lots of context. And also in OSRD, roulement de tambour, drumroll. Okay. This is the path I showed earlier. When I hover over a train on the graph, I can see it on my strip map, and when I zoom in, I get the actual infrastructure. I can see that the train swaps tracks here; that's nice. That's going to be it for the demo. Thank you very much. I can probably take one or two questions; I'll need two minutes. Does this projection look good with satellite imagery, or would it look really strange? Yes, it might look a bit strange. But actually, when the grid is quite smooth, like the one I showed earlier, where the triangles are just slightly bent, it might work. The thing is, I only work with vector data right now, but I could actually project pixels; if I project pixels, though, you will get larger pixels. A real sharp turn would skew things? Yes. Loïc has tried with a path that starts somewhere and passes just next to that same place later, and this is bad. For now. Do you know how these maps were made before, the bus maps in Paris? By hand, I'm quite sure it's by hand, but I don't have any proof. I know that when I saw the amazing schematic maps of the infrastructure inside SNCF, I asked: wow, what's the algorithm? What algorithm? So I bet it's by hand.
MARECO algorithm: how to drive a train using the least amount of energy
Now, a second talk about OSRD, about running time calculation and the best way to calculate a running time that saves energy; Alex will present it. Thank you, Loïc. Hi, everyone, thanks for coming to this talk. Today: how to drive a train using the least amount of energy, with the MARECO algorithm. This talk could as well have been how to drive a bus, or any public transport that has wheels, using the least amount of energy, and actually it's even more noticeable with bikes. I'm Alex Roland, working also at SNCF Réseau on the same project, OSRD, for those who were at the previous talk. Here is our GitHub repo if you want to check it out. I'm going to spend most of the time on one type of graph, not the space-time graph you've seen just before, but this one, called the space-speed graph. It's a very simple graph that represents the speed along the path of the train, from its departure to its destination, through the stops that it might make. On this graph you have the speed limits that apply along the line; most of the time, the speed limits are quite a bit lower in train stations. The train leaves the departure station at a speed of zero, accelerates until it reaches the different speed limits, then brakes to reach the stop at speed zero again, then accelerates, and then brakes again. This is the fastest drive the train can make: it accelerates as much as possible, drives as fast as possible, and brakes at the last moment to meet each stop and each speed restriction. In this case, let's say the departure is at eight, the stop is at nine twenty, and the destination is at ten. The horizontal axis is still distance; I'm just annotating the times because this graph does not show time, and we're going to need it. The problem with public transport is that if the train leaves five minutes late, it won't be able to catch up, because this is already the fastest drive.
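On flat track with constant acceleration and braking rates, the fastest-drive profile just described has a closed form: at each position you are limited by the line speed, by how fast you could have accelerated from the start, and by your ability to brake to zero before the end. A toy sketch (constant rates are my simplifying assumption; real train dynamics vary with speed and gradient):

```python
from math import sqrt

def fastest_drive_speed(x, length, v_limit, accel, decel):
    """Speed at position x (meters) on the fastest drive: accelerate as
    hard as possible, cruise at the limit, brake at the last possible
    moment. Uses v = sqrt(2*a*d) for both ramps."""
    return min(v_limit,
               sqrt(2.0 * accel * x),            # acceleration ramp
               sqrt(2.0 * decel * (length - x))) # braking ramp
```

Sampling this function along x reproduces the trapezoid-shaped speed profile shown on the space-speed graph.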
So it will arrive at least five minutes late at the stop and at the destination. It's also a problem if the driver does not accelerate as hard as the fastest drive assumes: the train will be late even though it left on time. And it's a problem if the driver does not drive at the maximum speed, which, spoiler, happens. So the fastest drive is actually a very bad way to plan trains, buses, or any public transport, because everything can fail very easily, as the examples show. We want this public transport planning to have some margin to it. If I add, say, a ten percent time margin, I want to stop eight minutes later, at 9:28, and arrive twelve minutes later at the destination. Then I have some margin to absorb problems like leaving late or not driving as fast. But we are here to save energy too: we want to add that extra time and also save energy. The good thing is, physics gives us both: if you drive slower, you save energy. Great news. This is due to the different forces that apply to the train when it's running. Let's not care about the weight and ground resistance here; what's important is the solid friction and air drag, which scale with v and v squared respectively, so at high speeds they are much greater. You experience it if you bike: if you ride slower, you use less energy, and the same goes for cars and every transportation system. So let's lower the speeds in a very basic way, with a linear margin: we lower the speeds by the same percentage all the way along the train's path, and then we arrive at 9:28 and at 10:12. Did we save that much energy? Not quite sure. Here is another way to lower the speed and still be on time with the planned margin: we lower the high speeds only this time.
But what I'm going to show you is actually the best strategy to lower the speeds, because there are infinitely many ways to lower the speed and arrive on time. I could also just stop in the middle and then get going again. So I'm going to show you what was published by engineers from SNCF a few decades ago. I think the original paper is from 1979, so before I was born, and it shows the best strategy to run trains in terms of energy consumption. So how does it work? There are four types of actions. Here I'm showing the same kind of graph, but a very simplified one. The train can be accelerating, maintaining speed, coasting, or braking. Coasting means the driver cuts off the traction and the train rolls on thanks to its inertia. Those are the four driving actions that we are going to study. The idea is to study each type of action and see how much energy we can save per unit of time that we add. If we look at the accelerations, we can try to accelerate a bit less strongly than the maximum: up to a speed V0 we accelerate a bit more gently, and then we accelerate at the maximum again. I'm skipping the formulas because it would take too long for this talk, but basically this leads to a nice but small amount of energy saved per unit of added time. If we look at maintaining speed: as we saw, the speed has a huge impact on the air drag, so cruising at a slightly lower speed V1 actually saves a lot of energy per unit of added time, and that's interesting. There are two reasons for coasting. The first one, the small triangle you see here, corresponds to a slope. The driver cuts off the traction before the downhill slope, slows down a little bit, and then accelerates again thanks to gravity in the slope. Over this distance, no traction was used, so it's interesting.
And before braking: if we know we are going to need to brake and slow down, we might as well cut the traction beforehand, and thanks to its inertia the train keeps rolling and loses some speed. Here we have two parameters: the same V1, the cruising speed, and VF, the speed at which we want to stop coasting and start braking. This is also very interesting in terms of energy saved per added time. As for braking, well, no energy is used while braking, so no possible energy savings there. What this analysis shows is that the two most interesting actions are saving energy on maintaining speed and on coasting, which, if we combine them, looks something like this. We want the savings to be balanced, so that the margin is distributed as evenly as possible; we don't want all the margin to be in one spot, for reasons I'll explain a bit later. Then, basically, here is how the algorithm works in the end. We start by computing the fastest drive, and then we run a binary search with iterations. We start with a V1 and a VF that lead to, let's say, this result. We get as output how much time this drive actually takes, and we compare this time to the time we want, the 9:28 and the 10:12 from before. Then we iterate, computing different values, until we converge to the solution that gives the time we want. If we come back to the first example, this leads to something like this, where the higher speeds are lowered and you can see the coasting phases before each braking phase. And we arrive on time, with the margin that we added. Now let's see what it looks like on some examples using an OSRD simulation. Here is a train between Paris and Lyon, so a high-speed train, a TGV. You can see it on the map, and then you have a linear margin. I don't know if everyone can see the green lines, yes? Okay, so linear margin on top and the MARECO algorithm at the bottom.
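The iteration just described, pick a cruise cap, simulate, compare the resulting time to the target, adjust, can be sketched with a toy model. The segment lengths, the speed limits, and the single shared cap V1 are illustrative assumptions; the real simulation also handles VF, coasting and actual train physics.

```python
# Minimal sketch of the MARECO-style iteration from the talk: binary-search a
# cruise cap V1 until the simulated trip time matches the timetable plus margin.
# The "simulation" is a toy piecewise-constant-speed model with made-up segments.

SEGMENTS = [(30_000, 40.0), (50_000, 80.0), (20_000, 60.0)]  # (length m, limit m/s)

def trip_time(v1):
    """Trip time if the driver cruises at min(limit, v1) on each segment."""
    return sum(length / min(limit, v1) for length, limit in SEGMENTS)

def solve_v1(target_time, lo=1.0, hi=80.0, iters=60):
    """Binary search: a lower V1 means a longer trip, so trip_time is monotone in V1."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if trip_time(mid) > target_time:
            lo = mid          # trip too slow: raise the cruise cap
        else:
            hi = mid          # trip too fast: lower the cruise cap
    return (lo + hi) / 2

fastest = trip_time(80.0)     # fastest possible drive on this toy line
target = fastest * 1.10       # add a 10% time margin, like the 9:28 / 10:12 example
v1 = solve_v1(target)
```

The same loop structure works whatever the inner simulation is, which is why the talk can swap in the full physical simulation and still converge on the timetabled arrival times.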
Here the orange curve shows the slopes along the path of the train. On this example you can see the triangle parts, which correspond to the train cutting off traction and then using the downhill slope to accelerate again. And you can see here that it cuts the traction a bit before the final braking. In this case we get 12% energy savings between those two strategies, for the same running time. Another example, between Gap and Briançon, so this is in the Alps, in the mountains. The slopes here are quite strong and there are many uphill and downhill sections. It's interesting because we can use the triangle technique many times, cutting off the traction and keeping up the speed thanks to the slope. As you can see, there are many triangle shapes. And there are more stops here, so also more braking phases that we can use for coasting. In this case it's 13% energy savings, in part because the overall declivity goes uphill, which is not as favourable for this algorithm. Another example in Bretagne, so in the west of France, between Rennes and Quimper, with many stops this time: I simulated a regional train that stops in many cities. So there are many stops, hence many coasting phases before braking, and the overall declivity is quite flat. In that case we get 20% energy savings, which starts to be a lot. A last example near Paris, between Paris and Mantes-la-Jolie. Also many stops, but this time the overall declivity is mostly descending, so it's a very good situation for the algorithm to be efficient. In that case we get 32% energy savings. So, let's plan all the trains with this algorithm! What can go wrong? Well, the MARECO algorithm has some impacts on train planning and operation. I'm going to start with the few downsides. Most of the margin ends up towards the braking phases, because that's where the main coasting phases are.
So we need to use it a bit carefully, especially on long distances. I showed you the Paris to Lyon trip with no stops in between: most of the margin was at the end, which means that if the train leaves late, it's going to catch up near Lyon, but it's not going to be able to really catch up along the way. So it's going to be a bit late the whole time, which is not great. It needs to be used carefully. You can also deteriorate the headway, that is, how many trains can run in a certain amount of time, because the algorithm can lower the speeds a bit too much in some areas. It also assumes that drivers will follow the fastest drive at low speeds, accelerating as much as possible, which is not the case if we study driver behaviour. So in the end we plan trains a bit wrong if we assume that every driver will accelerate using 100% of the traction force. Now the good news. Energy savings: this can be a lot of money in the end for the company. Each percent can be a lot, so imagine 20 or 30%. It's also closer to actual driver behaviour, especially experienced drivers who know the line they are driving: they anticipate the slopes and they cut off traction to save some energy. So it's more similar to real driving than the linear margin. Strong accelerations are better for the headway, especially on dense lines: you want trains to leave the stations as early as possible and then drive at a high enough speed, because trains that drive slowly are really bad for the headway. And coasting before braking, also on dense lines, means drivers approach the stations at lower speed, because they have been coasting beforehand. So they can anticipate and adapt their braking better if there is a train in front of them and they get a caution signal asking them to slow down. And that's it. Thank you. APPLAUSE We have three minutes for questions, I think, here and here. Yeah, you mentioned that braking doesn't cost any energy.
With regenerative braking, it does save energy if you brake more gently. Does the algorithm take that into account? This algorithm doesn't take it into account, because it's too old; I don't think trains that could recover energy while braking were a thing at that time. I personally would like to adapt this algorithm to take it into account in the future, if that becomes one of our needs for OSRD simulations. But yeah, you're right. What about the length of the train? For instance, on a very long freight train you can get very different conditions, because part of it can be ascending while another part is descending. Yeah, so I forgot, I think I need to repeat the questions for the microphone. So the question was: what about long trains, especially freight trains, which can be very long? Well, yes, the declivities, the slopes, can compensate each other along a very long train. This algorithm still works, no matter the length of the train. Because of the binary search, we don't know the exact output in advance: we simulate, we see the total time, and then we adapt the V1 and VF velocities to get the time we want. So it does take this into account, as long as the simulation you use takes it into account. One last question. Sorry. In your graphs the trains are on time, but when you save energy, I would assume from the graph that less distance is covered, because you have speed over distance. Where is the time actually saved? Is it that normally the train would just get to its end station earlier, and now you take that extra five minutes and spread it out by saving energy? No, actually the graphs only show speed and distance. So the time, if you show it, I don't know if there is, yeah. No, we don't have time for this, okay? Sorry. But they do actually arrive on time.
They drive a bit slower, so if you represent it on a space-time diagram, you will see that the curve is a bit more horizontal, because they drive slower. Thank you, Alex. Thank you.
Railway signaling: detecting conflicts in a complex world
Hi, I'm Eunice and I work for SNCF Réseau on the OSRD project. So, standard disclaimer: the opinions in this presentation are my own and not those of the OSRD project, SNCF Réseau or the OpenRail Association. So what's OSRD? It's a railway design toolbox built around microscopic simulation. It allows you to perform operational studies, and also to find last-minute paths through the infrastructure without creating any new conflicts. It's licensed under LGPLv3 and funded by SNCF Réseau, the European Union and the French state. So, a short signalling primer. The main goal is that trains do not crash into each other or derail. The problem is that trains are very hard to stop: they take a very long time to slow down, and they need to know that they should slow down very much in advance. To do that we use signals, and in order to actually use the signals we need to know where the trains are. For that we use track circuits and axle counters. Basically, we divide the infrastructure into zones, and in each zone we can know whether a train is there or not. We call the space between signals a block, and the detection sections zones. Another thing is that a train must not go over a switch that isn't set for it. For that it needs an itinerary through the infrastructure; we call that a route. And the route needs to be established, which means that the switches must be locked in place, before a train can pass the signal at the start of the route. So, for example, this is using BAL signalling, which is the main French signalling system. Here we have a train, and behind it this route is set, so you have one red light, then one yellow light that announces the red light, and then it's okay, it's a green light. But up above, the route isn't set, this switch is dangerous, so there is not just one red light, there are two red lights. That means that under no circumstances may a train pass the signal. The single red light here, a train can pass it, but very slowly. So, we have a number of challenges.
Every European country has its own signalling system, and actually multiple ones. There is a standardization effort called ERTMS, which is actually three levels of signalling systems, and even more complicated than that, but it's not widely deployed yet, and it probably never fully will be, because nobody is going to upgrade a line for no reason. We need to cover every single one of those cases; as far as we're concerned, ERTMS is just another standard. We also need to avoid re-simulating the whole infrastructure every time we make a small change to a train, for example its departure time. In OSRD's STDCM, for example, we use an A* through the graph of time, space and speed to find a path that doesn't conflict with any other train, and at every iteration of that A* we cannot simulate the whole infrastructure, it just wouldn't scale. So we need to be able to model the capacity needs of a train while simulating only that train. Also, most of the application should not need fifteen implementations of everything because there are fifteen signalling systems; it should be very much abstracted. So our approach is that signals have a very restricted view of the infrastructure: a signal only sees what's in front of it, and it sees a linear path until the next signal. Signals see the state of the zones they protect, and they also see the state of the next signals. We give them other metadata, such as the speed of the approaching train or the kind of train; this is useful in some special cases. We also separate the concept of a signalling system, such as BAL or ERTMS, from the signalling driver, which is the actual code that implements the behaviour of a signal. Drivers depend on the input and output signalling systems. For example, here we have a BAL signal that is followed either by a TVM signal or by another BAL signal, and we have two drivers, two modules: one that handles BAL-to-BAL signals and one that handles BAL-to-TVM signals.
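The per-pair driver idea can be sketched as a registry keyed by (signal system, next signal system). The aspect logic below is a deliberately simplified stand-in, not OSRD's actual BAL or TVM rules, and the function and aspect names are illustrative.

```python
# Sketch of the "driver" abstraction from the talk: one small module per
# (signalling system, next signalling system) pair, looked up in a registry.
# Aspect rules are toy simplifications, not real BAL/TVM behaviour.

def bal_to_bal(zone_occupied, next_aspect):
    """Toy 3-aspect rule: occupied -> RED, next signal RED -> YELLOW, else GREEN."""
    if zone_occupied:
        return "RED"
    return "YELLOW" if next_aspect == "RED" else "GREEN"

def bal_to_tvm(zone_occupied, next_aspect):
    """A BAL signal announcing a cab-signalled TVM section; deliberately simplified."""
    return "RED" if zone_occupied else "GREEN"

# The registry: the application dispatches on the (input, output) system pair,
# so adding a signalling system means adding drivers, not rewriting the core.
DRIVERS = {("BAL", "BAL"): bal_to_bal, ("BAL", "TVM"): bal_to_tvm}

def evaluate(system, next_system, zone_occupied, next_aspect):
    return DRIVERS[(system, next_system)](zone_occupied, next_aspect)

aspect = evaluate("BAL", "BAL", False, "RED")
```

The design point is the dispatch: the simulation core only ever calls `evaluate`, so each national system stays a small, isolated module, which matches the talk's "it's basically one class with one function" description.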
And we inject BAL parameters, because this is actually a BAL signal, since the actual lights are using the BAL signalling system. From that we can feed, along the path of the train, the state of each preceding signal: get a state, feed it forward, and we have signalling. But there are a number of problems with this. It's very cool, but as you can see, the actual signal reacts after the passage of the train, which is quite normal, because that's how it is in the real world. Our problem is that trains actually need to see green in front of them: their actual needs are in front of them, not behind them; behind, they don't really care. And we linearized the path, but what is the path of the train that follows our train? We don't know, because, as we said earlier, we are simulating each train alone. So we need to model the capacity requirements of a train knowing only that this train passes. So why do trains conflict? Either they are following too close to each other, in which case they need the zones in front of them to be free, or they have incompatible routes, which means that they need the zones in front of them to have some specific switch configuration in order to proceed. There are other reasons why trains conflict, such as power delivery needs and many others, but we don't handle those, and they have nothing to do with signalling. So what is a spacing requirement? We have a zone, a begin time and an end time; it's quite simple. For a route, we have a setting deadline, which is the begin time, and the actual switch configuration. In order to set a route, you need to know in which direction you are going to traverse each zone and with what switch configuration you traverse it. So how do we get this? Every time a train encounters a signal, we start by assuming the zone in front of the signal is occupied, and we probe the infrastructure linearly until that signal becomes green again.
Then we know that all the zones for which the signal wasn't green are part of that signal's requirement, and we can adjust the begin time of each zone to match the time at which the train saw the signal. And every time a train leaves a zone, it doesn't require it anymore. In terms of routing requirements, most of the parameters only depend on the path of the train. As we saw earlier, the route, the traversed zones, the detectors, which basically indicate the direction in each zone, and the switch configuration only depend on the path of the train. We know all that: we are simulating the train. But to find the setting deadline, we need to know which signal protects the entry of the route, and not only that signal but the signals before it, because, as we saw, a signal at danger can be announced by the signals before the actual protecting signal. So basically we probe the other way: we mark all the zones in the route as incompatible, which means that the route isn't set, and then we iterate through the signals until we find a signal that's green. Good, so now for a train we have its routing and spacing requirements, and the good thing with those is that they are indexable by zone. So we can simulate every train once, keep a database of the requirements, and then we simply need to check, for every zone, whether all the requirements match. Spacing requirements are never compatible if they overlap, and routing requirements are compatible if they go in the same direction and have the same switch configuration. If we add a new train, we only need to check its requirements, and the same thing in the A* of STDCM: if we add a train, we only need to check that the new zones traversed by this A* iteration are actually conflict-free. So, in the future we want to implement TVM support; we are actually in the process of doing that, and it should be done by the end of the month. We also want to implement support for overlaps.
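The spacing-requirement probe described above can be sketched like this. The assumptions are loud: one signal per zone, everything uses the same toy three-aspect rule (occupied zone shows RED, a signal before a RED shows YELLOW), which stands in for whatever real driver is installed.

```python
# Toy sketch of the "assume the zone is occupied, probe until the signal turns
# green again" idea from the talk. One signal per zone, toy 3-aspect rule:
# occupied -> RED, next signal RED -> YELLOW, otherwise GREEN.

def signal_state(zone_occupied, next_state):
    if zone_occupied:
        return "RED"
    if next_state == "RED":
        return "YELLOW"
    return "GREEN"

def spacing_requirement(signal_index, n_zones):
    """Zones that must be free for the signal at `signal_index` to show green."""
    required = []
    for probe in range(signal_index, n_zones):
        # Assume only the probed zone is occupied, then recompute the aspects
        # backwards from the end of the line to our signal.
        occupied = [z == probe for z in range(n_zones)]
        state = "GREEN"  # beyond the last signal, everything is clear
        for z in reversed(range(signal_index, n_zones)):
            state = signal_state(occupied[z], state)
        if state != "GREEN":
            required.append(probe)
        else:
            break  # occupancy this far ahead no longer affects our signal
    return required

reqs = spacing_requirement(2, 8)
```

With a three-aspect rule the probe stops after two zones, which is exactly the "blocks a train needs free in front of it" intuition; a system with more announcement aspects would simply make the probe walk further before the signal turns green again.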
The main problem is that France doesn't use overlaps, which basically are zones that must remain free beyond a signal at danger, in case a train doesn't stop in time. France doesn't use them, but Germany, for example, does, and we do not have any German on the team. Same thing for other countries' signalling systems: we want to implement those, and contributions are very welcome. There is also moving-block support, basically ERTMS Level 3, and to implement that we will probably need another model, specifically for moving-block systems. Thank you for listening. Do you have any questions? Five minutes of questions. Here. Thanks. Another one? You mentioned the different signalling systems and the different operational rules. Can you model these quite flexibly, or, as you said for the TVM implementation, is it manual coding? It's manual coding of the signalling system and the driver. The signalling system is quite simple; it's not actually a JSON file, but it could be. It's just declaring what properties the signals may have, and there is code that checks, when we actually construct the blocks, that they are correct. So basically sanitization of user input. And to make a driver, you just decide what the possible transitions for your system are and you implement them; it's basically one class with one function. So it's actual code, but it's quite simple. So when you do the route planning you know, of course, what rules apply to your train, such as its scheduled times, but I can imagine that one train will occupy a zone for a short time and another will have to wait a little bit for it to become available. So there is an optimization problem there: what do we do? How does this system tie in with the actual timetable planning? So, timetable planning in operational studies is done manually.
That's because the people doing operational studies actually do this manually, or want to do this manually, for now. And in the case of STDCM, we cannot change the path of any other train, because those paths are already sold. So we can add a new train, but it must not interact with any other train. So essentially this gives you a yes or no? Yeah, it gives you a yes, it's possible, and this is the fastest path. Again, for now. For now, yeah. What was the difficulty, was it a challenge, for TVM? Because there is this limit on blocks, with the speeds announced in advance? Well, yes, it was a challenge to integrate into the design, but by now it's not a challenge, it's just developer bandwidth. Does the conductor of the train see some kind of nice map where they can see how fast they can still go? Or do they just see the green light, red light, orange light, double red light, and react to these very basic signals? So, with BAL signalling they only see the green light, white light, etc. TVM is actually a cab signalling system, so the driver sees in the cab what speed he should go. But on BAL, I don't think so; there is no connection, so the driver just looks out the window and sees what he has to do. What challenges do you have now, or how does this simulation help, in case of delays, for dynamically reallocating paths or timetables when delays happen at scale? So, we do not currently support any dynamic simulation, but we plan to. We hope so. And for dynamic simulation you have pretty much the same constraints as when simulating: you need to simulate what state the infrastructure is in at any point, but you also need to know the resource needs in front of a train. So for now these situations are resolved manually, from the control centre? Oh yeah, OSRD is not used in real-time operations for that. I think at SNCF most of it is done by the experience of the regulators. One last question maybe, a short one. Yes, please.
What is the safety requirement of the company? No. Thanks. No no no. That's it. Thank you.
How we at Deutsche Bahn develop IoT use cases quickly and cost-effectively
Okay. Yeah. So great, we managed to set everything up; we have a demo, so we needed to do this. So without further ado, I'm very happy that Holger is here to tell us a little bit about the IoT use cases at Deutsche Bahn. We have a little less time, so please condense it as much as you can, but take your time. Yeah, thank you very much. I will do my best so that we finish this talk just in time. But let us start. My name is Holger Koch and I'm working for DB Systel, the IT company of Deutsche Bahn. I'm the product owner of an IoT product, and I work with applied IoT. And when I'm not doing IoT, I'm a member of the open source working group of Bitkom, the German digital association, pushing forward the open source idea. Some words about my employer: DB Systel is a 100-percent subsidiary of Deutsche Bahn AG and is the digitalization partner for all Deutsche Bahn companies. We currently have 7000 employees and manage over 500 projects and services in the cloud. And if you are looking for a new challenge, please take a look at our website. Okay, let's start. What is the Internet of Things? Here's the definition from Wikipedia, but I would like to describe it in my own words: the aim of the Internet of Things is to measure conditions in the real world, to link and evaluate this information, and ultimately to derive measures from it. We unfortunately have only a little time, but I will try to give you a very deep insight into the practical usage of IoT; for example, we will realize a practical project in this talk. We would like to measure the air quality inside this room, and I think it's also a quite fun topic. We will touch on the themes: where can I get the sensors? How can I transmit the data? And finally, how will the data be processed, stored, visualized and so on? So let's start with the sensors. Where do the sensors come from?
Yeah, after understanding the customer's problem and determining suitable metrics, the question arises which sensors can be used to reliably measure those metrics. Normally we try to buy them from the market and use standard sensors, but from time to time there are no sensors available for a given topic. Then we give contracts to our DB companies, DB Kommunikationstechnik or DB Systemtechnik, or maybe external partners, to develop the sensors to our specifications. And from time to time we do some in-house development, and for this we use sensor platforms, for example Adafruit Feather, Wemos, or Tinkerforge. For our project to measure the air quality inside this room, we use Tinkerforge. We took a look at the Tinkerforge portfolio and found two interesting sensors. One is the Air Quality Bricklet. This sensor measures the air temperature, pressure, humidity, and an air quality index; the air quality index is calculated from measurements of some gases and other values. The second sensor is the Particulate Matter Bricklet, which measures the particles in the air, for example fine dust. Both sensors are connected to a Master Brick, and the Master Brick handles the communication between my laptop and both sensors. We can now take a look: I connect it to my laptop and hopefully you can see this. We fire up the Tinkerforge Brick Viewer, make a connection, and we see all the bricklets and bricks that are connected together, and we can see some values from them. Without writing a line of code, you can do a first analysis: is it possible to measure the right values with these sensors, and is it worth going further? Okay, let's go back to the presentation. The next step is connectivity: how is the data sent to our backend systems? There are a lot of transmission protocols available in the IoT environment; here are the four important ones.
But it's really difficult to pick the right one, because some need extra infrastructure, for example gateways, or have extra costs, such as monthly fees. You also must look at bandwidth, coverage, energy consumption and so on. For our example we only use the Wi-Fi connection of my laptop, so it's really easy. Normally we use Narrowband IoT, because we can't use standard Wi-Fi in the field; NB-IoT is based on LTE and is more or less available everywhere. Okay, we use the MQTT protocol. MQTT is more or less a producer-consumer model. The producer writes some data into a topic on the message broker; a topic is like a directory structure. The consumers can subscribe to exactly this topic, and when the producer sends some data, the consumer can read it, or gets it pushed from the message broker immediately. We use AWS IoT Core for this. IoT Core is a perfect MQTT broker for us because it's fully managed, auto-scaling and so on, and it's easy to work with. Okay, then let's take a closer look at the code. I'm not a programmer, but it's so easy that anybody can work with it. Tinkerforge has a lot of examples, and it's more or less intelligent copy and paste. You take an example; every sensor has its own unique ID, so you can connect several air quality sensors together or something like that. You import some libraries, and here is the important part: we take our certificates for the MQTT communication, and we create two callback functions, one for the air quality sensor and one for the particulate matter sensor. And here you can see it's easy: one call to the library and you get all the information from the sensor. Then we do a little bit of formatting and print it out or write it to MQTT. The same for the particulate matter sensor. It's really easy; there are examples available everywhere. And we can fire this up.
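The callback-then-publish pattern just described can be sketched as follows. The broker call is stubbed out here (in the talk it goes to AWS IoT Core over TLS with client certificates), and the topic, sensor UID and JSON field names are illustrative assumptions, not the talk's actual code.

```python
# Hedged sketch of the demo's pattern: a sensor callback formats a reading as
# JSON and publishes it to an MQTT topic. The broker is replaced by a list so
# the sketch is self-contained; topic and field names are made up.

import json

published = []  # stand-in for the MQTT broker

def publish(topic, payload):
    published.append((topic, payload))

def air_quality_callback(iaq_index, temperature, humidity, air_pressure):
    """Callback shape loosely mirrors Tinkerforge's air-quality values,
    which arrive as integers in hundredths of a unit."""
    payload = json.dumps({
        "sensor": "air_quality_xyz",         # hypothetical unique sensor UID
        "iaq_index": iaq_index,
        "temperature_c": temperature / 100,
        "humidity_pct": humidity / 100,
        "pressure_hpa": air_pressure / 100,
    })
    publish("sensors/room/air_quality", payload)

# Simulated reading: IAQ 47, 21.35 degC, 41.20 %RH, 1013.25 hPa
air_quality_callback(47, 2135, 4120, 101325)
topic, msg = published[0]
```

In the real setup, `publish` would be a paho-mqtt client call authenticated with the certificates mentioned in the talk, but everything else, one callback per sensor producing one JSON message per reading, stays the same.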
We start this Python program, and here we can see the values. The values are formatted into JSON and then sent to our MQTT broker. Okay, let's come to ThingsBoard. ThingsBoard is a relatively new piece of software, started in 2016, but in a short time it has become more or less the market leader among open source IoT platforms. A quick question around the room: who has heard of ThingsBoard? One, two, three. Okay, perfect. It's open source software under the Apache license, and it is an all-in-one solution: all aspects of IoT are covered, everything is accessible via API, so it's really easy to configure your system, with reporting, scheduling, visualization and so on. And the best thing is the rule chain. The rule chain is a little bit like Node-RED: there you can configure whatever you want to control a backend system, for example, if the air is too bad, then open the window. Okay, next step: it's always a good idea, if you use open source software, to take a look at OpenHub. And here are... oh, five minutes left. Okay, it's good software: it has a microservice architecture, and it's really easy to install. And I will show you a little demo. Okay, here's our ThingsBoard system. I fired it up, and first we must create an integration. The integration is the part that subscribes to an MQTT topic on the broker, maybe you remember. I prepared it beforehand, so I don't do it now. The next step is to create a converter; a converter is for preparing the data. Sometimes the data are in degrees Fahrenheit and you would like to have degrees Celsius or something, so you can prepare the data for storage and so on. Okay, the dashboard. We create a new dashboard first, insert, and we add a new widget: temperature, from the air quality sensor. We select the device whose data we would like to visualize and which values we want, the air quality index, and for the first step we would like the temperature.
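To make the converter step just mentioned concrete, Fahrenheit in, Celsius out, here is a minimal sketch. Note that ThingsBoard converters are actually configured inside the platform itself; this Python version, with hypothetical field names, only illustrates the transformation they perform.

```python
# Sketch of a converter: normalize an incoming reading before storage.
# Field names are illustrative assumptions, not ThingsBoard's schema.

def convert(reading):
    out = dict(reading)
    if "temperature_f" in out:
        # Replace the Fahrenheit field with a Celsius one, rounded for storage.
        out["temperature_c"] = round((out.pop("temperature_f") - 32) * 5 / 9, 2)
    return out

converted = convert({"sensor": "aq1", "temperature_f": 70.7})
```

The platform runs the equivalent transformation on every incoming message, so the dashboards and rule chains downstream only ever see data in the units you chose.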
And go on, and so we create our first widget on our dashboard. You repeat this a few times, and then it looks like this. Okay, I'll let it run and we'll take a look at it in three minutes, because the system needs some minutes to measure the correct values; the sensor is somewhat self-calibrating. Okay, back to our presentation: I would like to speak about some use cases. From time to time we run an IoT hackathon, for example with our customers, to better understand their requirements or to find possible solutions very quickly and to test them. And from time to time we do this also for HR, to attract new employees, or to work with students, for example at a digital summer school or similar events. Okay, here's an example, our environmental sensor. It measures temperature, humidity, pressure, and, best of all, it measures the particles: the count of the particles in the air, the mass of the particles, and vibrations. Why do we do this? Some colleagues from the digital signal boxes told us there may be a connection between pollution, fine dust and so on, and failures that occurred in our signal boxes. And I have here a really recent screenshot. Does anybody have an idea what is wrong with this? Okay, it's difficult to see. New Year, exactly: twenty minutes after midnight on New Year's Eve we see a massive rise in fine dust in our signal boxes. Nobody knows how that can be. At the moment 300 sensors are rolled out, and 15 to 20 signal boxes show this. Now we have to do some evaluation of how this can happen; maybe it's a good idea to power off the air conditioning or the ventilation system, or to check whether windows are open, to find out what is wrong. And here is another example, another use case: pest control. With pest control, we can measure the rat visits and so reduce the amount of very toxic bait.
And in summary, ThingsBoard is a perfect open source software if you would like to realize IoT projects really cost-efficiently. Okay, thank you very much. APPLAUSE We don't have time for questions, but I do want to see the diagram. Yeah, yeah, yeah. What does it mean? Please? What does it mean? Yeah. No, no, this value is below 50, which is really good. But I'm not sure how long it takes until the sensor has recalibrated. Yeah, yeah. I think the air is still too poor for this value, so maybe we should wait half an hour or so. Okay. Thank you very much. Thank you. And this thing was bought. Yeah, is it a bit like this? Yeah, the next thing you do is get up. Basically, it's a competition to open up and all the other stuff. Now, open up is more or less...
Transportr: the Past, the Present and the Future
So we are coming towards the end of the program. We have two short community talks as the final talks of this morning. And I'm very happy that we have Nicola here to talk about Transportr. Many of you probably know it as one of the free applications for public transport information. So yeah, the stage is yours. Thank you. Perfect. Can you hear me with the microphone? Or is it just for the recording? Perfect. So yeah, then welcome to my short little talk about Transportr: its past, the current state, and a glimpse into the potential future. Let me maybe start the talk by asking who of you uses public transport regularly? Great. I could have guessed that, I guess. Then some of you may know this kind of problem. You travel somewhere, there's a different public transport system, and to find your way through it they want you to download their own app, usually proprietary, from Google Play. And at some point your home screen gets cluttered with all of these apps. One alternative that you may know is using DB Navigator, which works quite well in Germany. They include a lot of regions with decent data quality as well. But first of all, since the new update I think there's no map inside anymore, which I find, well, a bit sad. And then some people found out that DB Navigator is sending data to, or connecting to, a lot of tracking services, even if you declined that. So maybe that's not what you want to use. Well, Google Maps is another option — I guess we don't have to talk about why you maybe don't want to use that either. So as you guessed, Transportr tries to be an alternative to these kinds of apps. It was created in 2013 by Torsten Grote. As you may notice from that picture, that's not me — I go by ialokim on GitHub, and I started contributing to Transportr in 2017. So when you open Transportr, it might look like this. You have a list of networks.
You can choose where you are, and then you can basically look for journeys as you would expect. So in this short talk I will, first of all, tell you a bit about how Transportr works — the internals, how we get the data, basically — and then, as I said before, the past, present, and future of the project. So first of all, these official apps: how do they work? Well, they have their data source, usually in some proprietary format, and they have apps that talk to some APIs that provide the data. In the case of Google Maps, it's a bit different. They don't use the data directly; they use a format called GTFS. That's a standardized public transport format, initiated by Google, but it's an open specification. So you can create your own GTFS files and also consume GTFS files as you want. And that's what Google uses internally for their public transport routing. Now, where does Transportr come into play? Maybe you've heard of Öffi before. That's another app for Android, developed by Andreas Schildbach. Even before Öffi itself was open-sourced, Andreas Schildbach had already open-sourced a library called Public Transport Enabler. That is basically the wrapper that contains the logic to connect to, and understand the data from, the official APIs. And Transportr is using that same library. So huge thanks to Andreas at this point for open-sourcing it and making Transportr possible. Then there's also a second part of Public Transport Enabler where you can consume GTFS files via a proxy. In that case you don't use the GTFS files directly — you don't perform routing on your phone — but you use some third-party provider. What Public Transport Enabler was using is Navitia, from a French company. They provided this service for free, basically consuming the GTFS files and then exposing them as an API to interested apps.
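The GTFS feeds mentioned here are just zip archives of CSV files with a published schema, which is what makes the format easy to consume. A minimal sketch of reading stop data with the standard library (the file layout follows the GTFS spec; the sample rows are made up):

```python
import csv
import io

# stops.txt is one of the required files in a GTFS feed: plain CSV with a header.
SAMPLE_STOPS = """stop_id,stop_name,stop_lat,stop_lon
S1,Central Station,52.5200,13.4050
S2,Market Square,52.5170,13.3889
"""

def load_stops(text):
    """Parse GTFS stops.txt content into a dict: stop_id -> (name, lat, lon)."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["stop_id"]: (row["stop_name"], float(row["stop_lat"]), float(row["stop_lon"]))
        for row in reader
    }

stops = load_stops(SAMPLE_STOPS)
print(stops["S1"][0])  # -> Central Station
```

A routing proxy like the one described (Navitia) essentially does this parsing server-side for many feeds at once and exposes the result as a journey-planning API, so the phone never touches the raw files.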
And that's actually how I got in contact with Transportr: when I was spending some time in Nicaragua, I was working there with some other volunteers to gather public transport schedule information, put that together into a GTFS file, and in the end make routing possible for a limited region, at least with apps such as Transportr and Öffi. So now that you know a little bit more about the internals, I would like to go on with the project itself and how it evolved. This is the graph of the code frequency on GitHub. As you can see, there was quite a lot of activity in the beginning: initial commit in 2013, release 1.0 in 2015. And then in 2018, with this huge spike, there was a major rewrite of the app. Most of this was done by Torsten Grote. As you can see, afterwards activity declined a bit — both Torsten and I were busy with other stuff. So this talk is actually an attempt to attract some new contributors to Transportr. Maybe you've noticed that at some point we even got removed from the official F-Droid repository, because they found out that the map library we were using was not fully open source. It became necessary to switch to an open source fork of that library that didn't include the non-free dependencies. Another thing that happened last year is that Navitia changed strategy: the new version of their software is not open source anymore, and they also stopped serving a lot of regions. At that point, for example, Nicaragua was not available anymore, which is a bit of a shame. But in 2023 we also got some new interest from the community — there were people asking about the future of the Transportr app. That brought some new energy, and we finally finished the migration to the new map library. And since one month ago, we actually made it back to F-Droid. We're back there.
But as I said before, we lack some regions that were supported before, because Navitia stopped providing them and because some APIs also broke over time. As I said, there are some new contributors, and there's an effort to move to a new design theme, which is great. There are quite some open issues; some of them are bugs and many of them are feature requests. A lot of them are actually marked as so-called beginner jobs, which means they are supposed to be quite easy to tackle. So if anyone watching this or sitting here feels like looking at some Kotlin or Java Android code, feel free to pick one of those and try to work on it. And apart from Transportr itself, it's also nice to see that the whole ecosystem of similar public transport apps is growing. There's Öffi, as mentioned before, which has been open source for some years as well. Then there's KDE Itinerary, an app that does even more things than what Transportr is trying to do, like saving your tickets. There's another Linux app, GTK-based in that case, which is pretty new and also looks really nice. And on iOS there's also an app — I'm not sure if it's fully open source — but at least there is some variety to choose from. Looking at this growing ecosystem, I think it would be nice to try to combine efforts in some way. And maybe what would also be nice is to find an alternative to what Navitia was providing before: some kind of shared service, maintained by the community, that can consume the GTFS files that are available for a lot of places in the world and provide an API that can be used by all of these apps and more. So that's it from me. I have three asks for you: if you haven't already, download Transportr, either from F-Droid or from Google Play; if you find anything that doesn't work as you want, tell us; and look at the code, contribute, and have fun using public transport. Is there one quick question? Hello there. I tried to use it to navigate yesterday.
Yes, Belgium is one of the regions where the API broke, and I think we would have to look into what kind of API they're using — so maybe feel free to look into that and contribute. Sorry, we don't have any time left for more questions. Please talk to him, please contribute, and we are moving to the next presentation. Thank you.
Software needs of a volunteer operated heritage railway
So, we're coming to our last talk for today. I'm very happy that we are closing with a real train operation — in this case, we're talking about how to do that with open source, how open source can help there. And yeah, Niels, the stage is yours. Yeah, thank you. So my name is Niels. During the week I try to make medical devices talk to each other, and on the weekend I'm playing with trains. I'm working at the Dampfbahn Fränkische Schweiz on the weekends, which you can't see because it's too bright on the beamer. For location: this is Forchheim, which some people might know from the medical industry, because that's where my employer is. The next bigger city is Nuremberg, which is somewhere around here. We have a short line, 16 kilometers. It was closed down in '74; we have been running it since '80-something. There are something like 30,000 passengers per year, so we are mid-sized for a heritage line or museum railway. We have 400 members in the club, of which 40 are actually active. We run steam and diesel every Sunday from May to October, plus the occasional holiday train or special train, but May to October is the main season. We are completely volunteer-run — we have a professional safety manager, but everything else is done by volunteers. We are a real railway running under FONE regulations, so slightly easier rules than Deutsche Bahn has, but still real railway rules. And because we are kind of the only railway in the region, a lot of local initiatives are in contact with us. There are some initiatives that want to reopen public transport on the line, which for us would be good, because then we would get a lot of Trassengebühr (track access charges), and it would also help the area quite a lot. So why am I giving this talk here? First, I want to put heritage lines on your radar generally — A, so you come and visit us, and B, because we have a lot of need for people doing IT stuff.
And the interesting thing is: some of the heritage lines have their own line, where we can do more or less whatever we want and whatever the railway authority allows in Germany. We are the perfect experimentation ground if you want to try out some stuff. If you look at Europe, we have about 100 heritage lines in Germany. The UK is the absolute mecca — they have, I don't know, far above 100. There are about 50 in Austria, 20 in France, 10 in Belgium. So all over Europe you will find some lines. They are organized in larger communities: in Germany it's the VDMT, in Austria the ÖMT, and the UK has the Heritage Railway Association, the HRA. So there are these bigger groups, and there's a European organization called FEDECRAIL. What's our problem? We like trains, but we are horribly bad at computers. These are kind of the typical members: that's me after a training shift as a fireman; this is our engineer, who in real life is a state attorney; and this was my trainer as a fireman, who in real life is a medical doctor. So I'm the only one in this picture who does something with IT, and I'm probably one of three in the whole club who does. Big problem if we want to run anything in IT. What do we do? Of course, we do the stuff that a normal railway does. We sell tickets. We run trains. We operate and repair the infrastructure — switches and signals and things. We operate and repair rolling stock — coaches and wagons and locomotives and everything. Not the usual stuff Deutsche Bahn has to take care of; there's a little overlap — we have a V60, which I think is still in operation at Deutsche Bahn as well — but most of the stock is 80 to 100 years old. And we have workshops and sheds and all the infrastructure around that you need to run a railway. But we also have a nonprofit side of things. We have archives — we do this to preserve history, so we have a lot of documentation on our trains.
We have photography, everything on the historic side of things. We are a club, a Verein (registered association) in Germany, so we have to do membership management and all the paperwork you need to do for a Verein. And we need to somehow get money for everything, so we need to organize donation campaigns and try to get funding for things. We cannot run on ticket sales alone: if you need to do a full inspection of a steam engine, we are talking about half a million euros, and that is about 10 years of running — we need to do this inspection every 10 years. And we have four steam engines, so you see the problem. Okay, we still run the railway as in the 1950s. Our line closed down in '74 and is more or less in the same state — the signalling, everything, is still like in '74. Our active members, unfortunately, are getting older. We are getting new active members again, but unfortunately the everyday workload on people has also increased, so you cannot spend all your life at the railway anymore — some people need to earn money — and this is decreasing the time that people can spend on the railway. There are higher safety requirements: even if we run the railway like in the '50s, we still have to fulfil all the safety requirements of the 2020s, which is quite challenging. Our customers want more: you cannot just say the ticket office is open from 8 till 9 on Saturdays — they want to buy a ticket on the internet. And of course there are growing regulations and administrative effort, as everywhere. So, the problems we have. Tickets: we still sell these Edmondson tickets, as you can see there — the cardboard ones — and one of our members even has a printing machine for them, so this problem is solved. But we also want to sell tickets via the internet, and there's not really a good solution. There's Fahrkartendrucker.de in Germany, which works, and works reliably.
But that thing is stuck in the '90s. If you look at the layout, it doesn't have responsive design, and it's really hard to use. The back end is quite okay, but the front end for the customers is horrible. Unfortunately, it's the only thing we have. The other option would be some kind of event ticketing software — a lot of people here probably know Pretix. Pretix is absolutely great, but not made for railways. It starts with seating arrangements: usually we want an algorithm that doesn't put everybody at the windows, but that, if you book multiple places, gives you one window seat and then fills up the bay before the next window seat is given out — because otherwise all the window seats get taken first and the rest stay empty, because people say, I want a window or I don't go. There are hundreds of bachelor's and master's theses on how to do a ticket-selling system, but none of them has made it into open source software, unfortunately. So: Fahrkartendrucker works, but help with using it is needed. Running trains: for timetables, it's pretty simple for us — we have one line, one train, so we have used the same timetable for 20 years. That works. But if things get more complicated, we might want to run two trains, and then we need a serious timetable. We now use jTrainGraph and FPLedit, which were made for model railways, and they work really well. The FPLedit author also added GTFS export now, so we might show up on Google Maps hopefully soon — and on OpenStreetMap and Transportr and all the other apps which can use GTFS. So there is probably also some larger software that is interesting. Then we have things like signalling. This is our signal box — the complete signal box. The other safety feature you need to know about: there's a key. This is something where we can improve. It might be as simple as putting a GPS tracker on the train, which then has the other problem:
There's no mobile phone reception on our line, because we're in Germany and in the middle of nowhere. So there are lots of areas where, for example, the IoT people could have a field day. We could have passenger information systems, whatever — there are a lot of places where you could create new software that would help us a lot. Managing rolling stock: right now, train cars have regular inspection dates and a lot of paperwork attached, and this is managed in a Nextcloud and an Excel sheet — and that's already the advanced technology solution; usually it's paper folders. So there's a lot to do there: we have reports, regulations, whatever. It's kind of a nightmare right now. But we did get good feedback from our regulating body, because we handed them readable PDFs and they said they were better than what they get from Deutsche Bahn. So what's our problem? Basically, on one side is the museum railway half, where we need to know our problems — we still don't really understand our problems well, so we need to get better at that — we need to find the solution, and we need to be able to apply the solution. That will be the big thing. The other side is the software side of things: it needs to fit our problem, and we need to be able to find the software — if you search for ticket systems, you will find Jira and all that stuff, so it's completely unsearchable — and it needs to be really easy to use, because we are not good at computers. This whole thing started at the Gulasch Programmier Nacht, actually last year. We did a workshop at the Chaos Communication Camp, and a small group formed trying to get the IT nerds and the railway nerds together. That's also what I want to present here. For you: why should you bother if you don't like playing with trains? Playing with trains is fun. But you could also use a museum railway as a learning ground.
So if you work in software and want to do something for transport but don't know how a railway works, we are a place where you can learn that. We are an experimentation area — you can do a lot of stuff on museum railways; the railway regulations are quite open for experimentation, and coming from medical devices, where you can't do anything, I'm really surprised what's possible sometimes. And you can use us as a test bed where you have the simple case: one line, one train — or one line, two trains if it gets more complicated — but without the full network that DB has. How can you join? To have a super easy entry point, we created a Discord chat which you can just join. We are currently four or five people, so still small; we hope to grow. We have a wiki at kaosban.net, which was the original idea dumping ground and is now being formalized into a knowledge base where we try to collect the problem cases and the possible solutions — and where there aren't solutions — to get an overview of things. And we're starting to network with the different heritage line associations: at the VDMT meeting in three weeks in Aachen, I will present basically this talk again and do the same publicity for heritage lines there. So, I'm done. Join us on the Discord if you like — we're open for crazy ideas. And if you want to play with trains, there's a museum railway near you which will normally welcome you with open arms. There's one question. Yeah. More information: I started this with the NeTEx presentation. In Norway, we have six museum railways that are using the tool mentioned to produce NeTEx data, so they are integrated in the national trip planning. Yeah. Yeah, I made a lot of notes during your talk. So, just repeating for the video: in Norway, there are six museum railways already using the NeTEx tool from the first talk.
So if you haven't watched the first talk, do. Question: do the regulations on station visibility — that the station has to be on a straight section of track — apply to the historic railways? We're at the limit of what I know here. I think we have some kind of heritage protection, so that existing things can stay. But for example, we have one halt in a curve which we cannot use anymore, because the border of the platform isn't really there — it's just a meadow which ends at the track — and we would need to put a clear border between the platform and the track. So there are some regulations and some safety rules, but I think not 100% of what the big railways have. In the Czech Republic, unfortunately, a lot of towns have lost their railway service because of this regulation. Yeah — so for the video, the question was whether the regulations for stations — that they must be on a straight part of track and everything must be visible — apply to museum railways; and in the Czech Republic a lot of towns have lost their railway access because of that. One last question. Thank you for the presentation, I really enjoyed it, and I also have a lot of ideas. Would you consider getting into the DB network? For example, in Italy there's the foundation for historic trains, and you can actually buy tickets for its trains within the national ticketing system. So the question is whether we have considered joining DB's networks or DB infrastructure as well. Not really. For ticketing, for example, we didn't consider that because it has worked so far — it's a lot of manual work, but somehow it works. And everything external brings external costs, because DB doesn't do stuff for free, while we get work time for free. So if we can do it by hand or have to pay for it, then we do it by hand. But they do have it on their radar — there might be something; I haven't really looked at that.
But for tracks, for example: if we go out, we are running on DB tracks, and we have to join their tools and work with their tools to get into the timetables. Okay, great. Thank you, Niels. Thank you.
Shig: distribute and clone live streams among Fediverse instances
How is it possible? This is about interactive live streaming in the Fediverse — how is this possible, or is it possible at all? About me: I'm Enrico, and I'm interested in interactive live streams. Sorry. So, now it's better — I'll take it like this. Here are my contact details; I have worked for different companies, mostly on conference-system topics. And now we're talking about live streams. In the Fediverse there's quite an interesting situation. When you're in the Fediverse — for example, when you're on Mastodon — you read a post. The interesting point here is that the post comes to you. You have a Mastodon app or client, and you don't care who posted the post on which instance: the post itself is cloned from instance to instance through the Fediverse, meaning you get a copy or a clone of this post. This is quite an interesting concept: the instances communicate with each other in the background. How do they do this? Of course with ActivityPub — we had a talk about it right before this one, so I will not go deep into it, but the main idea of ActivityPub is that you have an inbox and an outbox. And everyone in the Fediverse, in terms of ActivityPub, is an actor: the users are actors, the servers are actors. In the end, you can send a message or a post to every actor in the Fediverse, and that's how it works. ActivityPub describes things as activities — like subscribe, follow and so on — and the other part is content. It's all described in JSON. As I said, the instances communicate with each other in the background, and the content flows through the Fediverse. ActivityPub and live streams: there are already implementations of ActivityPub with live streams in the Fediverse, Owncast and PeerTube being the most famous. But the thing is, we want a little bit more. You have live streams in Owncast and PeerTube, but they are not interactive. It is not possible.
It means that without leaving your PeerTube instance or your Owncast instance, you cannot interact with another stream or another instance. It's not possible. And that leads to a problem called scaling in the Fediverse. In the end, more or less every instance provider in the Fediverse is responsible for himself: you have to scale on your own. You have possibilities, of course — with hardware, with an HLS CDN on top, or with object storage. Those are the common ways you can increase the number of users that can watch you. But in the end, you stay alone, more or less. PeerTube tries to solve this problem with P2P Media Loader, which is quite awesome. Sometimes you see it: you're watching a video and you see that other people are watching it with you. PeerTube peers exchange the HLS chunks with each other, BitTorrent-like, over WebRTC — you make a real peer-to-peer connection to the other viewers. I put it at the top because this is the most common way in the Fediverse to share live streams. There are other ways as well, but mostly it's based on peer-to-peer in the browser; there's also WebTorrent in the background. PeerTube can also clone videos from one server to another server — that is possible. And the new concept is remote runners, which is quite awesome: you can scale PeerTube with a remote runner, meaning you can run other services that do the transcoding for you, which is otherwise quite expensive. These are the possibilities you have to scale your application or your instance. Owncast has a quite interesting setup. Owncast's general concept is that you have a server and you only stream for yourself. But they have a dashboard, and on the dashboard you can see every live stream at that moment. This dashboard is nothing more than an HTML page with links to the live servers — it's like a list of links.
It doesn't really scale, because when you watch a stream there, you're still watching it from that server. That is the current state. But what we have now is ActivityPub, and it is possible to share the information that there is a live stream — this already works with PeerTube. There's a live stream, but you cannot share the stream itself. And what we want is to share a live stream, and the live stream should be interactive. An interactive live stream is a bit more than a stream with fixed tracks, like one video and one audio. We want a stream whose tracks can change: you add new tracks, you remove tracks, you enable tracks, you disable tracks, and the tracks come from different sources, different instances. When we can achieve this, then we have interactive live streams in the Fediverse. It's not just sharing a static stream — it's a bit more; it's like a conference in the Fediverse. And we already talked today about the protocols called WHIP and WHEP. Of course we need a real-time protocol — it's clear we need WebRTC. And on top of that there's this interesting approach, WHIP and WHEP. In short, what are WHIP and WHEP? You make an HTTP request to a server and receive a WebRTC resource. That's it. No complicated signaling, only an HTTP request. It's a little bit like ActivityPub: you make a request and you get a resource back. With the first one (WHIP), you make a request to offer a resource: hey, I have a resource here you can have. With the second one (WHEP), you make a request to subscribe to the resource. That's the only difference — that's the main idea. Here it is in a bit more detail; you can ignore most of it, only these two requests are important.
You offer something with an HTTP request, of course, and you get something back — and then you have all you need for the resource. Finished. And then you can build this kind of architecture: you, as a client, offer a resource to an endpoint, and that endpoint offers it to the next endpoint. This works for WHIP, and turned around for WHEP as well — you can establish something like a pipe. It sounds really great, and then you can clone streams: to clone a stream, you only send a request to an endpoint — give me this, send it to another endpoint — and it's cloned to another site. That's it. However, there's a problem: WHIP and WHEP are static. You cannot update the resource. Once you have offered the resource with a request, you get an SDP, and you cannot update that SDP anymore. It is static. That means you will receive the tracks that are in the stream at that moment and nothing more — no way around it. You have a static resource. That's fine for plain live streaming, but we want interactive live streaming, where the resource changes. This is quite important. So we want a bit more dynamism inside WHIP and WHEP; this alone is not enough for us. Our trick consists of two things — there are a few smaller things too, but these are the two main ideas. First: when you subscribe at an egress endpoint and receive a resource, you have to subscribe to a channel as well. You get a channel along with it, because you need the channel to learn that the egress resource — the resource you're receiving — has been updated. This is the first thing you need; without it, it's not possible. Normally, in a conference system, you would do this with a signaling server: when your resource is updated, you get a new SDP. But we only want REST; we have no WebSocket server. So you need to establish an extra resource, a channel, to receive this information.
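The WHIP/WHEP exchange described here really is just one HTTP POST carrying an SDP offer, with the answer coming back in the response body. A minimal sketch of building that request with the standard library (the endpoint URL, token, and the truncated SDP body are placeholders, not values from the talk):

```python
from urllib.request import Request

def build_whip_request(endpoint, sdp_offer, bearer_token=None):
    """Build the single HTTP POST that WHIP/WHEP uses instead of a signaling channel.

    Per the WHIP/WHEP specs, the offer is sent with Content-Type application/sdp;
    the server replies 201 Created with the SDP answer in the body and a Location
    header identifying the resource (used later to DELETE, i.e. stop the session).
    """
    headers = {"Content-Type": "application/sdp"}
    if bearer_token:
        headers["Authorization"] = "Bearer " + bearer_token
    return Request(endpoint, data=sdp_offer.encode(), headers=headers, method="POST")

# Placeholder endpoint and a deliberately truncated SDP offer:
req = build_whip_request("https://example.org/whip/room1",
                         "v=0\r\no=- 0 0 IN IP4 0.0.0.0\r\n")
print(req.get_method(), req.full_url)
```

The "pipe" the speaker describes is then just chaining these requests: an endpoint that received an offer turns around and plays client against the next endpoint.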
The second point is that you have to annotate the tracks. You have to know what each track is — for example, whether it is the main track or a guest track. Here, Shig uses the SDP media-title attribute. It's not used much normally — some people use it for the title of a track, for example — and here it's used for meta information: for example, that the track you receive starts muted, or that the track is the main track or another track. The rest is ActivityPub; you rely on that. Shig itself is an instance written in Go, based on Pion. It comes with a JavaScript SDK, and the front end you get is a web component — not an iframe, a web component. And this SDK is implemented in a PeerTube plugin, because Shig itself does nothing else — it only does this exchange. It looks like this: you have a PeerTube instance on the left side and a PeerTube instance on the right side. You start your stream here, and you want to invite people from another instance. This PeerTube instance is connected to a Shig instance, and that one to a completely separate Shig instance — they are not related to each other. The user over there, with his Shig instance and this protocol in the background, can exchange and communicate, like in a conference — but this is a stream. And on this side, the owner is streaming it. It's then transcoded to RTMP, and from RTMP to HLS. At the moment I don't have direct HLS transcoding, but theoretically you could transcode from WebRTC directly to HLS — it's just not implemented yet. Yeah, let's have a look at how it looks. I think I have — yeah, this one. So, I have here the two PeerTube instances. I'll arrange them like this. Depending on the time, I have already created a live stream, but we can also do it directly now, because we have more time. Sorry.
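The track annotation described above could be consumed like this — a sketch that pulls the per-media title (`i=`) line out of each `m=` section of an SDP body. The label format ("main", "guest muted") is an assumption for illustration, not Shig's actual scheme:

```python
def track_labels(sdp):
    """Map each m= section's media kind to its media-title (i=) line, if present.

    In SDP (RFC 4566/8866), an i= line directly inside an m= section is the
    'media title'; here it carries track metadata, as the talk describes.
    """
    labels = {}
    current = None
    for line in sdp.splitlines():
        if line.startswith("m="):
            current = line[2:].split()[0]   # media kind, e.g. "audio" or "video"
        elif line.startswith("i=") and current:
            labels[current] = line[2:]      # annotation, e.g. "guest muted" (assumed)
            current = None
    return labels

sdp = ("v=0\r\n"
       "m=video 9 UDP/TLS/RTP/SAVPF 96\r\n"
       "i=main\r\n"
       "m=audio 9 UDP/TLS/RTP/SAVPF 111\r\n"
       "i=guest muted\r\n")
print(track_labels(sdp))  # -> {'video': 'main', 'audio': 'guest muted'}
```

Riding on an existing SDP field keeps the whole exchange inside the plain WHIP/WHEP request, which is the point: no extra signaling messages are needed to tell a receiver which track is which.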
I'm not sure how familiar you are with PeerTube. Here, inside PeerTube, I have the Shig plugin — this one — and you can configure it. This field points to the Shig server, called stream.shig, so it knows that one. Theoretically, you can use this. And the other one — let me see — yeah, this is the other one, with the same plugin. But it points to another Shig instance, a completely different one, running on a different server. They are completely separate from each other. See, this PeerTube instance follows that PeerTube instance. You see? That means this one gets all live videos from the other one, cloned. And, of course, each one's own Shig instance follows its PeerTube instance. The communication between Shig and PeerTube runs over ActivityPub. So when the PeerTube instance gets a new live stream, the Shig instance gets it as well, as a copy over ActivityPub. That's the idea behind it. The implementation is, well, borrowed from Owncast — it's exactly the same, because Owncast has a cool implementation for it. That's a good idea: Owncast and PeerTube together, I just want to mention that. So, what we can do now is create a live stream, right now. Like this. I hope I have time — yeah, I have time. Make it permanent; it makes no difference. One interesting point when you create a live stream: the latency should be as short as possible. PeerTube can do a nine-second delay or so; nine to fifteen seconds is the shortest PeerTube achieves. When we're talking about interactive, it definitely can't take 30 or 60 seconds — that's too much. Okay. What we can also do is invite the other guy from the other instance. What you have to know is the ActivityPub ID of this guy — this one. Now we create a live stream. I hope so. No, we don't create a live stream.
I have to update the live stream. Sorry, my mistake. So now we have a live stream; here it's online. And in the back, I have to take this one, because I haven't figured out how to find this live stream on the other side. Maybe someone will explain. Now ActivityPub has synced it to both, so we have the live stream on the other instance as well. So when I have this one, I'm logged in as user one-two-three, I can access it now here. Now I'm in the web component. It's a web component rendered in PeerTube from the plugin; it's not an iframe. And I can do this here as well. So now there are two guys in two different streams, but they are not connected at the moment. First, they have to join. He's joining, and he's joining as a guest. Takes a while. So let me see. So now we can do it. And of course we want the other guy to see something. The internet is a little bit slow now, sorry about this. Now they're both on different Schick instances, different SFUs. And the SFUs communicate with each other, established with only a REST endpoint, and exchange the information you need, like mute and unmute. And the data channel for the egress component is established. And even when I, let me come back, even I can do this one. Sorry, no, I can't; the connection is bad. So you see the other side. Now I have the tracks mixed, so I can even mix the live stream, and then all is working fine. In theory, if my internet doesn't go down, I can go live with this as well. Let me see that he can see this live as well. One moment. I think it's here. Yeah, it was here, somewhere here. This one should be. Yeah, now we are live as well. Okay, sorry, the internet is not so good. Yeah, that's it. And so we have established a cloned stream between two instances in the Fediverse. That's it. Yeah. Yeah, question. I'm curious. I've worked a little bit with ActivityPub, but not super in-depth.
I'm curious if there's like a live-stream post type in ActivityPub, such that other implementations, like a Mastodon server or something, could play this live stream, or does it just look like a link to a live stream? How does that work? The question is whether there is an ActivityPub attribute or something like that inside, right? I'm not sure. You have the content type of Video inside, and you have as well the annotation that it's a live video or not. This comes from PeerTube itself. So, inside the JSON there is only the host server. That means when you share this JSON with another PeerTube instance, you get a description of who the owner is, which actor is the owner of this live stream, and where the home server, the home instance, for this live stream is. That is all we have inside. And then, Schick annotates this with extra attributes, like who the guest is, and which Schick instance hosts it. Because you can only follow another instance with Schick when your own instance also has a Schick instance. When you don't have a Schick instance, there is no button to join; you have to go to the other instance. These are the mechanisms behind it. I think, what was the question? Yeah. Okay. Yeah. This only works when both instances have a Schick instance. And this is supposed to work for Owncast as well, because it makes no difference; only the front end is needed for Owncast. And this is the main idea behind it: that you have a way to scale your streams in the background with extensions, based on ActivityPub. Perhaps an interesting point, it's a little bit controversial: you can use this kind of technology for, I will not say advertisement, but for recommendations. When you have live streams, often you have the problem that you want to inform other people that you have live streams as well. Other people didn't know about you.
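To make the answer above a bit more concrete, here is a rough, purely illustrative sketch of what such a federated live object might carry. All field names here are assumptions for illustration, not the actual PeerTube or Schick schema: the talk only says the JSON carries the content type, a live annotation, the owner actor, and the home instance, and that Schick layers guest annotations on top.

```python
# Hypothetical shape of the federated live-video object; every field name
# here is illustrative only, not the real PeerTube or Schick vocabulary.
live_video = {
    "type": "Video",                                  # ActivityPub object type
    "isLive": True,                                   # annotation: this is a live video
    "attributedTo": "https://tube-a.example/accounts/streamer",  # owner actor
    "homeInstance": "https://tube-a.example",         # where the live originates
}

# Schick-side annotations layered on top (again, assumed names):
schick_extra = {
    "guests": ["https://tube-b.example/accounts/user123"],
    "schickInstance": "https://stream.example",
}

federated_object = {**live_video, **schick_extra}
```

A remote instance without its own Schick instance would simply ignore the extra attributes and see a regular live Video object, which matches the behavior described above.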
And here you have something like a pool where you can add streams and then switch between them during the live stream. Because in the back it's just a live stream and an active live stream, nothing else; you simply have different kinds of sources from different Fediverse instances. And those kinds of things are then possible. Okay. Yeah. You mentioned that you're using data channels to exchange information in the background. What exactly is sent over the data channel? Renegotiation: the SDP. I have the egress endpoint; the receiving endpoint needs a data channel from the offerer of the resource. The question was what goes through the channel: the SDP, and the mute events as well. Yeah. This is coming soon. Yeah. What's the reason for the delay, the big latency? First, the network here, I guess. Second one, no, most likely the network. I have this one here. One moment. When you have this one, I hope I'm still online, I'm not sure. This delay, what you have here, this is bigger. This comes from the transcoding from WebRTC to RTMP. That is not optimized at the moment. This is the reason for this delay. But the rest, I think, is the network. I guess. So it's not WebRTC to WebRTC? It's converted somewhere? It's like this. You have WebRTC to WebRTC. Which one do you mean, between the servers or between the...? On the right-hand side, the video is quite delayed compared with the left. Yeah. This one. Yeah, there's a big delay at the moment. Now, the thing is, in this case, you have three WebRTC connections. One is from the client. Maybe I can show you this here in the slides. Sorry. You have three connections: one to your Schick instance, one from the Schick instance to this one, and one to this one. It's like a pipe. And I guess this one was quite fast because they are in the same location.
But I guess this one is making trouble at the moment. I guess. Yeah. Some other questions? Yeah. I missed part of the presentation, sorry about that. As far as I understood, you are using WHIP and WHEP as a way to get those two to communicate with each other. So, as I was saying before, in the last year the WHEP specification basically forces the client to create the offer, which makes doing it the other way around impossible within the specification. Are you using the old mode, where you were expecting an offer? How are you dealing with this synchronization, where you have to wait for an offer and such? Yeah. I'll try to repeat the question. WHIP and WHEP, I think, have two options. First, you send an offer and get an answer back. And the second option is you say, hey, I want an offer from you; then you get an offer and you send the answer back. What is the difference between these? For the first, you need only one request. It's just one POST request: you send an offer and get an answer back inside the response. For the second option, you first send a POST request, get an offer, and then send the answer afterwards with a PATCH, I think it's a PATCH, something like this. I implemented the second one, because I implemented it in June, and I think now a new version is out where only one request is supposed. Yeah, for WHIP I only need one request, that's right. But because of what we do here, I don't use WHIP and WHEP exactly how they're supposed to be used, because I need to do things dynamically, so I establish a WebRTC data channel as well. That is additional. Okay. Yeah. Yeah, if there are no questions anymore, then thank you for watching. Thank you. Quite interesting. Yeah, because you were talking about this problem already. I wrote a long post because I liked the old mode. I liked the way it did things; federation is possible thanks to that mode. We'll just leave a couple of minutes to sit down. Yeah.
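The two negotiation modes discussed in that answer can be sketched with an in-process mock. There is no real HTTP, SDP, or WebRTC here; the endpoint class and function names are invented, and the point is only the shape of the exchange and the number of requests each mode needs.

```python
# Toy model of the two WHIP/WHEP negotiation modes: client-offer (one POST)
# versus server-offer (POST to fetch an offer, then PATCH with the answer).

class MockEndpoint:
    """Stands in for a WHIP/WHEP resource; bodies are plain strings."""
    def post(self, body):
        if body:                    # mode 1: the client sent the offer
            return "answer-sdp"     # the answer comes back in the same response
        return "offer-sdp"          # mode 2: the server creates the offer
    def patch(self, body):
        return "ok"                 # mode 2: the client sends its answer via PATCH

def mode_one(ep):
    """Client-offer mode: a single POST carries the offer out, the answer back."""
    answer = ep.post("offer-sdp")
    return answer, 1                # one request total

def mode_two(ep):
    """Server-offer mode: POST with no offer to request one, then PATCH back."""
    offer = ep.post("")             # "hey, I want an offer from you"
    ep.patch("answer-sdp")
    return offer, 2                 # two requests total
```

The extra round trip in the second mode is exactly the synchronization cost the questioner was asking about.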
Getting AV1/SVC to work in the Janus WebRTC Server
Well, welcome everybody. Lorenzo here needs no introduction. He brought a crazy contraption to give his presentation with; it's almost a dangerous demo in and of itself. Yeah, yeah, easy. And he'll be telling us all about AV1 SVC. Let's go for it. Yeah, you can hear me, right? Yes. So thanks for the introduction. So I'll be talking specifically about AV1 SVC. I'll go into some technical details, so it may be boring here and there, but I really think it's important in order to get a better understanding of how it all works. And this is just a quick introduction about me. So I'm one of the co-founders of a small company based in the south of Italy called Meetecho. I'm the main author of Janus, which is an open source WebRTC server. And there are some links if you want to get in touch with me or learn more. And basically what we'll be talking about today is AV1. If you're not familiar with what AV1 is, it's a relatively new video codec that was designed within the context of the Alliance for Open Media, which has a lot of companies behind it: Apple, Cisco, Google, really a ton of them. And what they really wanted to do was create an open and royalty-free video codec, and of course emphasis on open and royalty-free, because we don't want another H.264 or H.265. It was specifically designed for real-time applications, pretty much like Opus was also designed as a codec for the internet. So that was a quite important innovation, with support for higher resolutions, so 4K and beyond. And most importantly, it was also conceived to have support for SVC baked into the codec specification itself. And that's quite important, because some other codecs support SVC as well, but many times it comes as, let's say, a later addition: codecs are extended to have SVC supported. In this case, AV1 was conceived with native support for SVC.
So all AV1 implementations are supposed to at least be able to decode an SVC stream, for instance, which is important when you start working with hardware decoders and things like that. And of course this got me, and should get you all, very interested, because these are all very interesting features to have for different reasons in WebRTC. And SVC is important for a few different reasons. We all know what simulcast is: you use a single m-line to basically carry multiple quality streams. You have a high, medium and low quality stream, all sent at the same time, so that different qualities can be distributed to different participants as needed. But with simulcast, each stream is encoded as a separate stream, which means that each stream is also decoded independently of the others. This does mean that you have to encode the same source more than once, and the fact that they are decoded independently can also cause some challenges sometimes. With SVC instead, you still use the same media source, the same m-line and so on, but the different qualities, so high, medium, low, whatever, are all layers of the same thing. So you have a single video stream that, like an onion, has different layers, and each layer provides more detail, if you want to look at it that way. And so the key difference between simulcast and SVC is that with simulcast, since you have different streams, you also have different SSRCs; each quality is a separate RTP stream. With SVC, all layers share the same SSRC. So as far as the recipient is concerned, it's just a single stream, which means that it requires less bandwidth, because you can pack some things up, and it's more of a layered kind of approach. It is sometimes more CPU intensive in terms of encoding, because that's a bit more tricky, but it does have some advantages over simulcast as a consequence of that.
And an interesting aspect is that simulcast, as we know it in WebRTC today, actually already made use of SVC somehow, because when we do, for instance, VP8 simulcast, and then we mention temporal layers, temporal layers are not a feature of simulcast. Temporal layers are a feature of SVC. So we are basically using a feature of VP8 that allows us to use partial SVC functionality, where we can have different frame rates over the same RTP stream that we are handling. And this is just summarizing it from a visual perspective. So you have simulcast sending three different streams, and then an SFU in the middle can choose which stream to send to other participants. With SVC, we have one big thing that has many layers. One participant may want to receive them all, another participant may only want to receive the medium layer, and another participant may want to receive the lowest layer possible. This is just to give you an idea from a visual perspective instead. And so I was very interested in implementing it in Janus, and here are a few links if you want to learn more about Janus itself. And so I started to figure out what I needed to do in order to get that working. So first of all, of course, we need a way to negotiate AV1 in the SDP, and that's of course a given. It may be helpful also to be able to detect keyframes in the stream, and that may be helpful for different reasons. For instance, when you are doing simulcast as a server, it helps to know whether a packet is a keyframe or not, especially if you want to switch on a keyframe or things like that. It's also important to be able to somehow interpret how the AV1 frames are spread across RTP packets, and for us it's especially important for our recordings, because when we record stuff in Janus, we just record all the RTP packets that we received, so that we can go through them later on.
And so basically getting a recording in a playable format just means reordering all the RTP packets I received, getting the AV1 frames out of those RTP packets, and then putting them into an MP4 file, to make an example. And this means that we need to know how AV1 fits within RTP, and we'll show how that works later. For SVC specifically, there is another important thing that is called the dependency descriptor, which I'll talk about in a minute. And so that means that we also need to somehow support that in the server as well, which first of all means negotiating it, as RTP extensions must be negotiated in order to be used. We need to know how to parse an extension of that sort, and then we need to figure out how to use the information that we receive in that extension. And as we'll see, point five is the one that got me the most in trouble, and I'll explain later why. But starting from negotiation, it's very easy: you just negotiate the codec name and the related clock rate there, so that's easy. Detecting keyframes and basically being able to extract frames from packets is a bit more complicated, but that's because we need to start delving a bit deeper, and so figure out how AV1 is packetized over RTP. And that's actually something that's true for all codecs. So for all codecs, you need packetization rules, and that's especially true for video, because for video you typically have larger frames, and RTP packets cannot be that large. They are usually limited by the MTU size and so on. And so you need to have some rules that tell you, if you have a frame that is this large, this is how you split it across multiple RTP packets for this codec, this codec, and this other codec. And usually there are some similarities, but each codec has its own rules, mostly because of the nature of the bitstream, let's say.
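Going back to the negotiation step for a moment, the relevant SDP lines might look roughly like this. The payload type number and the extension ID are arbitrary choices here, and the dependency-descriptor extmap line anticipates the extension discussed later in the talk:

```
m=video 9 UDP/TLS/RTP/SAVPF 96
a=rtpmap:96 AV1/90000
a=extmap-allow-mixed
a=extmap:12 https://aomediacodec.github.io/av1-rtp-spec/#dependency-descriptor
```

The `a=rtpmap` line carries exactly the two things the talk mentions: the codec name, `AV1`, and its 90 kHz RTP clock rate.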
And this is an activity that the IETF typically carries out in the AVTCORE working group, because basically all packetization rules in RTP and WebRTC are standards. Unfortunately, for AV1 it did not happen in the IETF, so they came up with their own specification, which is provided here. So in this specification, they provide information both on the AV1 aggregation header, that is, those packetization rules that I mentioned: how do I split an AV1 frame over multiple RTP packets, and how do I get that same frame back when I have access to the RTP packets on the other side? And it also talks in great detail about this dependency descriptor, which is a beast of its own, as you can see. And this is basically how it looks from a visual perspective. So with RTP, you typically have an RTP header with all the usual stuff that you all know. You can have some RTP extensions in there, and this is where the new RTP extension would appear. And then you have the RTP payload. And the RTP payload is where this aggregation header plays a role, because, as we mentioned, we cannot just dump an AV1 frame in there, because it may not fit. And so we need to have some sort of information that tells us how an AV1 frame is actually split, or, if there is more than one AV1 frame in the same packet, we need to know that as well. And the AV1 aggregation header is fairly simple, because it's just a single byte with a few bits that you can set. I will not go too much into the detail, not to bore you, but it's information about these OBUs, and an OBU is basically the equivalent of a NAL for AV1. So if you know what a NAL is for H.264, an OBU is the same thing for AV1, more or less. So it's basically a unit of a frame.
And then basically these bits tell you whether or not an RTP packet that you just received is a continuation of a previous frame, so that you know that whatever you're receiving now has to be appended to whatever buffer you had before; and whether or not this frame is complete, that is, whether you have to wait for something else before passing it to the decoder. You may have some information about how many OBUs are in place, which is actually optional, and we'll see why in a second. And then this bit tells you whether the packet that you received is the beginning of an AV1 frame. Again, all of these pieces are very important when you have to reconstruct the AV1 frame when you receive it, so that you know this is the first thing that you have to put in there, then you append this here, this here, this here, and eventually you end up with the complete AV1 frame. And basically it looks a bit like this. So in this case, for instance, we are actually aggregating multiple OBUs in the same RTP packet, and in this case we are not specifying how many elements there are, which means that for each OBU in there, after the aggregation header, we have a variable-size element that tells us how long each OBU is. So in this case we just go sequentially: aggregation header, we know there are some elements, we check the size, then we read exactly that amount of bytes, and this is the first element; for the second element we read its size, and we go on and on and on. And the W attribute over here allows us to save a tiny bit of space when you use it, because if you say that, for instance, there are exactly two OBUs in this packet, then you only need to provide the size of all the elements except the last, because you can read them sequentially by checking the variable-size length until you get to a certain point.
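A minimal sketch of the depacketization walk just described, assuming the bit layout from the AV1 RTP payload specification: the one-byte aggregation header (Z, Y, W, N from the most significant bit down) followed by LEB128-prefixed OBU elements. The function and field names are my own.

```python
def parse_aggregation_header(byte):
    """Parse the single-byte AV1 aggregation header."""
    return {
        "continues_previous": bool(byte & 0x80),  # Z: first element continues an OBU fragment
        "continues_in_next":  bool(byte & 0x40),  # Y: last element is fragmented onward
        "num_obu_elements":   (byte >> 4) & 0x03, # W: 0 = every element carries a size field
        "new_coded_sequence": bool(byte & 0x08),  # N: first packet of a new coded video sequence
    }

def read_leb128(buf, pos=0):
    """Decode one LEB128-encoded size, as used for OBU element lengths."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not (byte & 0x80):                # high bit clear: last byte of the size
            return value, pos
        shift += 7

def split_obu_elements(payload, num_elements):
    """Split the payload after the aggregation header into OBU elements.
    If num_elements (the W field) is non-zero, the last element carries no
    size prefix: it simply runs to the end of the packet."""
    elements, pos = [], 0
    if num_elements == 0:                    # every element is size-prefixed
        while pos < len(payload):
            size, pos = read_leb128(payload, pos)
            elements.append(payload[pos:pos + size])
            pos += size
    else:
        for _ in range(num_elements - 1):
            size, pos = read_leb128(payload, pos)
            elements.append(payload[pos:pos + size])
            pos += size
        elements.append(payload[pos:])       # implicit size: whatever bytes are left
    return elements
```

For instance, a header byte of `0x28` says there are two OBU elements and this packet starts a new coded video sequence, which is the kind of single-byte check the talk later mentions for quick routing decisions.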
When you get to the last element, you know that all the bytes that are left are associated with that element, so you don't need that additional variable-size field in there, and you save a bit of data. Maybe not that much, but in some cases it may be helpful. And to use the aggregation header, I mentioned that it can be helpful in a few different cases. In my specific use case, I basically interpreted "not a continuation" plus "first packet" as more or less a keyframe. It's of course not really always like that, but it at least gives me the beginning of something, which is something very quick and simple to use when you're just routing stuff. You read a single byte and make some decisions based on that, for instance when you need to do some simulcast-related switches. For recordings, I needed to do something more complex, because, as I mentioned, we need to traverse the RTP packets and reconstruct the OBUs and the AV1 frame before we can put it into an MP4 file, which means that I had to actually implement all those depacketization rules accordingly. And I also had to implement the parsing of a specific OBU in order to get some additional information, like the video resolution, because if I'm creating an MP4 file, I don't need to decode the frames, but at least I do need to know how large the video is so that I can put it into the MP4 header, for instance, or maybe use the RTP headers to figure out roughly the frame rate, that sort of thing. And all that I've mentioned so far is really all that you need if you want to use AV1 normally, just as a regular codec, because with simulcast all streams are independent of each other. So if I want to go from high to low, I can just move to the SSRC with the low quality stream, and I don't need to do anything else. The low quality stream is encoded separately from the other one.
I don't need to know anything about that other stream; they're completely independent. With SVC, that's not always true, because you may have some dependencies in place. So if I want to forward, for instance, the highest quality layer, since we are talking about an onion, it will very likely depend on one or more packets from the medium layer and the low layer, which means that I may have to forward those too, otherwise the high quality layer will not work, because it alone is not enough to decode something. And these are all things that you need to figure out at runtime, because you have a stream coming in and you have to make a decision right away, otherwise you cause delays and so on. And most importantly, most of the time you may not even be able to parse the payload, because, for instance, if insertable streams are used and the stream is end-to-end encrypted, you cannot have a look at the payload to see what is what. And this is what the dependency descriptor is for. The idea is that you have an external component, an RTP extension, that contains all the information related to the packet that you just received. And this one would not be encrypted like the payload itself, and so it's something that an intermediary like an SFU can use to do something. And this is just one example that comes from the specification over there. There are really a ton of examples. In this case, this is an example of how L2T3 dependencies work. L2T3 means two different spatial layers that depend on each other and three temporal layers. So two video resolutions and maybe 30, 20, 10 frames per second. And this gives you an idea of how the dependencies work as frames go by. So this is the first frame, second, third, fourth, and so on and so forth. And so you'll see that in this specific kind of approach, the first packet you receive will be related to spatial layer zero, temporal layer zero. And pretty much everything depends on this packet over here.
And then if I want spatial layer one and temporal layer zero, I definitely need to relay this packet too, otherwise that one will not be able to be decoded. And basically you follow the arrows and you get an idea of the kind of dependencies there are, so that you can choose which packets you can actually drop or not. And as you can guess, the problem is, as an SFU, how do I know this? How do I know that this is what is happening and these are the dependencies that are in place? And this is basically what the dependency descriptor provides, and I'll explain how in a second. And so, continuing from the requirements that I described before, it means that if I wanted to have support for this component in Janus, and this is true for every WebRTC server out there, again, I need a way to negotiate the extension. I need to somehow parse it, so I need to know how it is encoded so that I can figure out what is in there. And then I need to find a way to use it, for instance to recover those dependencies. And I thought that negotiation was supposed to be the easy part, but it's actually not that easy, because of course you just need to negotiate that extension with that name as an additional extmap; that's how it works for all extensions in the SDP. But it turned out that I also needed to support the so-called two-byte header extensions, using extmap-allow-mixed. And this is because RTP extensions by default are supposed to be quite small. So you usually have the so-called one-byte header RTP extension, where in one byte you provide some information, which means that the length of the extension is limited as well. Since you are using one byte to convey a lot of information, the size of the extension itself cannot be more than, if I'm correct, 16 bytes or something like this; I don't remember now exactly. And the dependency descriptor can be much larger than that.
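The one-byte versus two-byte element formats just mentioned come from RFC 8285, and the size limit falls straight out of the bit layout. A small sketch, parsing only the element list inside the extension block (not the surrounding block header); function name and dict shape are my own:

```python
def parse_extension_elements(data, two_byte=False):
    """Parse RFC 8285 header-extension elements from an extension block body.
    One-byte form: 4-bit id, 4-bit (length - 1) -> at most 16 bytes of data,
    which is why it cannot carry a large dependency descriptor.
    Two-byte form: 8-bit id, 8-bit length      -> up to 255 bytes."""
    elements, pos = {}, 0
    while pos < len(data):
        if data[pos] == 0:                    # padding byte between elements
            pos += 1
            continue
        if two_byte:
            ext_id, length = data[pos], data[pos + 1]
            pos += 2
        else:
            ext_id, length = data[pos] >> 4, (data[pos] & 0x0F) + 1
            pos += 1
        elements[ext_id] = data[pos:pos + length]
        pos += length
    return elements
```

With the one-byte form, the length field tops out at 16 bytes of data per element, so a 95-byte dependency descriptor simply does not fit; hence the need for the two-byte form and `extmap-allow-mixed`.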
And so you do need to support two-byte extensions, which at the time Janus didn't. So I needed to implement that first in order to get it to work, because when I started testing, nothing worked, and it turned out that this was the issue. And then, once we have negotiated it and we start receiving the dependency descriptor as part of our RTP packets, we need to figure out a way to parse it. And this was really a nightmare for me. This is like therapy for me right now, because I'm sharing all this with you. And I actually wrote about this in a couple of blog posts, where you can see the nitty-gritty details. But just to give you an idea, basically it's, let's say, a mess. I will not use the other word. But basically you can see that this is a specification that was written by somebody who writes codecs, not a network specification, because all fields are variable length and often at the bit level, which makes it really a nightmare to parse sometimes. And as regards the specification itself, it's indeed quite flexible, because there are a few mandatory fields, like whether this is the start of a frame or the end of a frame, the frame number, and the template ID for those dependencies that we've seen before. But everything else is optional, which means that you can either have a dependency descriptor element that describes everything, so the whole context of the SVC stream, or just something that tells you the scope of the current frame. And when we look at how a dependency descriptor really looks, this is a simple parser that I created to basically debug things offline. And when we receive a keyframe, we typically have a 95-byte extension, which, if you know RTP, is a lot. That's basically almost 10% of the payload that you have. So it's really big, but that's because it contains a lot of information.
So if you start parsing it and serializing everything that you receive, you have information about the different layers that you have, spatial, temporal, and so on and so forth. DTI, I don't remember exactly what it stood for, but this is just the output of that tool. That's a lot of stuff. So blah, blah, blah, some more chains, some more stuff, the decode targets; I have some stuff about resolutions. And finally, we're done. Basically, all the parts that we've seen before were the media sender telling us: this is all the information that I use for this specific SVC context. So in this case, this was an L3T3, so three spatial layers and three temporal layers. And all that huge stuff that you've seen before is all the information related to chain dependencies, all that kind of very low-level stuff. And so if you want to use it, it's there. And then, at the end, it also tells you the resolutions of the three different spatial layers. In this case they were low, because I captured this right at the beginning, I think. And finally, it tells you that for this specific RTP packet, this is spatial layer zero, temporal layer zero, and it uses template index number one, which is indeed spatial layer zero, temporal layer zero. And this is the information that we need, because then, having a look at all the stuff that we've seen before, we know that the resolution for spatial layer zero is, in this case, this value over here; in practice, it would be something like 320 by something else. And this is it. And of course, not all dependency descriptors are so long; it's usually like that only for the meaningful keyframe packets. Other dependency descriptors will be much smaller, like only seven bytes, because they will only tell you, for instance, the temporal index of this specific packet. In this case, it is spatial layer zero at temporal layer zero. But I only know this because I received the big one before.
So I received, somewhere back in time, this huge chunk of information, because if I only receive this and I get template index six, what is six? Six relative to what? What does it mean? I don't even know how many layers there are. So you do need to have that information first if you want to make sense of all these smaller packets that you receive later, which means that when you start to implement this in a server, you need to keep state, which is not really true for simulcast or other things. I mean, it's partly true, but only in a very limited way. In this case, it does mean that any time you receive that huge packet and you parse it, you need to keep it somewhere, so that when you receive packets after that, you can reference it and use it for something. And the idea is that once I have knowledge of those templates and I receive a packet and I know that it is spatial layer X and temporal layer Y, then, as a server, I can decide whether or not I want to relay it or drop it. And you can do it the relatively easy way or you can do it the hard way. The hard way is figuring out all of those dependencies that we've seen before. I went for the easier way, especially right now: if the target is spatial layer 2, then relay everything related to spatial layers 1 and 0 as well, as long as the temporal layer is smaller than or equal to the one that I'm targeting. So I may be relaying more than I should, but at least I know that everything is there. What's important is that once you've parsed and used that information, you cannot drop it. You need to relay it anyway, because it's not only helpful to you, it's also helpful to the subscriber that is receiving that video stream, because they also need to know what is what. So you need to forward that extension as well.
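The "easier way" described above boils down to a tiny predicate. A sketch, assuming the layer indices have already been pulled out of the parsed dependency descriptor:

```python
def should_relay(pkt_spatial, pkt_temporal, target_spatial, target_temporal):
    """Conservative layer filter: relay every packet whose spatial and
    temporal layer are at or below what the subscriber asked for. This may
    relay slightly more than strictly needed, but all lower layers of the
    onion, and so all likely dependencies, are guaranteed to be forwarded."""
    return pkt_spatial <= target_spatial and pkt_temporal <= target_temporal
```

The hard way would instead walk the chain and decode-target information from the descriptor to drop exactly the packets nothing depends on; this predicate trades a little bandwidth for a much simpler server.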
And, very important, you also need to update the RTP headers accordingly, including the marker bit, which is what really drove me nuts the first time, because I had actually implemented all this and for a long time it didn't work, and eventually I figured out that the problem was that I was not updating the marker bit as well. And this is the reason, basically. So if we have a sequence of RTP packets related to different spatial and temporal layers, this is basically what it looks like from an RTP perspective, including marker bits. If I am dropping spatial layer 2 because I don't need it, then it means that I'm dropping some packets over here. So of course, for all the packets that I'm dropping, I need to update the sequence numbers so that they keep growing monotonically, because otherwise the recipient will think that they are losing packets, but they are not missing them; I am just dropping them because they don't need them. So I need to update the sequence numbers so that this is one, this is two, this is three, four, five, six, seven, etc. So I need to make sure that they know that they are not really missing anything. But I also need to update where I'm setting the M=1 marker bit as well, because this is needed for decoding, especially by Chrome. So in particular, you need to set M=1 on the last packet with the same timestamp. So since the timestamp now changes on the second packet, because that's the last packet with that timestamp over there, I need to set M=1 on that second packet before I forward it, or otherwise nothing works, basically. Sorry, wrong direction. And basically, if you want to test all this, with Janus or with anything else, of course you need a browser that supports all this stuff. And the kind of bad news is that at the moment I think only Chrome supports it. I don't know if other Chromium-based browsers support it too, but definitely Chrome supports AV1 as a codec.
And you can check that by using RTCRtpSender.getCapabilities: if you see AV1 in that list, you do support AV1 as a codec. But you also need support for the SVC functionality and, most importantly, the dependency descriptor. And the dependency descriptor is not offered by default, so I think you do still need to set a field trial like this. I don't remember right now if you can just munge the SDP to artificially put the extension in your SDP in order to make it work anyway; that I should double-check. But you may need to launch, for instance, Chrome with that flag over here, so that the extension appears in the extensions supported by the browser. When you do that, your browser is capable of encoding AV1 with SVC functionality and the dependency descriptor, which is quite important. And if you want to test this, I also made it very simple, because if you go to the online demos for Janus and you check the echo test demo, you can provide a couple of attributes: first of all, AV1 as the codec, and then a specific flavor of SVC, in this case, for instance, L3T3, to send three spatial layers and three temporal layers. When you do, some small buttons appear over there which allow you to check one thing or the other, which means that you will send the big AV1 SVC stream to Janus and Janus will send you back only what you asked for. So in this case, for instance, spatial layer one and temporal layer two, which is why my resolution is smaller and the bitrate is smaller as well. So by playing a bit with those things you should see the resolution changing and the bitrate changing; if it does, it works. And the same functionality is also supported in the video room, of course, which is the SFU to do video conferencing. So at least in theory you can have a complete video conference that is based on AV1 SVC as well; even though we haven't tested that much, it should definitely work. And I think this is it.
I'm not sure if we have time for questions, but before that, I also wanted to announce, sorry, I'm bothering you all, that JanusCon is back. JanusCon is our own Janus conference, a conference devoted to Janus and WebRTC in general, which will happen at the end of April in Naples, in the south of Italy. We have a few sponsors already, which I'm very grateful for. And the call for papers ends in about a week, so if you're doing anything interesting with Janus and WebRTC, feel free to submit a talk there. Tickets are also available for sale, and of course, if your company is interested in sponsoring, that would be great too. And that is all. I don't know if we have time for questions, because I didn't really check how fast I was going, maybe too fast or... Okay, so are there any questions? I see a couple. Generally, would you say that SVC is like the next generation of simulcast, or, looking at the future, will one of them replace the other, or will they need to coexist side by side? I mean, in general, if you look at, for instance... Oh, sorry, sorry. The question was: is SVC basically an evolution of simulcast, or does it make sense to have them both at the same time? Which one will be more important in the future? Which one is the technology to invest in in the future, maybe, as well? Functionally, they serve the same purpose, if you want, because I have the same demo for simulcast, and if you look at the demo for simulcast, it looks visually the same: you have the same buttons to say, I want high quality, low quality and so on. The differences are really just in how the thing is implemented. And in general, SVC is supposed to be more advanced than simulcast, of course, and probably more resilient as well.
But the main obstacle right now is related to what I was saying before: right now, if you want to use AV1 SVC, you have to use a custom flag, which means that right at the outset it's really not something that you can ask your customers to do, for instance. So for the moment, it's not really something that is production-ready. You can use the SVC flavor of VP9, which provides a similar feature and is available out there now. But still, simulcast is overwhelmingly favored in general for production environments, because it's been battle-tested, it's been there since day one, everybody supports simulcast, it's easier to work with, and so on and so forth. So for the moment, it doesn't make sense to just force SVC in your production environment right away, if not for experimental purposes and for testing how it works, for dipping your toes in the technology. But for the future, I definitely think you should pay attention to it, because AV1 will hopefully be the codec that everybody adopts: it's better quality, it's royalty-free, it's open source, and it has SVC baked in. Sooner or later, hopefully Safari will have AV1 as well, Firefox will have it, Edge and other browsers will have it as well. And you definitely want to be ready when that happens, because otherwise you'll be the one stuck with the old codec while everybody else is taking advantage of the new thing. From the audience: I think you can munge the SDP to make it work. For the extension, yeah, because we have it working that way. Another comment from the audience: there is one thing that in some environments might be relevant, which is that many hardware decoders don't cope with SVC, but they do with simulcast, because simulcast streams look like normal streams. So if you're on a resource-constrained device, maybe receiving SVC is no bueno, but receiving a normal simulcast stream will be better. But in theory, this will not be true for AV1, because AV1 was conceived with SVC in mind.
So in theory, all hardware decoders, even smaller ones, will know how to interpret that, and since it's a single stream, they will be able to decode it. Of course, it's just theory and... Ideally they would. For VP9, for example, Chrome still does not use hardware decoders when you use SVC. And I'm not sure about AV1, because AV1 hardware support is still hit and miss. And there was another question here, yeah? Yeah, I was wondering what the forward error correction strategy here is. Sorry: if forward error correction is used, how do you use it with this? Because if you use forward error correction with SVC and then drop some packets, doesn't it stop working? Yeah, that's a good question, and it's actually related to one of the doubts that I have about FEC. Mostly because something like AV1 SVC, and simulcast as well, only makes sense when you have a server in the middle. It doesn't really make sense if you are sending something from point A to point B and point B is the one that is meant to receive it, because in that case you are sending everything anyway. Unless you are using SVC as some sort of redundancy mechanism, because you say, if I lose some packets related to layer 2, I can still display layer 1. That's one thing, but that's not really what it's meant for. And so, the moment you have a server in the middle, it also means that you can offload the forward error correction stuff to the server as well. Which does make sense also because, for instance, when you use FlexFEC, which is the thing that was described in the first presentation from Chrome, Chrome by default will not put in any redundancy information: it will not send any FEC packets until the peer tells it that they are losing some packets.
And this is to optimize things, so you don't add redundancy unless it's needed because there's loss reported. Which becomes a problem if you're doing something like a video conference, because your uplink may be perfectly fine, and then you have subscriber X over here who is experiencing loss, and you don't have any redundancy packets to send them instead. So the idea, and probably the solution to that (this is something that I'm still brainstorming myself, because FEC interests me, but I have some doubts there), is that the forward error correction stuff is probably something that the server itself will need to add on each subscriber leg. So from the server to you, I will have a dedicated FEC channel where I add some forward error correction for the stream that I'm sending you. And for the stream that I'm sending you, layer 2 may not be there, but I have a consistent stream because packets are in sequence, and so the forward error correction that I'll be sending you will be different from the one that I'll be sending to somebody else who is receiving additional layers. That's probably the only way to do this if you don't want to forward FEC end-to-end without touching it, which anyway wouldn't be useful at all, especially if the sender is not providing that information themselves. Yeah, in my experience, and this may be an implementation choice, of course, I did have to forward it, because otherwise it would not be decoded properly, basically. And I don't know if this is actually really needed. For instance, even the marker bit set to 1, that's not really needed from a specification perspective, because as a receiver you do see that the timestamp is changing, so you do know that it is a new frame and you can decode the previous one. But Chrome simply expects that marker bit set to 1, otherwise it will not decode the frame, basically. So in my experience, you need to forward that information too.
And I guess it makes sense, because the recipients themselves may also need to decode the video stream differently depending on what they are receiving, because they need to know whether the resolution must be this size or this size or this size, or something like that. It may all be part of the AV1 bitstream, so it may be redundant information as far as they are concerned, but at least when I made these tests a few months ago it was needed, so just relaying it makes sense. Yeah. Regarding switching layers: I saw your previous talk somewhere, which was on bandwidth estimation. Maybe you can comment on how they go together, or is there something specific to AV1? Yeah, the bandwidth estimation stuff is important for a few different reasons. And in this case, I'm talking about bandwidth estimation on the subscriber side, so from server to recipients, because on the publisher side there is transport-wide CC, and basically the browsers themselves are capable of using that feedback to figure out if they need to send less or more. And so, dynamically, you may see that some spatial layers are not appearing because the browser doesn't have enough bandwidth for that. From the subscriber perspective, it's really useful because it helps with the decision. So, for instance, up to now I just mentioned generically whether I want to relay or drop a packet, but this actually depends on why I should relay it. A user may want to receive the highest quality possible; a user may want to receive the lowest quality possible, maybe because the video is only going to appear in a thumbnail, so they don't need the whole thing, and that's an application logic decision. But the decision may also come from the fact that the user doesn't have enough bandwidth for all of that stuff: they don't have enough bandwidth for spatial layers 2 and 1, so let's just send them spatial layer 0.
And this is where bandwidth estimation helps, because if I'm sending stuff to the subscriber and I start to get information that congestion is happening, then internally the server can update which spatial layer or temporal layer I should send to this specific subscriber, dynamically. And so this will impact my decisions to relay or drop stuff, and it allows me to dynamically impact the quality of the subscriber depending on how much bandwidth they have. In my experiments right now I've only done this with simulcast, because I haven't hooked it up to SVC yet, but the key principles are really the same. One minute? Yeah, just related to that: with WHIP and WHEP, is there a way to signal simulcast on the publisher and the subscriber side? Yeah, you mean for simulcast or SVC? Of course, yeah. So, with WHIP and WHEP, is there any need to signal simulcast or SVC as well, and does it make sense? In general, it's definitely important that you signal it on WHIP, because you want to make sure that the stream that you are ingesting is recognized by the server as a simulcast or an SVC stream, so that the server can also parse those dependency descriptors, in case it's AV1 SVC for instance, or, in case it's simulcast, it knows that it needs to take care of, let's say, three different qualities. On the subscriber side, for simulcast it's really not important, because as a subscriber you're just always going to receive one video stream, and as far as you're concerned it's a consistent video stream. You don't even know that there is a switch happening behind the curtains from high to low to medium or whatever. You just see a single video stream, so you don't need to be aware of the fact that it's simulcast.
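The last step, turning a congestion signal into a target layer, can be sketched as below. The per-layer bitrate table is entirely hypothetical; a real SFU would derive it from measurements of the incoming stream or from application configuration:

```go
package main

import "fmt"

// pickSpatialLayer maps a bandwidth estimate to the highest spatial layer
// we can afford. layerBps[i] is the assumed cumulative cost of relaying
// layers 0..i (hypothetical numbers, for illustration only).
func pickSpatialLayer(estimateBps int, layerBps []int) int {
	chosen := 0 // always fall back to the base layer
	for sid, needed := range layerBps {
		if estimateBps >= needed {
			chosen = sid
		}
	}
	return chosen
}

func main() {
	layerBps := []int{150_000, 500_000, 1_500_000}
	for _, est := range []int{100_000, 600_000, 2_000_000} {
		fmt.Printf("estimate=%d -> spatial layer %d\n", est, pickSpatialLayer(est, layerBps))
	}
}
```

The application-logic cases the talk mentions (thumbnail wants low quality, main speaker wants high quality) would simply clamp this result further before it feeds the relay/drop decision.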
For AV1 SVC, instead, it may be important to negotiate the dependency descriptor extension, as I mentioned, because if it's needed for decoding purposes and you want the browser to be able to decode things properly, then you may want to negotiate that extension on the subscriber side as well. But as I was saying before, it may or may not be needed, so that's something that we'll have to check. And I think I'm really out of time now, so... Thank you. Thank you.
Using GStreamer to build real-time applications with Golang
All right, well, welcome back, everybody. Up next, the one and only Dan Jenkins is going to tell us all about GStreamer and Golang. Take it away, please. Thank you. Hello, everyone. Can everyone hear me okay? Yeah? Good. Great. Cool. Okay. I forgot my clicker. Number one rookie thing to do. No, no, I've got my phone, so I'm good. But yeah, that's why I've got my phone, and it's going to look a little bit weird. I also forgot... I brought two European plugs with me, but one wasn't European, one was American. So my day did not start off well. So yes, GStreamer and Golang. A little bit about me. Oh, that's just going to get really annoying; I'm just going to click. Cool. Okay, so a little bit about me. So yes, I'm Dan Jenkins. I run a couple of companies: one called Everycast Labs, one called Nimble Ape, and another one called CommCon. Everycast Labs does broadcast stuff, bringing remote talent into broadcast workflows. Nimble Ape is a consultancy company based in the UK. And then CommCon is an event that we put on for open source people, our way of kind of giving back to the ecosystem that we build from. I was the very first Google Developer Expert in the world when it comes to WebRTC. I'm not saying I'm the best at WebRTC, but I'm the first that actually got accredited by Google's developer program. I love Lego, and I love real-time media. So yeah, Nimble Ape, we're a consultancy, and if you've got hard problems that you want solved, come talk to us. And Everycast Labs, we've got that product that I was just talking about called Broadcast Bridge. And then CommCon. CommCon is dear to my heart. Historically, it's been a residential event where everyone stays in the same place, and then we've got three days of awesome real-time and open media content. And we're back in 2024. Dates are still up in the air because of contracts, but it's not going to be residential this year.
We're going to go on tour, so we're not just going to be in the UK, and that's quite exciting. So, to the actual topic: GStreamer, building real-time applications with Golang. What are we actually going to talk about? We're going to talk about GStreamer, obviously. We're going to talk about Golang, obviously. But I want to introduce you to something called go-gst. go-gst has been around for a long time now, but it kind of got itself into a bad state where it was not unmaintained, but there were lots of little forks and lots of little patches everywhere. And so we've kind of changed how that project is being managed now. And then I also want to introduce you to something called Pion. So let's take a look at GStreamer first. Who in the room has heard about GStreamer? Good, that's the answer I was looking for. It's an open source multimedia framework that basically does everything that you chuck at it, in some form. And I absolutely love GStreamer. A lot of you might know GStreamer as something like this. I'm not going to ask you to tell me what that is, because I know that it's taking in an RTSP source, doing something with it, and then outputting something at the end via UDP, with all the stuff in the middle. But GStreamer is actually super powerful and ultimately lets you do ingress, do something with the media, and then egress. And it kind of boils down to something that's simple, right? GStreamer can do it all and can do a lot of things. So for us at Everycast Labs, with our Broadcast Bridge product, we care about certain things. GStreamer can do NDI, GStreamer can do WebRTC, GStreamer can do SRT, it can do RTP, it can do HLS, it can do RTMP and RTSP, right? I'm not telling you anything that you don't know at this point. But for us, at least with Broadcast Bridge, GStreamer has a superpower, and that superpower is app source and app sink. How many people in the room know about app source and app sink? Okay, good.
That means like 60% of you are going to learn something now. The rest of you, just sit and be happy. So yeah, this is what we use in our Broadcast Bridge product. And that's because we don't write C, and so ultimately adding code to plugins within GStreamer is really difficult for us. I know that's changing as time goes on, and there are more and more Rust plugins, but at its core there's a load of stuff that we don't feel able to contribute to if we find a problem. So a lot of the time we don't like writing C like this, but we do like writing a lot of Go. And so we end up writing something like this. And this is go-gst. It was originally created by a guy with the GitHub handle tinyzimmer. I love the name. But now it's in its own GitHub organization, under a new GitHub org, and there are three main contributors. I think there are something like 17 in total, but there are three main ones: tinyzimmer, me, and RSWilli. And this other one, Big Little Ben, is from the LiveKit team. The LiveKit team had their own fork of go-gst, and they had put a load of work into fixing bugs, but those fixes were never getting merged back into the project while it was under the tinyzimmer GitHub. So now it's moved out. Well, it's not actually forked: we forked it into its own organization and then did the GitHub magic where we unforked it, and the tinyzimmer one is now a fork of us. So there's a lot of GitHub organization going on to make it easier for everyone. Did you know that GitHub forks don't turn up in Google SEO results, and they don't turn up in GitHub search results either? And search doesn't work in the forked repo itself. So basically, forks are dumb. I mean, they're not dumb, but forks are bad. We should not be relying on forks for a long-term thing whatsoever. So yeah, this is actually really great for everyone now.
So, less forks is better for everyone. And like I said earlier, Broadcast Bridge uses a mixture of SRT, NDI and WebRTC, among a load of other things as well. So why, you're probably asking, would we even need to use app source and app sink when the plugins are already in GStreamer? GStreamer already knows how to take in an SRT feed, it already knows how to output an NDI feed, and it knows how to do WebRTC stuff. So why are we building on top of app source and app sink? And it comes down to greater control, like I was kind of alluding to earlier. We use Pion to do WebRTC. And that's not because the GStreamer implementation isn't good. It's just that if we want to do anything that isn't implemented in the GStreamer implementation, then we'd need to get someone to actually go and change that code, and that's something my team aren't capable of doing. But we do know Go really, really well, and so we can definitely go and take that greater control. Like I said, this means we're handling WebRTC in something that we really know. Ultimately, very few people in this room know about transcoding something from one codec to another; we just rely on FFmpeg or GStreamer or whatever to do it for us. It's the same with WebRTC for us: we really know what we're doing with WebRTC, and we want to be able to tweak things that we can't necessarily tweak with the GStreamer implementation. But Pion is hugely, hugely powerful. And this is the other key thing: it's easily upgradeable. So when we actually find a bug in Pion, we can fix it ourselves, rather than being stuck inside a GStreamer pipeline and never leaving the C level. But cost isn't just measured in terms of compute. Cost is everything from building the feature all the way through to deploying the feature and running the feature. And you've got to look at the whole picture.
Pion gives us huge, huge flexibility: we can move fast and we can add new features, and ultimately that means that we win business. So let's take a quick look at app source. How many people are actually familiar with app source? Right. So app source is just another plugin, module, whatever they're called. Ultimately, you can put it inside of your pipeline and you can push data into GStreamer using app source. You set a load of capabilities on that app source element, telling it: this media that I'm just about to push into you is this format and this frame rate and whatever else. And you can push in data (you have to push data in, obviously), but you can also make GStreamer ask you for the data. So instead of you just going, oh, I've got data, data, data, data, data, and then GStreamer going, oh, hold on, I can't do anything with this, why are you sending me so much data? GStreamer can actually ask for it. Now, that's not hugely helpful when it comes to real-time applications, because in the case of Pion sending us... getting RTP data from Pion, for example, that's real-time. And so we want to get that data from Pion and we want to pass it into GStreamer straight away, because we're getting it in this constant flow from Pion. Whereas if you were reading a file and then passing those chunks into GStreamer, well, you've got control over how fast you push those chunks in. And so why not let GStreamer go: ah, I want a bit more data, I want a bit more data, I want a bit more data. Right. App sink is absolutely no different. It's a plugin, it's a module, and when you put it into the pipeline, it becomes an element. And ultimately you get data pushed out of app sink. So imagine you've got app source, and then you've got something in the middle, whether or not that's transforming it or transcoding it.
And then you've got app sink, and you're connecting all these bits together. So you're pushing data in, GStreamer is then doing something with it, and then it's passing it over to app sink, and app sink sends it out to your application as data. Not as UDP, not via RTP or anything: it's giving you the raw buffer of data. So you get your data pushed to you from app sink via the new-sample signal and event: I've got some data, here you go. Notice how this is all Golang. So yeah, let's take a very quick look. We've got our sink, so that's an app sink element that I've made, and I'm setting some callbacks on it. Then we've got the new-sample func, and that gives me my sink. And as a return, I'm going to tell it what the flow state is. So I pull the sample, and if the sample isn't nil, then we carry on; if it is nil, then I'm returning that we are at the end of the stream. And then the buffer: so we get our sample, we're pulling the sample, and then we're getting the buffer out of that, and then ultimately reading some information from that buffer map, changing it from big-endian to little-endian, I think, or something, and then doing some stuff with it, doing some maths on it. Not a lot of useful information there, in terms of what am I actually then going to go and do with it. At the moment, it's just printing out the RMS, but then you can go off and do whatever you want with it. For us, that means getting video and audio data out of GStreamer and chucking it into NDI. Oh, Dan, why are you not using NDI within GStreamer? Well, I'll tell you. Number one, when we did our NDI integration, GStreamer didn't have NDI. It was completely separate, it was a different repo, and it wasn't part of the GStreamer Rust plugins.
And then number two, we do extra stuff that GStreamer doesn't know how to do yet. We grab tally information from NDI, and to be able to do that, you need access to the underlying NDI sender. So there's stuff that GStreamer can't do yet, stuff that we actually want to add to GStreamer, so that we can stop sending things via the NDI SDK directly and just let GStreamer deal with it for us. But again, it goes back to that cost analysis, right? At the moment, we can get that data out of GStreamer using app sink and chuck it out via NDI. We can do that, and it's relatively cheap. But then there's a load of extra work for us to go in and figure out the right way of doing it in GStreamer, so that tally information becomes available as a signal, for example. So yeah, for us, this means that we have to handle RTP and RTCP from Pion. Because within WebRTC... WebRTC is made up of lots of standards, but ultimately the media is RTP. And the bit that tells you what the quality is, and everything else that goes along with it, is RTCP. And it's very easy to forget about things that are very important when you don't deal with them. Like RTCP. The SFU people in the room will go, ah, you could never forget about RTCP. But as a web developer, the browser deals with all of this for us. And so it's very easy for us to go: ah, RTP, I'm going to get my media, I'm going to get my media. And then everything works really, really well when you're in a really nice network environment. But then you chuck in a real-life scenario, and the audio and the video go terrible. Why did the audio and video go terrible? Because there's no RTCP feedback mechanism to go: ah, something's going wrong. But yeah, GStreamer makes all of this easy. And very quickly on this very specific thing: we use rtpbin within GStreamer. So that's that middle bit for us. We use app source, chuck it into rtpbin, and then we do a load of transcoding and stuff as well.
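As a reminder of what "the media is RTP" means at the byte level, here is a minimal sketch of reading the fixed RTP header fields a relay actually looks at (version, marker, payload type, sequence number, timestamp, SSRC). It is illustrative only and deliberately skips CSRC lists and header extensions:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// rtpHeader holds the handful of fixed-header fields (RFC 3550) that an
// SFU-style relay cares about. Not a complete RTP parser.
type rtpHeader struct {
	payloadType uint8
	marker      bool
	seq         uint16
	timestamp   uint32
	ssrc        uint32
}

func parseRTP(b []byte) (rtpHeader, error) {
	if len(b) < 12 {
		return rtpHeader{}, errors.New("short packet")
	}
	if b[0]>>6 != 2 { // RTP version must be 2
		return rtpHeader{}, errors.New("not RTP v2")
	}
	return rtpHeader{
		marker:      b[1]&0x80 != 0,
		payloadType: b[1] & 0x7F,
		seq:         binary.BigEndian.Uint16(b[2:]),
		timestamp:   binary.BigEndian.Uint32(b[4:]),
		ssrc:        binary.BigEndian.Uint32(b[8:]),
	}, nil
}

func main() {
	// Hand-built packet: version 2, marker set, PT 96, seq 42, ts 1000, SSRC 7.
	pkt := []byte{0x80, 0xE0, 0x00, 0x2A, 0, 0, 0x03, 0xE8, 0, 0, 0, 0x07}
	h, err := parseRTP(pkt)
	fmt.Println(h, err)
}
```

This is exactly the kind of bookkeeping rtpbin does for you, together with the RTCP side (receiver reports, jitter, loss) that is so easy to forget when the browser normally handles it.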
And then we get app sink. rtpbin is magical. If you deal with RTP at all with GStreamer, then you need to be using rtpbin. There's a lot of text there, but ultimately, it implements everything you need to be able to handle RTP and RTCP and demuxing of payloads. It's just a very nice all-in-one thing that deals with everything using all of the separate plugins, but it pulls it all together nicely for you. So for us, that's connecting the app source and app sink pads to rtpbin. And you'll notice I say pads. You can see up at the top there, rtpbin: we're requesting a pad from rtpbin in that format, so that's a receive RTCP sink pad, and then we're also requesting a send RTCP source pad as well. We then go and make a new app sink and a new app source, and you can see they're labeled RTCP app sink and RTCP app source. We then add those to our pipeline, because otherwise nothing works: all of your elements have got to be in a pipeline. And then we link our RTCP app source: RTCP app source, get static pad "src", link it to the RTCP sink pad. Yes. So, sorry: I'm grabbing the RTCP sink pad from the rtpbin, and I'm linking the RTCP app source over to it. That's basically saying that rtpbin and I are going to exchange RTCP information via those pads, so that I can then grab that information and send it back via Pion up to my WebRTC peer. So you'll get RTP in, in this case, into rtpbin, but you'll get RTCP in and out: you'll get told RTCP and you'll also send it back out as well. And like I say, don't forget about the RTCP. As you can tell, I forgot about the RTCP and ended up doing certain demos and going, ah, look, it's really great. And then someone went and tried it on a really crappy internet connection and went, no, Dan, it doesn't work, and made me look rather foolish. So you end up with something looking like this.
So, does everyone know about the dot graphs that you can generate from GStreamer? A couple of nods, not that many. So, within GStreamer, you can tell it: I want you to export a dot graph file on a state change or whatever; you've got control over when it generates it. For me, when we've got debugging enabled, we enable dot graph generation whenever state changes. And so, ultimately, this looks really small and dumb, but it's a PDF, so you can go in and look at it in high-quality detail, because it's not a PNG. You've got lots of options: the dot graph can be converted into lots of different formats. But the really cool thing about dot graphs is that they tell you what's connected to what, and so they're really great for debugging. And so for us, we've got our two app sources. One is RTP, which is this one, and this one is RTCP. And you can see... I'm coming off the camera, I'm sorry. So you can see that this one is set with capabilities to say that this is RTCP, and this one is set with capabilities to say this is RTP. And you can see those are linked to pads within a GStreamer rtpbin. Those pads are then connected to an RTP session. The RTP session is then connected to a demuxer. The demuxer is then connected to a jitter buffer. And the jitter buffer is then able to go: oh, well, this RTP stream that I'm receiving is both audio and video; once it's demuxed it, it automatically goes, ah, here's the video and here's the audio. Right? And then it chucks it back out and creates some pads for me, which I then connect over to... well, there's an app sink up there, and that's my RTCP app sink. But then you can see here that it's then connecting Opus and VP8 out into my pipeline.
And then this is the rest of the pipeline, which we don't care about, but I get told it's Opus and I get told it's VP8, and so I'm able to decode it and do stuff with it, whether or not that's outputting to NDI or whatever. At the end of it is an app sink for sending out via NDI. So, we got into Go purely because of Pion, and Pion gives us loads of control. It's basically WebRTC in pure Golang. If you ignore the fact that WebRTC does lots of actual media stuff and look at just the network portion of it, sending data from here to there, then it's pure Golang. So yeah, you can do any of this with any of the GStreamer bindings, or you can just, you know, do it with actual GStreamer C. I mean, who actually wants to do that? I don't know. But you can go and use whatever bindings you want. There are really nice bindings for Python and Rust; I haven't used any of the others. I've definitely used the Python one and the Rust one myself, and the Golang one. I went on the website this morning to take the screenshot and I was like, oh, where's the Golang one? So here's the pull request to add it to the list. So if you've got a problem and GStreamer doesn't quite solve that problem, that's what this talk is about: the fact that you can make GStreamer do what you want it to do using app source and app sink. You can build it yourself with app source and app sink. So why GStreamer, why not FFmpeg, whatever? GStreamer does everything that we need it to do. It has a fantastic, super friendly community. And ultimately it's just super flexible and does exactly what we need it to do, which is not something that we felt, as a team, FFmpeg would give us. For example, GStreamer has a lot of scaffolding, let's say, and gives us an awful lot for free.
Whereas FFmpeg is a little bit more work, right? So my last message is: GStreamer for the win. Don't wait for others to build your plugin for you — you can go and build it with GStreamer, appsrc and appsink. And that's me. Thank you very much.
Build your ENUM LCR Server using CGRateS
I hope you can hear me. First of all, thank you for having me this year at FOSDEM. My name is Saber Katelari. I'm a core developer at ITsysCOM, and today I'll be showing you how you can build your own ENUM LCR server using CGRateS. First, something about our company: it's located in Bavaria, Germany, with back offices in Romania and Albania. We have over 17 years of experience in architecting server-side solutions in Voice-over-IP environments, with platform implementations covering both wholesale and retail business categories, so by now we are used to real-time processing constraints and serious live-system outages. Something about CGRateS: it's a real-time enterprise billing suite — more like a framework, since it can do many things. It's pluggable into any existing infrastructure and non-intrusive into existing setups, so it does not force your decisions: it's all up to your system admin whether to take into consideration what CGRateS gives you or just ignore it. It's been open-source software since it was born in 2010, with the first sources published in 2012; the full sources are available on GitHub, 100% in Go. We always mention Go because when CGRateS first started, Go was still in its first weekly releases — we were one of the first adopters of Go, and we also paved the way for other people coming after us. We have no add-ons in private repositories, and we take community contributions into consideration as well. About the engine: it's performance-oriented, with a built-in advanced caching system with transactional least-recently-used and TTL-expiring records. It's asynchronous, processing with micro-threads — if you know Go, you probably know more about this — and it also includes an API load balancer. We have three branches: v0.10, master and 1.0. v0.10 is our most conservative branch; master is where we have our most recent developments.
And 1.0 — we call it the pinnacle of what CGRateS can do, but it's still in early development. We have a test-driven development environment with over 10,000 tests as part of our testing suite: unit tests, integration tests, and also call tests against switches. It has a modular architecture which is cloud-ready — microservices with a rich set of RPC APIs, because everything in CGRateS is API-driven — and it's easy to enhance by rewriting specific components; if you want to rewrite the engine in some other code, for example, you can easily do so. Some of CGRateS's features: you can run an online/offline charging system. You get multi-tenancy from day one — this is more for white-labeling platforms. Multiple databases are supported — to mention some: MySQL, Microsoft SQL Server, SQLite, MongoDB, Postgres, and also our internal database, which is compatible with everything we do; this is a pretty challenging job for the relatively small team that we are. You can do real-time configuration reloads, so you can reload your configuration without having to shut the engine down and start it again. There's a rating engine with derived charging and ENUM rating. There's account balance management with bundles and DynaPrepaid — with DynaPrepaid you can create accounts on the fly and give them restricted or limited permissions on your system. There's session and event charging with balance reservation and refunds — this is prepaid logic. STIR/SHAKEN authentication, which is mostly for North America. And CDR logging with support for interim records and rating queues — this is when you have your CDR server sitting in a black box, have it communicate with your switch, and get your CDRs rated at the end within a matter of milliseconds, without the switch needing any database on its side.
You can have a high number of interfaces for event readers and exporters — to mention some: AMQP, SQS, SQL, CSV, XML and a couple more. You can have fraud detection with automatic mitigation; LCR with quality-based stats and bundles; and call statistics with pattern monitoring, so you can get your ASR and ACD live out of CGRateS, and in combination with your proxy you can get your average call cost and your total call cost. You can have dynamic pricing imports with templates — all suppliers have different formats, and CGRateS is compatible with most of them. You can use it with Diameter, or with RADIUS if you need authentication and Wi-Fi authorization; with DNS if you need ENUM LCR routing, which is the topic for today; and you can also have a basic SIP server that can do redirects, so you can have it redirect traffic from your switch with some routing and IP addresses. What else — we have resource allocation and control, which is like virtual channel limits for your customers. You have the API server with GOB-JSON and HTTP-JSON support, built-in high availability with dynamic partitioning support, and an API capture and analysis service — something like an internal analyzer for CGRateS. Clustering through remotes, replication for the internal cache and database, and data versioning with automatic migration — when you need to move between releases on the same branch, you can do so with data migration. And we are also agile in developing new features, so if you have some feature or some idea that you want to bring us, you are more than welcome to do so. This is an internal diagram that we have for CGRateS: it basically shows how CGRateS's components and interfaces communicate with each other. On the left side you can see all our interfaces.
You might notice that we don't have OpenSIPS in there, because OpenSIPS has its own native CGRateS module, which is faster and better than anything we could do, since it's native to OpenSIPS. If we take one example — the DNS agent, on your left — you can see that it communicates with sessions, which is our main subsystem, and through there it can communicate with every component: all of them, or just one. It all depends on what you want to do with CGRateS. For some use cases: again, online/offline charging — you can have highly configurable rating bundles for voice, data, SMS, MMS, monetary or anything else (in 1.0 you can really charge anything else), concurrent-session handling, and a centralized CDR server; all of this together is what others call an online/offline charging system. Another use case is a dynamic routing system, where you use the dedicated subsystem for various routing strategies. There we can mention load balancing — the difference with our load balancer is that we don't work from test setups but only from real calls, since we get that information out of CGRateS. You also get LRN support via attributes, bundle-supported routing, and quality-based stats monitoring with thresholds. Now, to get to the ENUM LCR server that is the topic. First we need to know about DNS. Most of you probably know, but DNS is something like an internet address book, where you query for something and get back information specific to what you asked for. The answer is categorized into record types. There are a couple, but we only work with these three: A, SRV, and NAPTR records. We work only with these because that's what most people need, and nobody has really asked for anything more than this.
To describe them shortly: A records convert domain names into IPv4 addresses; SRV records are for network services — you can find priority, weight, port and target for your SIP addresses; and, most importantly, NAPTR records, which convert ENUM names into URIs. But what is ENUM? ENUM is basically a standard to translate telephone numbers into URIs. Here's an example of how you do that. First you need an E.164 number. You convert your number into E.164 by removing any leading zero, adding your country code in front of it, with a plus at the very front. Then, to convert this E.164 number into an ENUM name, you remove the leading plus, reverse all the digits, add a dot between each digit, and add a suffix. The suffix in this example comes from the RFC standard, but in CGRateS we don't really care what you put in your suffix — in my example I even replaced the arpa label with the account string that I will use later. The DNS agent, which I also mentioned earlier, is an interface — like a middleware: your DNS client communicates with the DNS agent, which sends that request on — maybe you can see it in the diagram — into sessions and from there to any component, and then the answer goes back to the DNS client. In terms of capability: in the DNS agent we implemented our own DNS server, service and listeners, and you can have as many listeners as you want, all open at the same time, over the UDP, TCP and TLS protocols — so it is highly configurable and concurrent. Again, for query types we support A, SRV and NAPTR. Now for configuration — this goes in your configuration files. You open a new field, name it dns_agent; this is JSON, everything in the configuration is JSON.
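The number transformation just described is mechanical enough to sketch in code. A rough Python illustration — the Belgian example number and the helper names are my own, not from the talk:

```python
def to_e164(national_number: str, country_code: str) -> str:
    """Convert a nationally formatted number to E.164: drop the leading
    zero(s) and prepend '+' and the country code."""
    return "+" + country_code + national_number.lstrip("0")

def to_enum_name(e164: str, suffix: str = "e164.arpa") -> str:
    """Convert an E.164 number to an ENUM domain name: strip the '+',
    reverse the digits, separate them with dots, append the suffix."""
    digits = e164.lstrip("+")
    return ".".join(reversed(digits)) + "." + suffix

# A hypothetical Belgian number, country code 32:
number = to_e164("0486123456", "32")  # "+32486123456"
name = to_enum_name(number)           # "6.5.4.3.2.1.6.8.4.2.3.e164.arpa"
```

As the speaker notes, the `e164.arpa` suffix comes from the RFC, but CGRateS accepts any suffix — his demo swaps the last label for an account string.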
You name the new dns_agent field and enable it — by enabling it you allow it to receive and send API calls. Then you define listeners — you can see it's a list, so you can have as many listeners as you want. You give each one an address, an IP and a port. In my case I use an empty IP, since if it's left empty, CGRateS fills in what's in the defaults, which in this case is just localhost. For the port I put 2053; if left empty, it would be filled in by the default, which is 53. And for that address I need to attach a network — in this case I use the UDP protocol, and again, if left empty, it will be UDP by default. After that I also want to be open to TCP listeners, so I create the same address but change the protocol. This doesn't mean that either one or the other will work — both of them work at the same time. (Something is messed up over there — they should be on the same line for the last one.) For the TLS address, since I cannot have TLS and TCP on the same address, I put it on a different port for this example. After you've finished with the listeners, you connect your DNS agent to sessions, and you do that using sessions_conns. You can have either localhost, internal, or some other connection configured by you. I use localhost in this case, since I want to trace the packets going between sessions and the DNS agent; you can switch it to internal if you want a faster connection or if you don't need this debugging, this packet tracing. Then, in that same dns_agent field, you put request processors. To explain shortly, request processors hold the logic of what's going to happen after a query is made to your server. You can have many request processors; in this case I'm only showing one, and this is what happens with it. First we define an ID for it, which has to be different from the other request processors'.
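Put together, the listener setup described here would look roughly like the following JSON fragment. This is a sketch from memory of the shape the speaker describes, not a verbatim copy of his configuration — consult the CGRateS documentation for the exact field names:

```json
{
  "dns_agent": {
    "enabled": true,
    "listeners": [
      {"address": ":2053", "network": "udp"},
      {"address": ":2053", "network": "tcp"},
      {"address": ":2054", "network": "tls"}
    ],
    "sessions_conns": ["*localhost"]
  }
}
```

Note how the empty host part of each address falls back to the default (localhost), and how UDP and TCP share a port while TLS gets its own.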
It doesn't matter what you put inside, it just has to be different — so in this case I'm describing what this processor does, which is NAPTR least-cost route. After that you define filters. Because I want to find the least-cost route — a SIP address for my query — I first need to be sure that the query type is NAPTR and that the leading country code is 32. This is just an example; you can have any filter you want. The first filter checks the query type in the request against the full NAPTR string, and if that's true, it goes to the second filter, which checks whether the query name starts with the prefix 32 — and before it does that, it converts that ENUM name into E.164. That's it for the filters. If those are true, we go to the next thing, which is the flags. In my case I want to create an event each time this query is made, so I put the event flag there, which calls the sessions ProcessEvent API each time this query matches. I also put the authorize flag, because I want to get the max usage when the query is done, and the routes flag, because I want to do least-cost routing with it. Next I put the log flag there, because I want to get some logs out of the query when it's done — the request and the reply. After that come the request fields: these are what you want to populate when the query is made. In this case I populate account, destination, setup time, type of record and usage, because I want to put them in my event later, and the event needs to use them. How do I populate them? I populate account from the query name by stripping away the first e164 label and everything before it, which leaves me with only the 1001 account, which I will show later — this way account becomes 1001. In destination I put the query name fully converted into E.164.
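The two filters just described — "is this a NAPTR query?" and "does the E.164 form of the name start with 32?" — amount to something like the following. This is a hedged Python re-creation for illustration; CGRateS expresses these as filter rules in its configuration, not as code:

```python
def enum_to_e164(query_name: str) -> str:
    """Undo the ENUM transformation: keep only the single-digit labels,
    drop the dots, and reverse the digits back into reading order."""
    digits = [label for label in query_name.split(".")
              if len(label) == 1 and label.isdigit()]
    return "".join(reversed(digits))

def matches(query_type: str, query_name: str) -> bool:
    """First filter: the query must be NAPTR. Second filter: the E.164
    form of the queried name must start with country code 32."""
    if query_type != "NAPTR":
        return False
    return enum_to_e164(query_name).startswith("32")

matches("NAPTR", "6.5.4.3.2.1.6.8.4.2.3.e164.arpa")  # True
matches("A", "example.org")                           # False
```

Only when both checks pass does the processor go on to the flags and the field-population steps described next.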
In setup time I put now, for the current time of the query; type of record is voice, and usage is one minute. The reply fields are what I want to reply to the DNS client with: an order of 100, preference 10, flags "u", service "E2U+sip", and — the most important part — the regular expression, which I build from the route parameters. I didn't show it here, but I created a routing profile beforehand, with two routes, and the information in those routes is two different SIP addresses: one of them is the higher cost, and the other is the lesser cost. And since I have that routes flag over there, those routes will be sorted using least cost. In the reply I want the routing parameters of the first index of the routes — and since the sorting is least cost, the first index is always going to be the least-cost route. Under reply you can see how I find it in the structure: I go into the routing profile, into the run ID *raw — meta is, in this case, the asterisk — iteration 0 of that ID; I go to routes, iteration 0 again, and then I take the value of the routing parameters, which is the SIP address it found, and populate it into the regular expression. After that I also put the replacement dot at the end. For the client I'm using dig: in this case I'm querying localhost on port 2053, with the type of record NAPTR, and you can see the ENUM number that I put there, with the 1001 account at the end. For the reply, I captured this using ngrep. You can see the API that gets called, sessions ProcessEvent. The flags are exactly the same ones I put in my request processors. The tenant is taken automatically from the default config, which is cgrates.org; the ID is some random number; the time is the current time of the query; and in the event, you can see, are exactly the fields I asked for in my request processors.
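The reply construction — sort the routes by cost, take index 0, substitute its SIP URI into the NAPTR regexp — can be sketched like this. The route data and the helper are hypothetical stand-ins for the routing profile described in the talk:

```python
# Hypothetical routing profile with two SIP targets, mimicking the talk's
# example: route2 is the cheaper one.
routes = [
    {"id": "route1", "sip": "sip:10.0.0.11:5060", "cost": 80},
    {"id": "route2", "sip": "sip:10.0.0.12:5060", "cost": 60},
]

def naptr_reply(routes, index=0):
    """Sort by ascending cost (least-cost sorting) and build a NAPTR
    record body: order 100, preference 10, flag "u", service "E2U+sip",
    and a regexp rewriting the whole query into the chosen SIP URI.
    index=1 gives the fallback (second-cheapest) answer."""
    chosen = sorted(routes, key=lambda r: r["cost"])[index]
    return '100 10 "u" "E2U+sip" "!^.*$!{}!" .'.format(chosen["sip"])

naptr_reply(routes)      # least-cost answer, built from route2
naptr_reply(routes, 1)   # fallback answer, built from route1
```

This also shows the failover idea mentioned at the end of the talk: a second request processor that takes index 1 instead of 0 yields the second-cheapest route as an additional answer.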
And that's just the request; on the reply side, I can see the reply from that API, where I find the max usage of 60 seconds — if you remember, I put one minute in the request. You can see it as 60 billion nanoseconds; CGRateS works in nanoseconds. I also have the reply on the routes-profile side. You can see that it found the routes account for 1001, the sorting that it used — LC, for least cost — and all the routes it found, sorted accordingly. You can see the route with ID route2, with the SIP address ending in 12 and the cost it would take, 60 units; and the second ID, which is more costly, with the SIP address ending in 11. And here we get the reply back from the DNS agent after it's done: you can see that it built a regular expression with the address ending in 12, which was the 60-cost-unit route, as you saw earlier. As another use case you can have a failover fallback — so, for example, you can have multiple answers over here. In my case I would just make another request processor, put a one instead of the zero over there, and it gets the second least-cost route that it finds — so you get the second answer as well. And that's about it. Any questions? I'm guessing not. If you have any questions you can also ask them in our Google group. Oh, sorry — yes? Going back to the request and the response: in the request you were getting an account ID. How are you figuring out the account of the person asking, according to DNS? Well, it depends on what you want to do. In my case I just put it in my request, on the DNS client over there — you can see it at the end, that 1001 — so I give it that account ID myself. Okay, so you're giving each customer a phone top-level domain name. Whatever you want. Any other questions? Okay. Thank you.
A Game Boy and his cellphone
So, ready to start? Yeah. Okay, so we now have Esteba with "A Game Boy and his cellphone". Hello, and thanks for being here for this talk about a Game Boy peripheral that I think is very interesting and versatile. I'm Esteba, and I've been working to emulate and restore this peripheral on and off for the last six years or so. But first, I should tell you what it is. The Mobile Adapter GB is a peripheral that allows you to connect your Game Boy to your cell phone, allowing games to make and receive calls — and in particular to also call an internet service provider and connect to the internet — allowing all sorts of online connectivity, like sharing scores and getting updates for various things. It was one of the very first attempts by Nintendo at any sort of online connectivity for consoles, but what makes this one very interesting, in my opinion, is that it was supported by a few rather high-profile games. There were actually a few variations of this adapter, made for several different phones: you have a blue, a yellow and a red one. A green one, for PHS, was also planned but never released. But what you will notice is that none of these actually work with any non-Japanese phones. So this service never left the islands, and unfortunately it was sunset very early, almost two years into its life, in December 2002. To give you a better idea of what this peripheral could do, we'll talk a little about the games that supported it. First of all, you got the Mobile Trainer with the adapter. It was used to configure the adapter, and you had to use it before you could connect to the internet. It came with a very useful usage manual, but it also had some very interesting utilities: a mail client, which supported both SMTP and POP and could communicate with the outside world, so you could actually receive real emails; and a very minimal web browser, which was hard-coded to one website, to read news about Nintendo games and the games for this peripheral.
Now, the very first game released for this thing was Pokémon Crystal — a very popular franchise that I'm sure you're familiar with — and it was actually one of the very first times you were able to battle and trade online with your friends, or at very large distances at least. Besides that, it also featured a Battle Tower, which allowed you to fight people who had entered the tower previously. It got localized with NPCs in the west, but the Japanese version worked with this adapter. You also had a Trade Corner, which is a bit of a prototype of the Global Trade Station that appeared later, in Generation 4. And you had a news machine, which I think is the most interesting part, because you could download scripts which had news items, but also mini-games, questionnaires, and rankings to show off to your friends how big your Magikarp is. Another very interesting game, in my opinion, is Net de Get, which was one of the only titles that used the MBC6 on the Game Boy. It's a minigame collection that came with 15 built-in minigames and could download more, and more would be released over time, though they never reached the titular 100 minigames, unfortunately. A few other games that were very interesting: Mobile Golf, a sequel to Mario Golf which never got localized but came bundled with the adapter later in its life, to help sell the adapter; Starcom, a sort of pet simulator; Game Boy Wars, part of the Wars series known for Advance Wars and Famicom Wars; and Mario Kart, which allowed you to upload and download ghost data. So let me tell you a little about how this project got started and where we are now. Somewhere in 2016, Háčky posted a thread on Glitch City Laboratories which explained a little about how the mobile adapter protocol worked. From there, we spun up a Python script that communicated with the BGB emulator, giving us a proof of concept that this thing actually worked.
Somewhere in 2018, a guy named Shonumi — known for emulating various peripherals made for the Game Boy, including sewing machines and fishing sonars — also emulated the mobile adapter, and specifically Net de Get, and created very comprehensive documentation that we are updating and keeping track of to this day. And at some point people wanted to actually bring a real Game Boy onto the internet, and that's kind of where I stepped in and we started doing stuff. So, fast forward to today: we have a group called REON. We are a group of preservationists, developers and enthusiasts who want to preserve this system and make it usable for the common user, as it used to be. For that, we are making emulators, servers and translations for a few of the games, so that they can be enjoyed by a wider audience. To give you an idea of how this all fits together, I'll explain a bit about how the system connects. This is a connection diagram. On the left side you have the user's Game Boy, which communicates through a custom link protocol with the adapter, which in turn communicates over a proprietary protocol with the mobile phone. The mobile phone is connected to the phone network, and depending on who you call, you can either call a friend and communicate with their phone directly — this was used, for example, for the Pokémon trading and battling — or you could call the internet service provider and use the Point-to-Point Protocol to tunnel your connection, over TCP and UDP, to the official Nintendo servers. Now, most of this is kind of irrelevant when we are emulating, because when emulating we can make big black boxes, depending on what you're doing. This is how it would look if you have a simple microcontroller that connects to your Game Boy and then connects over USB to your computer.
Your computer then communicates either with the game server or — if you want to call a friend — with a relay we have set up to punch through router firewalls and that sort of thing, which allows you to connect to any other player in the world. And these blocks can either be hardware or a full emulator, which also emulates the Game Boy itself and the adapter, so it's a little bit more variable. So, we have full documentation and emulation of the peripheral itself — or at least the part that communicates with the Game Boy — and for that we have made a library called libmobile. This library can be integrated into all sorts of projects, from software emulators to hardware emulators and back. We've integrated it thus far into the BGB emulator, which is a Game Boy Color emulator, and into the mGBA emulator — we've made a little fun interface to configure it as well. And some people have been playing around with making it work on the Raspberry Pi Pico, communicating over Wi-Fi for example, or on the Arduino Uno, which is mostly what I've been using. There's also the GBE+ emulator, made by Shonumi, whom I mentioned before. That's more of a local-only emulator, but it covers some games that we don't yet. And of course, full documentation of all this is available in the Dan Docs. These are a few examples of setups that people have put together. On the far left you've got the simplest one, which is just breaking out a few wires, connecting them to the Arduino, and then just plugging that into your computer and doing it like that. Some people have made PCBs; the central one is able to communicate over Wi-Fi, and Xenaro, a really active user lately, has made a 3D-printed version as well. Now, of course, you don't need to connect it directly to a computer: you can also just use a modern phone, which is basically a computer these days. We've also, of course, started emulating the server side of things.
We have the relay server, which I mentioned before, which gives you a phone number and allows you to call someone else. We have a mail server, implemented in Node.js and storing into SQL, so we can manipulate the emails more easily. And we have a few complete game servers: for Pokémon Crystal, which actually supports everything at this point, and a very driven person called Winter has fully emulated Mario Kart and Monopoly, though Monopoly doesn't have many features, unfortunately. GBE+ has also emulated a few games, in particular Net de Get, Game Boy Wars, All Japan GT Championship and Hello Kitty's Happy House, which allows you to send emails with items to your friends — which is very cute, I think. And of course we've also made a few translations: Pokémon Crystal, of course, was already localized, but we've restored all the mobile functionality for it, and we've also ported all of those changes to the four other languages the game was released in. Mark Max came to us asking if we were interested in his Mobile Golf translation — most of it has been translated, but not the mobile features, because we don't have any support for them yet. And the Mobile Trainer, which of course is the cornerstone of this whole thing. If you want to get into it, make an emulator for yourself, or develop a game that supports this thing: we have libmobile, which allows you to emulate the adapter itself, and we have our server repository, which you can extend with other games if you want to emulate those — though I would suggest, if you make homebrew, that you run your own server behind this. And unfortunately we still don't have a client library — the library that runs on the Game Boy itself — though we have reverse-engineered the library from the Nintendo SDK, if you don't care about licensing problems. So, in conclusion: most of the things that you'd want to see are already there. Of course, we don't have all the games yet.
The problem that we're mostly struggling with right now is authentication, and getting this usable for actual people who aren't very techie. So if you want to help with any of that — documentation, making tools, websites, whatever — you can reach us on, unfortunately, Discord only. If you want to make a Matrix server and bridge, I would be very happy, but unfortunately, right now I would be the only person using it. Our GitHub is over there, and Shonumi's blog — with a lot more peripherals and funky things that he's emulated with the Game Boy — can be reached through his GitHub Pages. That was it. Thank you. Thank you. We have time for one or two quick questions. I have a very quick question — thanks for the talk. Do you know how the original games that you could download off the internet back in the 2000s were captured? That's like 22 years ago. So, one of the things that we actually sometimes need help with: if you have any of the games that supported the mobile adapter, don't run them — dump the save directly. If the battery still lives, we might be able to restore some of the games that were downloadable back then. Thankfully, though, we have the 15 built-in games, which serve as an example to make more, so that helps a lot already. Another quick question? Yes? No? Okay. Well then, thank you. You can get prepared. It was really interesting. Thank you. Thank you.
PiStorm - The evolution of an open source Amiga accelerator
Okay, we are right on time. Many thanks. So, Andrew with the PiStorm. Hello everyone. I was stupid enough to do this from an Amiga 1200, which is great, because I don't have a screen in front of me, so I'm going to try and see what I'm doing whilst I'm doing it — but it'll make sense later. So I'm here to talk about PiStorm. My name is Andrew Hutchings, also known online as LinuxJedi. During the day I work for a non-profit called the MariaDB Foundation, and by night I restore Commodore Amigas and Acorn computers, I design upgrades for them, and I'm part of the PiStorm community and a whole bunch of other things. I've also written for Pixel Addict — go buy that, because the next issue's got a big article by me in it. And I'm also going to plug — the artwork there was made by Stu Cambridge, of Sensible Soccer fame; he did Cannon Fodder and all of that lot, and you can get him to do doodles of you, just like that, from his site. What's it called now? Design Droid. He doesn't know I'm plugging it, but I love his work. So, anyway, about PiStorm. It was a project created by a guy called Claude Schwartz. If you've ever tried to use or upgrade a Commodore Amiga today, you need a processor like a 68030 or a 68060, and if you want a 68060 with a board and RAM and everything like that, you need to sell a kidney, basically — they are really rare and really expensive nowadays. So the idea was to create a very fast budget accelerator, and you can get a lot of compute resources from something called a Raspberry Pi, which you probably all know about. What this essentially does is emulate the 68000 processor on a Raspberry Pi — running Linux, originally — while the rest of the Amiga motherboard is used as-is. And then it adds things such as RTG. Now, RTG stands for ReTargetable Graphics, and essentially that means it's like a second graphics card for your Amiga. So this is what I'm actually projecting from right now: the RTG output from my Amiga. It still has the native Amiga output as well.
If I tried to run an old Amiga game on it, you wouldn't see it on the screen right now, because I haven't got the output for it connected — I'm going to talk about that a little bit later. It adds virtual SCSI: the SD card on there is basically a driver for the PiStorm to talk directly to the Raspberry Pi's SD card, so rather than being emulated, it's almost like a direct driver, in a way. And it also adds RAM. I've got a Raspberry Pi 4 in here, so nearly 2 GB of RAM added to what is normally a 2 MB system — a little bit of a boost. And everything is open source; the boards are open hardware and so on. What we used to do is a group buy, where you could come along and say "I want to buy one of these", we'd all go to JLCPCB and buy loads of boards together, and you just had to solder on the headers. That was great until the chip shortage, and then that kind of died off completely. But back then, as I said, you wouldn't pay more than 20 bucks for a PiStorm — about 18 pounds, probably about 20-odd US dollars, whatever. So it was really, really cheap; you just need a Raspberry Pi. So this is what the first one looked like. Now, you can see there are quite a few chips on it, on top of what is normally a Pi GPIO header. Essentially the problem we have is that the Pi GPIO header is 40 pins, but you only get about 26 GPIO lines from that — and the 16-bit Amiga has a 16-bit data bus, then a 24-bit address bus, and then control lines on top of that. It's a lot more than you have IO lines. So what we've got here is a CPLD — a programmable logic chip, essentially — and in there we have basically a 68000 state machine, which does all sorts of multiplexing of the communications to the Pi. And then we have some buffers, basically because voltage-level translation is needed between the CPLD and the Raspberry Pi, and then the external IO logic. So they were nice and simple boards.
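The pin-count problem the CPLD solves is simple arithmetic; a small sketch of it below — the number of control lines is my own rough assumption, not a figure from the talk:

```python
# Signals the 68000-side bus needs versus what the Pi header offers.
DATA_BITS = 16       # 16-bit data bus
ADDR_BITS = 24       # 24-bit address bus
CONTROL_LINES = 8    # assumed: AS, UDS, LDS, R/W, DTACK and friends
PI_GPIO_LINES = 26   # usable GPIO lines on the 40-pin header

needed = DATA_BITS + ADDR_BITS + CONTROL_LINES  # 48 signals
# 48 > 26, so the CPLD has to time-multiplex the bus onto the Pi's GPIOs.
assert needed > PI_GPIO_LINES
```

Whatever the exact control-line count, the conclusion is the same: the bus needs roughly twice the signals the Pi exposes, which is why the state machine in the CPLD multiplexes them.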
We could get JLCPCB to build all these originally, until the CPLD kind of ran out of stock, and then that became difficult. And the logic that we wrote for the CPLD is enough to run an Amiga, but it doesn't include some of the state control lines that other systems use, because we were targeting an Amiga 500 at the time. So this supports a 500. It supports most of an Amiga 2000, the 1000, and the CDTV. And then... I'm doing this on my clicker, and of course I've got my clicker connected. So it used a Raspberry Pi 3A originally. You could have used the Raspberry Pi 3B, but you'd have to raise the header a bit, because otherwise the Ethernet port smashes into the board. And that's not good. You can take the ports off the 3B if you don't want them, or you can extend the header. Also, a Pi Zero 2 W will work. If you don't know, the Pi Zero 2 W is basically a Pi 3, but in a much more compressed format. We ran Musashi — I hope I'm pronouncing it right — a 68000 CPU emulator, which was good. It's a pretty good 68000 emulator, and then there's some glue code to make it work, but it was basically an off-the-shelf emulator. And most of that software was done by a guy called Björn. He's not part of the project anymore, but he did a lot of great early work on it. Again, I'm clicking on my clicker. So, performance-wise, you can see here — this is what's called SysInfo. It's kind of the stock benchmarking software for an Amiga. And on an Amiga 600 — which is the same as an Amiga 500, roughly — the original PiStorm ran about 23 times faster, which is pretty good acceleration. You're getting even faster than a 68030 at 25 MHz — about 50 MHz 030 kind of speed out of it — which is pretty good performance for something that costs a lot less than even the CPU for an 030. How did I get into PiStorm?
I was designing some new hardware for a Commodore Amiga, and the other advantage of having Musashi on PiStorm is the fact that you can, on the fly, change the entire configuration of the Amiga. If I want a different OS ROM to boot into, a different RAM configuration, different hardware configurations — all of that can be changed on the fly. I started providing patches and helped build a community. As of probably September, we had 7,000 members on Discord and 3,000 on Facebook. So it's grown to a pretty big community. Things I've done — I'm going to skip over this, but I did a lot of the early work regarding bug fixing and things like that for the original Musashi PiStorm. Then we released a version for the Amiga 600 and Amiga 2000. They are essentially the same thing, but the Amiga 2000 has a coprocessor slot, so it's much easier to just plug it into the slot. On the Amiga 600, you have to do this hacky thing where it sits on top of the PLCC CPU, and there's a little thing in there to tell that CPU to go to sleep, and then it's basically identical after that. So then EMU68 came along. EMU68 is a bare-metal emulator for the Raspberry Pi, for the 68000, so it's much, much faster. You don't have to boot into Linux anymore. This is what this is booted from. It became an option for PiStorm in 2021, and now it's pretty much the de facto standard, and it uses JIT-based emulation instead of table-based. So performance-wise, it got a bit faster: 1,490 times faster, and this is just on the Amiga 500. Then the PiStorm32 came along. That project was scrapped. Essentially, it's the same kind of thing, but for the 32-bit Amigas like this one. But it became very hard to build, and it required a Pi CM4, which is a Pi without all the ports and everything — you've just got these big connectors on the bottom — and it became difficult and expensive to build. So we gave up on that, and instead built the PiStorm32-Lite, which is Lite because it doesn't have all the ports on it.
But basically, it's the same kind of thing. And we have a nice big FPGA on there instead of a CPLD. An FPGA is just much more logic, but you have to flash it every time you turn it on. And that was basically the start of what became the A1200 setup. This is kind of the peak of PiStorm right now. We released that about a year ago, and it's still going strong. Performance-wise, we're now talking 3,052 times faster than an Amiga 500, which is not too bad. Even against the Amiga 1200 — which this is — it's 1,326 times faster. And you can get faster still if you overclock it. I'm not going to overclock mine; I've got a little fan running underneath it as it is. And you can see this is what mine looks like inside this Amiga. So you've got the PiStorm in here, and then I've got a little cable running out of the HDMI port to the back, and that's what's running the projector right now. And then I 3D-printed a kind of assembly with a fan in it, just to keep everything nice and cool. Demo time. So, John Carmack said the Amiga is not powerful enough to run Doom. At the time, to be fair, he was right. The de facto Amiga at the time was the Amiga 500 — or my Amiga 600. If you wanted one that could run Doom, it would cost you thousands and thousands, much more than a PC would at the time. But today... here we are, running Doom. Yes. But I can do a bit better: AmiQuake. And I haven't got sound hooked up, unfortunately, but what I can do is: timedemo demo1. It's slow, I know. So we've just got to wait for this demo to finish, just to get a nice benchmark out of it. And there we go. So we get 93 frames a second out of Quake through the RTG. If I run this through the AGA graphics — the built-in graphics — instead, we still get about 45 frames a second. So it's a bit faster than native, which would be a few frames a second at best. Oh, there it goes, that window. So... [Question:] Can the PiStorm modify chip RAM? So, chip RAM is chipset RAM.
It's the RAM that the entire chipset in the Amiga uses to talk to each other — the audio chip, the graphics chip, etc. That is capped at 2 megabytes by design, by Commodore. They were trying to move it to 8 meg for the Amiga 4000, but it never really got there. No, we can't, because we don't modify the chipset. We don't override the chipset, so we can't increase the RAM that the chipset uses. So whilst we have 2 gig of fast RAM, we can't add any chip RAM. [Question:] Can you emulate a PowerPC? Probably you could, but it's going to be a lot of work, and we don't want to do it. So if anyone wants to put a PPC emulator in there, it will probably work. [Question:] Can you use PiStorm in other 68000-based machines? Yes. Someone's done a port — I forget the name — for the Atari, which basically meant pretty much rewriting the firmware to make it work, because the Atari actually uses all of the 68000, instead of the hacky thing the Amiga did. I love the Amiga, but Atari did that bit a bit better. And there are similar problems with the Apple machines. So there are projects where they're trying to get this running. It's not all the way there yet, but they're working on it. [Question:] CD32 — sorry, 3000 and 4000 versions? In theory, the one in this machine should work in a CD32, but it doesn't, and we don't know why yet. We haven't had time to figure it out. It shouldn't take much modification to make it work. The 3000 and 4000 versions are going to require a lot more bus arbitration work, so it's just time to do that. And then the really cool thing we're working on right now is an Amiga native video injection device — which we haven't got a name for yet. Essentially, what it does is capture the digital video — it sits in various places in the Amiga, depending on the model — before it gets converted to analog, pipes it through the camera port on the Pi, and then you can have both native video and the RTG video through the HDMI on the Pi.
So, if you want to sponsor PiStorm development, Claude has a donate button on his PiStorm32-Lite GitHub page — do check it out. Michal, who develops the EMU68 project, has a Patreon to sponsor its development. And if you have any questions at all about the project, feel free to come to me — I'm Linux Jedi pretty much everywhere — and I'll be happy to answer them. And that is it. So we have time for questions. Any questions? [Question:] Thanks a lot for your talk. According to the SysInfo output, it's not emulating a plain 68000, but an 030 or 060 or 040? So, with Musashi you can choose which one you want to emulate; the 020 and 030 were the most stable. EMU68 currently pretends it's an 040, but will support the instruction set of the 060. [Question:] Okay, so that's only about the instructions — it does not emulate the MMU, I guess? Yeah, it's just saying, hey, I'm an 040, but it doesn't really matter. It will run 060 code fine. [Question:] Hi. I'm actually Debian's m68k maintainer, and I'm wondering if there are plans to add MMU support, so you can boot the Linux kernel. Are there plans for the MMU? That is a good question. Musashi — no, we did have it to begin with, and it was broken, so we dropped it. EMU68, I believe, somewhat supports the MMU, but needs some work to support it properly. At the moment it's a direct mapping: it's basically given a block of RAM in the Pi and told, yeah, just use that. So we could probably emulate the MMU, but it's too much trouble right now. [Question:] Thank you for your talk. Just a quick question about the EMU68 variant. Do you need to maintain a second OS on the SD card, or is it effectively a persistent thing once it's on? No, it's a system that boots up by itself completely. There's a whole set of tools out there from the Pi Foundation to create your own bare-metal OS, essentially, so it's an OS in its own right.
The downside to that is that for every piece of hardware, we have to write new drivers from scratch to be able to talk to it. That's why, if you want to use Ethernet or Wi-Fi or anything like that, it becomes a much harder task for us on EMU68, and that isn't there yet. [Question:] So there's no USB host support? You can't use a USB keyboard? I'm sorry — again? There's no USB host support for the Pi? Not on EMU68, no. Musashi will actually support keyboard and mouse through the Pi's USB, yeah. Still time for one or two quick questions? Yeah, one in the back. [Question:] Hi, quick one, I think. Did you have to do anything special to cope with the bring-up time for the Pi, because it's a lot slower than the CPU? That's a really good question — the bring-up time for the Pi. So the CPLD versions hold down reset until the Pi's booted. So basically the machine is resetting constantly, kind of thing. The version in this, the PiStorm32, will boot the native CPU first, because the FPGA hasn't been flashed. Once the FPGA is flashed, then reset gets held down. And it's a very short time — you're talking like two or three seconds. Still time for one question? That will be the last one. [Question:] So I guess the problem with the CPLD version is that AMD has announced they're going to stop making those. So AMD — or Xilinx, as it was — do you use them? Yes, but Xilinx is AMD now, right? They announced like the last buys or something from now. Yeah, no, we're not using Xilinx ones. The CPLD is an Altera MAX — MAX II — so that's Intel now. And the FPGA is an Efinix Trion. Ah, okay — I thought it was Xilinx in the first picture, but maybe that's wrong. Yeah, or maybe that was a prototype.
The other projects I maintain — yes, they are all screwed in regards to Xilinx, but... Good, that's it. Many, many thanks, Andrew. No problem, thank you very much.
MAMBO - Dynamic Binary Modification Tool for RISC-V
Okay, hello everyone. We are here to present MAMBO, a dynamic binary modification tool, and what better way to start the presentation than with a demo. So we are going to see a fairly complex application running on RISC-V under our system. So let's see it. We are going to use it to learn something about the running binary. So here it is. Okay, so this is not our tool — this is just an image viewer on Linux, and we generated this picture with one of these fancy AI tools so we could promote our talk on LinkedIn. But what's really happening is that this image viewer is running under our tool, which runs on RISC-V, and we use it to find some information about the binary. So here we have a very simple tool that counts the number of threads that the application used, so we can see we have eight threads. So the application ran under our tool on RISC-V, and we can see that it used eight threads. Okay, but introductions first. I'm Igor. This is Alistair, and we are here from the University of Manchester. As I said, we are going to talk about MAMBO, which is a binary modification tool for RISC architectures. Okay, thanks. But first — does anyone here know what dynamic binary modification is, or has heard the term before? Raise your hands if you have. Okay, wow — a few people. That's good. You may not have heard the term, but I'm pretty sure that if you've done any development, you've used these frameworks. The best-known open source tools that do dynamic binary modification are Valgrind and QEMU. I'm pretty sure you've used Valgrind — and one of its tools, which is called Memcheck — and most of you, being in the RISC-V room, probably use QEMU. Both Valgrind and QEMU are dynamic binary modification frameworks, and they have various tools built on top of them. So this is what MAMBO is. Okay, but let's break down this term a bit. What do I mean by dynamic binary modification?
So, dynamic means working at runtime: while the binary is running, the tool is working. Binary means we work on natively compiled code: we don't need the source code, we just take a binary that was already compiled, and we can analyze it. And modification means that we can alter the application in a specific way: we can add extra functionality, we can remove functionality, we can swap functionality. There are two related terms: dynamic binary instrumentation and dynamic binary translation. Instrumentation is basically a subset of modification — we just insert new functionality into the binary. So for example, if I want to do some sort of profiling, I can insert counters into the running binary. And translation is more of an overlapping term: I can swap one ISA for another. We could do that by modifying the binary, or there are more specialized tools that do the translation. You are probably familiar with Apple's Rosetta, which translates Intel code to ARM when you get your new MacBook, and QEMU can also act as a translator — it is usually used like that, because it can translate one architecture to another. So, a few uses of these tools: you can do program analysis, you can do error detection — I'm pretty sure most of you are familiar with that use case — and there is dynamic translation. OK, but now the question is: why would you use MAMBO if there are other tools? MAMBO has been specifically optimized for RISC-V and ARM — RV64, ARM 32 and ARM 64. In this talk we are focusing on RISC-V, but we also have a version of the tool that runs on ARM. The tool features low overhead, and to our knowledge this is the only currently available DBM tool that has been optimized for RISC-V. And the tool itself is fairly low in complexity: if you would like to dive in, the codebase is around 20,000 lines of code.
So if you want to learn how it works, or if you want to modify the internals, the entry bar is not that high. And it has a simple plugin API that allows you to write architecture-agnostic plugins: you can write a plugin for RISC-V and later on deploy it on ARM if you would like. But it's worth saying it is not a toy. We showed before, in the video, that we can run fairly complex applications — full GUI tools that ship with Linux. It can run stuff like GIMP or LibreOffice as well. So the tool itself is not a toy. OK, and if you are interested in what the numbers roughly look like: we evaluated it on the SPEC benchmarks — don't worry too much about the numbers; if you want, we can point you to the paper, or we can talk about it later. The idea is that for the FP benchmarks, which are more like data processing, we get around 6% overhead if we just run the framework, without an extra tool built on top of it. And it's around 30% for more general-purpose computing. So the baseline, with no plugins enabled, when you just run the binary under the tool, is around 30% overhead. OK, so that was the brief introduction to what dynamic binary modification is. Now I'm going to briefly talk about how MAMBO works internally. I'm going to mention a few details that are useful if you would like to contribute to the internals of the tool. The focus of the talk will be more on the developer side, but I would like to highlight a few bits and pieces so you will understand how MAMBO works. OK, so this is a simplified diagram, and I'm going to talk you through the more important bits of it. The instrumentation plugin API is the part that Alistair is going to talk about in much more detail, and I'm going to cover everything else.
OK, so first of all, the first component is the ELF loader. If you run any binary on Linux, it has to first be loaded into memory before we can run it. In our case, MAMBO itself is loaded by Linux using its default loader, and then MAMBO has to load the application — which we call the hosted application. So MAMBO has a custom-built loader inside it, which takes the application and loads it alongside MAMBO, so it can interact with it, modify it, and run it. That's the first element. The second element is the instruction encoder and decoder. While we execute the application, we have to modify some of the instructions — we have to know what instructions we are copying, scanning and modifying — and this is what the instruction encoder and decoder does. You may be familiar with projects that are fully fledged assemblers; this is instead a very simple module that basically takes a text specification of each instruction and the fields it has, and uses some Ruby scripts to generate the C functions to encode and decode those fields. That's what MAMBO uses, because it's fairly simple and low-overhead, and that's something you want inside a tool that runs dynamically. Okay, and now the two most important parts of MAMBO: the code scanner, the dispatcher and the code cache. So let me maybe first talk about what the code cache is. We have MAMBO, and MAMBO uses the loader to load the binary into memory. Now we want to run this binary, but we also want to modify it. If we just loaded the binary and ran it, it would run as it did before. That's where the code cache comes in. This is not the hardware instruction cache — it's just allocated space in memory that we call the code cache. And the MAMBO scanner copies instructions from the binary that we loaded in memory into the code cache.
And in the process of copying those instructions, we can introduce new functionality, remove some instructions, or replace some instructions. So the scanner is responsible for copying instructions from the loaded binary into the code cache, and the code cache is what will actually execute on the processor. Then we have the dispatcher, which is responsible for actually running the code. The scanner will copy a basic block, and then say: I've finished copying a basic block. Now control goes to the dispatcher, and the dispatcher will start the basic block and actually execute it natively on a RISC-V processor. When we finish the basic block, control returns back to MAMBO to scan the next basic block, then it goes back to the dispatcher, which executes the next basic block — and so on, back and forth. And if the code is already in the code cache, we don't have to scan it, so we can directly execute the next basic block without scanning. Now, this is very simplified: if we did it exactly that way, it would be very, very slow. So there are a number of optimizations, for MAMBO to stay in the code cache as long as possible. It scans things ahead of time and tries to guess what the next thing it jumps to will be, and if it can do that, it can stay within the code cache. Otherwise, it has to go back to the scanner and the dispatcher, if it doesn't know what the next basic block is. Okay, and this is what I was talking about: when we execute the application, we have a single process with two binaries in it and two contexts. There is a MAMBO context that scans instructions, and then the dispatcher changes from the MAMBO context into the application context: it will save the state of MAMBO, jump to the code cache, execute out of the code cache as long as it can, and then, if it cannot find the next target in the code cache, it will go back to MAMBO.
So it will save the application state, restore the state of MAMBO, the scanner will kick in, and then it goes back and forth. So this is the principle of how it works. Okay, so the dispatcher and the scanner are the two main elements in MAMBO that allow us to do the modification and execute the code. And the last thing is the kernel interaction. On top of just executing the application, the framework itself has to interact with the Linux kernel: we have to handle and pass on signals and system calls. This is important because, for signals, if there is a signal coming from the operating system, it will first hit our framework — it will first hit MAMBO. But you don't always want MAMBO to handle the signal; in many cases you want to pass it on to the application, because the application may have a handler installed for that signal. And in the same way for system calls: if the hosted binary does a system call — for example, let's say a thread creation — MAMBO needs to know that it created a thread, because it has to track every thread that gets created. So MAMBO has to learn first what the system call was, and only then can it pass it on to the Linux kernel. So that was a brief tour of the architecture of MAMBO: we had the ELF loader, the instruction encoder and decoder, the two main elements — the scanner, the dispatcher and the code cache — and then a bit about handling signals and system calls. If you are just going to use MAMBO to write your plugins and tools, you probably don't need to know all of that, but it may help to know how MAMBO works — and if you want to contribute to the internals, that hopefully gives you a rough idea of how the system works. But now, the bit people are probably more interested in: how we can write our own plugins, our own tools, within the framework. And for that I will pass the microphone to Alistair.
Hi, so yes, I will talk to you about the API — this is how you take MAMBO and build your own tool on top of it. So this is where it actually gets really useful. We've mentioned use cases, but it's worth repeating: we're talking about things like code analysis — you can build a control flow graph — you can generate new functionality, you can instrument code, you can analyze it, you can re-implement library functions, you can patch library functions. You can do all sorts, because you can modify this running binary. So, MAMBO's API exposes events — it's event-driven. You, as the user of this API, define functions which you register as callbacks on these events, and when one of these events is encountered, MAMBO will trigger the callback and execute the function that you registered to it. There are two categories of events. There are hosted-application runtime events: these are events that happen to the hosted application as it's being executed in the code cache — things like system calls and thread creation. And we have MAMBO scan-time events: these happen as MAMBO is scanning instructions from the loaded ELF into the code cache — something like pre-instruction and post-instruction. You can do stuff with these callbacks. As I was mentioning, pre-instruction and post-instruction kind of give you an idea: you can insert something before and after an instruction, before and after a basic block, before and after a thread. So you can see it can be very, very fine-grained, or it can be at a high level of abstraction — and of course before and after an application runs. Taking all of this together — you see a slightly chopped-off diagram there, but it kind of gives you an idea of the order in which these callbacks will be executed.
So at the very highest level, at the very start, you have the initialization function, which is where you set up a plugin. Then you'll have pre-thread — that's quite high level — pre-basic-block, and you also have pre-function, so it kind of gets narrower and narrower, and then it expands out again after these things have executed. So this is something that's important to bear in mind. So how do you actually use MAMBO's API? I'm going to talk to you about the following things: the functions you'll need to register your callbacks, the functions that perform code analysis, the functions that perform instrumentation — how you actually emit code into the code cache — and then various helper functions which you can use. The first thing you need to do is initialize your plugin, and this is done in the plugin constructor function. There are two main things you do here. You create a MAMBO context, which is a global data structure holding the current state of MAMBO, and also of the application being executed by MAMBO — pretty much all of MAMBO's helper functions will use this context to get, for instance, the current instruction that you're looking at. And this is also where you register callbacks. For instance, here we have "mambo register pre-instruction callback". So before an instruction is actually scanned into the code cache, something that you register here will execute. Registering callbacks follows this signature: you have "mambo register", then an event time — that's pre or post something happening — then the event, so that can be the MAMBO pre-instruction callback. It's quite easy to remember that way. So, you've registered your callback. Let's say we're building a plugin that counts the number of branches that are executed, and you've registered a pre-instruction callback. So now MAMBO is scanning things, and your pre-instruction callback has executed.
One of the first things you're going to want to do is use a code analysis function — you want to know which instruction you are looking at. So you have things like "mambo get branch type" and "mambo get condition", which would, for instance, give you the condition of the branch you're looking at, if it's a conditional branch. These give you information that you can use and choose to act on. The function signature of these analysis functions follows "mambo", then an action — that would be get, set or is — and then the information. So "mambo get branch type", relating back to our example, would get you the type of the branch that you're looking at. Bringing all of this together into a simplified plugin: we have the constructor, where we initialize the context and register a pre-instruction callback, and when that's executed we get the branch type, and then, based on what type of branch it is, we do something. It's also worth pointing out that the branch types we're looking at here are generic — that's how it stays portable between architectures. So, you've found out you're looking at a branch. Now you're going to want to actually emit instrumentation — instructions that you put into the code cache to do something. For instance, we have the emit 64-bit counter increment: this is how you tell MAMBO to emit the instructions needed to increment a counter. You can emit pushes, you can emit pops, you can set registers — you can do all sorts of things. And there are two main types. You have emit instructions — for example, emit increment — which are more portable, because we implement the backend that tells MAMBO which instructions to emit into the code cache for that. And then you have the more architecture-dependent ones, which emit RISC-V instructions — this is for when you know exactly what you are trying to achieve with the plugin. Let's say you need to emit an arithmetic instruction.
You can do that and tell MAMBO to emit this arithmetic instruction. The only drawback is that it's riskier: you have to make sure that you save and restore registers and that kind of thing, which we do for you in the safer, generic ones. And then finally you have additional helper functions. For instance, MAMBO exposes a hash table, which is really useful when you're instrumenting code and have lots of data to associate with different addresses. So we have hash tables, we have a MAMBO allocator — these will help you write your plugin. And then finally, something that can be very difficult to get your head around — it took me a while to fully understand it — is the difference between scan time and run time. When we talk about scan time, we talk about something that happens once, when MAMBO is scanning something; run time is when that scanned code is executing in the code cache. The reason this difference matters is that if you are, for instance, counting the number of branches that are executed, then at scan time you need to emit instructions into the code cache that increment a counter, so that when that code is executing you get the actual number of times that instruction is executed. Okay, so it's time for an example. The code I'm about to show you can be found in the MAMBO repository, in the plugins directory, and it's time for a live demo. I will be running Vim under MAMBO on RISC-V to show you the source code of the branch counter plugin — which is something that you can run, and is in the MAMBO repository — and whilst running Vim I will also have the branch counter plugin enabled, so you can see it in action. Sounds very convoluted, I know. Okay, so here we run MAMBO, and I don't know how well you can actually see that but... Command shift plus? Oh, command shift. Hooray. Do we need more? Bigger. Oh, bigger. Even bigger. Okay, yeah.
Okay, so we start with the constructor function, which is where we set up MAMBO's context, and we register four callbacks: a pre-instruction callback, a pre-thread callback, a post-thread callback and an exit callback. The order these will actually be executed in is: pre-thread, pre-instruction, post-thread, and then exit. So I'll start with the pre-thread. In the pre-thread handler we initialize the counters for that thread: we have a direct branch counter, an indirect branch counter and a return branch counter. The reason we have these per thread is that each thread has its own code cache, and therefore its own numbers of branches that will be executed, which is why for each thread that's created we initialize its own set of counters. And then we have the pre-instruction callback. For each instruction that's scanned, we check if it is a branch, we get the branch type, and then for each of the types of branches — the return branch, the direct branch and the indirect branch — we select the correct counter for that thread, and we then emit a counter increment into the code cache, so that the correct counter will be incremented. Okay, so at this point Vim is running away, and when we close it, the post-thread handler will first be executed. This says: okay, this thread is terminating, let's take this thread's count for each type of branch and add it to the global total — and it does that atomically. And then finally we have the exit handler, which just says: okay, this application has now terminated, let's print out the global totals, which are composed of the individual threads' totals. Since Vim is a single-threaded application, we get one thread and one total, which you can see there.
Okay, and now I'll quickly talk to you about some lessons we learned from porting Mambo to RISC-V, because it was originally written for ARM, so there are differences we had to take into consideration. The first thing was the range of branches. Conditional branches and direct jumps have quite a limited range on RISC-V, which is less of an issue on ARM because they have a much longer range there. Why this matters: in a compiled binary, obviously the offsets will be fine, because that's how it was compiled. When you take that code and put it into a code cache, that's done on demand, and so the ordering of that code may be different, and therefore the offsets may be different and exceed the offsets of the original binary. And so we may have to replace these instructions with instructions that have a longer range. So with a conditional branch, we may have to insert an additional jump instruction that is triggered when the branch condition is true, to extend the range of that branch. And the same for a direct jump: it may need to be replaced with instructions that first load the address into a register and then take a register jump. We also have load-reserved and store-conditional. You can only have a limited number of instructions between these two instructions, and you also can't have loads and stores in between, otherwise the reservation will fail. This matters in dynamic binary modification because we insert additional instructions, so we have to place limits on what you can do with atomic instructions in plugins, and with the other optimizations we implement, we have to be mindful of this limitation. And finally, we have the thread pointer register, X4. There isn't a dedicated register for this in the general register file on ARM.
And so when we create a new thread, Mambo will save and restore the context by saving and restoring all registers. We need to make sure that the thread pointer actually points to the newly allocated thread-local storage, otherwise there will be a world of pain, which we found out. Okay, so in terms of the roadmap, where we take it from here: we of course want to foster our open source community. We really welcome collaborations and contributions, not only plugins but also any contributions to the main internals of Mambo. As part of this, we are currently in the process of improving documentation and also developing more tools to kind of give people a flavor of what's possible. So for instance, we're currently porting Mambo's memory checker from ARM to RISC-V. We are also trying our very best to keep up with all of the new RISC-V and ARM extensions that keep appearing. We also have various research projects ongoing that make use of Mambo. And it probably goes without saying, since this is a talk at FOSDEM, but Mambo is open source on GitHub with an Apache 2.0 license, so definitely check it out. And we'd like to thank our sponsors. So yeah, any questions? Yeah. Oh yeah, yeah. So you're asking how we handle pointers when we scan code from the binary into the code cache, since those pointers are still pointing into the binary. So we actually handle instructions like that specifically in the scanner. For instance, if we take a branch instruction, the first time that branch instruction is executed it will point to Mambo's dispatcher, which will perform a lookup. We then have optimizations which will replace that branch instruction with a direct branch to the next basic block. And the same for loads and stores: we update these to point to the new location. So a basic block is... oh sorry, yeah, I'll repeat the question: what is a basic block? A basic block has a single entry and a single exit point, so it essentially ends when there's a branch to somewhere else. At the back.
Yeah, so in a general case... oh, I keep doing this. So: how often is load-reserved/store-conditional an issue? We find it's not that much of an issue. Most applications won't have a problem with it. It becomes more of an issue when you have plugins that do something in between. So for instance, if you're counting a specific type of instruction that may occur between these two instructions and you emit stuff into the code cache, you may end up exceeding this 16-instruction limit. You mentioned translation early in your presentation; does Mambo support running ARM on a RISC-V machine and vice versa? So, does Mambo support translation? Not currently. You need to be on that architecture. What happens if I try to run a just-in-time compiler under Mambo? What happens with a just-in-time compiler? I'm not sure. So, Mambo is designed to support self-modifying code. Basically what it does: you have some code in the code cache, and the just-in-time compiler recompiles it, so the cache will be flushed and then it will re-scan it again. It carries some performance penalty, but it will react to things like that: it will re-scan the code and put the new version into the code cache. So it does support self-modifying code. It should be. Hopefully. This isn't tested on RISC-V because most browsers don't seem to be ported. Any other questions? So, what are we interested in doing with RISC-V applications through plugins? We're interested in building tools that perform things like memory checking, data-race detection, that kind of thing. So, tools that are very useful to people developing software on RISC-V, to kind of help them do that. And just to add to that, we haven't mentioned it on the slides, but we also have some research. That was for ARM, but done on architectural simulation, so kind of co-design of accelerators and CPUs in an SoC system. So there's some stuff going on, but yeah.
So at the moment, I think for RISC-V the biggest push was to get the base system to work, and now we are exploring what we can actually do with the system on RISC-V. Any other questions? Does it update sections that refer to pieces of code, like jump tables, between basic blocks? So the question is about whether Mambo supports jump tables, and how it does so. We do not rewrite any of the sections of the original binary; basically, Mambo works on demand. So we have a jump that uses a jump table. Mambo will try to remember the most recent jumps, but if you miss, you have to go back to the scanner, scan the code again and then go to the dispatcher. So we are going to use the addresses that are already there, and we are going to keep the translation of some addresses in the code cache, but not all of them. But we are not going to rewrite the actual jump tables in the data section of the binary. Any more questions? Okay, so the question is about the data-race detector and whether we could implement some sort of stepping back within Mambo. The data-race detection is in the early stages, and you will not have such verbose functionality as rr or GDB replay or whatever, but you can do things in a fairly easy way when you scan the basic blocks. We don't yet have functionality to detect the data races. But let's say, in the general case, if you want to inspect what's happening, you can introduce a trap instruction into the code cache, then run under GDB, and then you will trap on that instruction and you can inspect what's in the basic block after the translation, and you could try to look at what was in there before the translation. So you can do some of these things manually, but there is no automated way to replay and go back in time. Thank you.
Unleashing RISC-V in Managed Runtimes: Navigating Extensions, Memory Models, and Performance Challenges in OpenJDK
Hello, my name is... does this work? Or? It's green, it's good. It's green, it's good. Yeah, my name is Robbin Ehn. I work at Rivos on RISC-V and I'm mostly working on the OpenJDK. So I'll talk about some of the experience with the OpenJDK. And unfortunately for me, I can't lie too much, because I see some experienced OpenJDK people in the crowd here, so we'll see if they correct me. So yeah, this is basically what I'm going to talk about: the OpenJDK, the JIT, which is kind of important for a new architecture. We're going to mention trampolines, like Mambo did. We have some cross-modifying code. We'll talk about all the extensions we have, a bit about sign extension. And I was going to talk about canonical NaNs, but I think Ludovic did a good job of it, so I might just skim through that. So, I'm not sure how much everyone knows about the OpenJDK, but it's a huge C++ code base with inline assembly, and there is a lot of C++ code which is architecture-specific, since we have different ABIs on different architectures, so the C++ code needs to know the ABI for these architectures. We have a template interpreter, which means we basically have assembly snippets implemented for each thing we want to interpret, which jump to each other, so it's not C and a big switch statement. We have two compilers, C1 and C2. One is very fast and one is a bit slower. The first one usually compiles with profiling: we keep profiling in the interpreter, we keep profiling when we compile with C1, then we compile with C2 and we drop the profiling, because the profiling eats into your performance. And the template interpreter is actually generated during startup, because we customize it (you might use a GC which requires some specific load barriers and stuff), so we generate the code for the interpreter and we generate a bunch of other code, like a lot of assembly which is glue between the runtime, the compiler, the interpreter.
So, the RISC-V port: it's fully functional, all great. Well, we are missing some optimizations, and when we say fully functional, we mean with limited testing. As Ludovic talked about, testing is a pain: we have small boards, we have QEMU, and OpenJDK has a lot of tests. We have tests that can run for like a week, just one test. If you take that and try to run it in QEMU, it will take forever. So, we have JDK 21 and 17, and we are working on 11 to get the port done for 11. I wouldn't recommend JDK 11; I would recommend at least 17, because it's much faster, it's better, and you also get a better GC. Yeah, the other platforms, like x86, have had like 25 years of optimization, and our port is, I don't know, four years old, so we are missing at least 20 years of optimization in the codebase. So, just-in-time compilation: why? Yeah, of course the obvious reason is because we want write once, run anywhere, but we also have some other things going on in the OpenJDK. We have a dynamic class hierarchy, as we can do class loading (or we always do class loading, otherwise we wouldn't get any classes), which means that the hierarchy is changing. So it's not such a good idea to try to pre-compile, because at any given time your class hierarchy might be different. So even if you did pre-compile, since mostly everything is virtual (it's virtual by default), you would just do virtual calls all over the place. So that would be slow. But with JIT and profiling, we can avoid virtual calls and we can speculate a bit about the class hierarchy. When do we compile? Yeah, we compile hot methods. And as I said, first we compile with C1, we keep profiling, then we can compile with C2. So what we do is kind of a speculative compilation, which means that if we see you have never executed this branch in your method, we may choose to remove that branch and put in a trap instead.
So if you actually want to run that piece of code in that branch, instead we trap, we deoptimize and go back to the interpreter. And we can do the speculation based on the profiling. So if you have a hash table and you put Cars in it, and you call hashCode, we can guess that this call to hashCode will be on a Car. So we don't need to do the vtable lookup; we can instead guess that you're putting Cars here, so we call hashCode for Car, until we get proven otherwise. Yeah, so we also need to do some cross-modifying code. When we keep compiling something, compiling is a bit expensive. So if we can just change the code instead, and update whatever was missing, so we don't have to deoptimize and recompile, we will do that instead. So I'm jumping directly to talk about a JITed call site. When the JIT lays out a call site, we have two instructions: jump-and-link and jump-and-link-register. And when we lay out the call site, since we have a dynamic class hierarchy (I forgot to say that on the first page, but classes are loaded on first use, which means the compiler is not allowed to load classes; they have to be used by the program), we might not know where we're going to call, because we don't want to do a resolve. Resolving the call site might mean we need to load classes. So when we lay out certain kinds of call sites, we need the full range for that call site, which means we have two options: we can either load the address or we can materialize the address. Materializing requires a bunch of instructions. I think the example here is just materializing six bytes or something, maybe someone who is fluent in assembly can tell me. Yeah, normally you would maybe do a table lookup here, but we wanted to actually lay out a direct call where we can, without any loading of data and stuff like that. So that's why the call site looks like this. And for the full picture, it actually looks even like this.
So we actually lay out a smaller call site in the code, which calls a trampoline, which loads the address (which sits just under the jump-and-link in the trampoline), and then we end up at the destination. But as I said, a dynamic call site can be unresolved, which means when we generate the code, we actually just point the trampoline to a resolve stub. So the first thread that actually executes this will need to resolve the call, wherever it's going. So if A is Car.hashCode, when we lay out the code we don't know this; we need to resolve it and figure out what the receiver of the call is. So then we have cross-modifying code. What is cross-modifying code? It's when one core is writing to or changing the instruction stream, and another core is executing that instruction stream. It's a bit complicated, of course, but OpenJDK does it a lot. Avoiding recompilation is basically the goal, because especially during startup, when your class graph is changing all the time because you keep loading classes, if we compile something that looks hot, we don't want to remove it directly and recompile, and remove it and recompile. Instead we can do the speculative compilation, lay out code, and fix it up a bit later. So you can talk about two types of cross-modifying code. Synchronous is basically where you're waiting for the other CPU to fix the instruction stream ahead of you. Here's an example: the modifying processor does a store to the instruction stream, then releases a guard. The executing processor waits on the guard; when it gets released, it picks up the new instructions. It's not that easy, though; picking up the new instruction is not just a simple thing, but I'll get to that. And then you have asynchronous cross-modification, where we just store something directly into the instruction stream. The executing processor might see the new or the old instruction; we don't know, and we need to handle both.
So back to our example here. One thread calls resolve. After it has resolved who the receiver of this call is, it will patch the eight-byte address stored in the trampoline, so anyone else that does this call will reach A. But we still allow threads to see the old destination, which means that both the old trampoline and the new trampoline are valid. If you see the old one, you will hit the resolve stub, you will see that this call site has already been patched by someone else, you just go back and re-execute, and then you pick up the new destination, which is A. Yeah, so, about when the executing thread actually sees the new instruction stream: especially in Zjid, the extension for cross-modifying code, we talk about the point of unification. That means that the modifying processor and the executing processor agree on the global state. So I'll use the terminology from that extension; I'll mention it more later. So we have patched the trampoline. Well, good? No. Someone loads a B, which is also of type A. So we have a new receiver here, and we actually need the vtable lookup. So we need to patch the trampoline once more and add a vtable lookup before we can land on A, because it could have been a B. So the trampoline is not patched just one time; it can be patched, well, I think at most two times, but yeah. And in this case, all three ways of calling are live at the same time, because one thread lagging behind can still see the resolve, someone else might see the jump, and someone might see the vtable. We allow all three to be okay at the same time. We do this, but we have a small piece of code in A which verifies, when you did the jump to A, that you had the right receiver as your intended target. But that becomes really complicated. The main point of the slide is to show that we need to be able to patch the call site multiple times.
So what we're doing here is actually not cross-modifying code on RISC-V, as we do an LD on the eight-byte address, and we actually do a store of an eight-byte address. It happens to sit just below the instruction stream, but it's not read as an instruction, since we do an LD on it. So in this case, we're not actually doing cross-modifying code, since we load the address with an LD. But there are still some problems with this. First of all, as the address is just below the instructions, your pipeline might try to decode the constant as instructions. You also have the problem of reading from the same cache line that you're executing; some processors might not like that, so you have the same cache line in I and D. And we also have the overhead of the jump from A to the trampoline. So, what we are suggesting on RISC-V: yeah, and I can also mention that we need this to be atomically patchable, which is why we can't use the li expansion, since it's seven instructions and we can only patch one instruction atomically. So for this case, we're suggesting that we do the load directly at the call site in A, and we only have the address as a piece of metadata instead of a full trampoline, which means we get rid of one jump. We put the address on a separate cache line. So it should be faster on any RISC-V processor. This is just the general philosophy of OpenJDK, meaning that in hot paths we don't have any synchronization. We allow execution of stale instructions, because, you know, if you have your ISB instruction on AArch64, it's really expensive. We cannot have that in a hot path, since we try to compete with C++. So in slow paths we try to reach the point of unification. If you're on AArch64, that means there's probably an ISB instruction in your slow path. Yeah, and there's a list of other examples of cross-modifying code. The JIT itself is cross-modifying: code is compiled by one thread.
The pointer is installed by one thread, and another thread picks that pointer up and jumps to the JIT code. So that in itself is cross-modifying code. Another case is when you do a field access and the class for the field access is not yet loaded, so we don't know the offset for the field. So we basically say: oh, you need to fill in the offset here. The first thread that hits this path needs to load the class if it's not loaded, figure out the offset, and patch the code. And then you have different barriers for methods, because they can get invalidated and we might need to update them, so we have guards and barriers to protect the method. We can have addresses of objects directly in the code stream, so when the GC moves an object, we need to change the immediate for that object that was moved. We can have GC barriers as immediate values, so when the GC changes color, we might need to update the load barrier to reflect the color change. Yeah, point of unification: if you're running on AArch64, that usually means you're doing an ISB. We don't have that. What we have is fence.i, which is not so good. What we're doing today is something really crazy. For every write we do in a page that belongs to the JIT (meaning we think we're doing cross-modifying code, even though my first example was not), we're doing riscv_flush_icache, which means the kernel will do an IPI on all CPUs and emit fence.i. So every write we do is really expensive. From the last page: if we put in GC barriers, which need to shift color for every load of an object in the instruction stream, we might change 10 places in one method to reflect the change in GC color. So there will be 10 writes just in this method, and that will cause 10 IPIs. That means that on every write we reach the point of unification.
So cross-modifying code is working really well on RISC-V with OpenJDK at the moment, since we basically don't have any races, because we do the IPI on every write. On a really small board, I see it costs half a percent of performance. On a large, real server-class CPU, maybe 2-3 percent of performance is lost due to all the IPIs all the time. Yeah, point of unification: the modifier needs to make the stores visible, and the executing side needs to make sure the instruction stream is invalidated, so it picks up the new instructions. But we still think we can do a bit better with what we have: since fence.i is an unprivileged instruction, we can actually emit it ourselves in the slow path. So we don't need to do the IPI, but we need help with context switches. So you're on your hart, to use RISC-V terminology; you emit your fence.i and think you have invalidated your instruction stream, but the kernel moves you to another hart. So if the kernel moves you, the kernel would need to emit the fence.i, so you know that on that other hart the instruction stream is also invalidated. And what is going to save us, we hope, is the Zjid extension for instruction/data synchronization. So instead of fence.i we would get a new synchronization instruction, but more importantly we will get limits on instruction fetching. The architecture allows out-of-order fetching, which is problematic for us. So if you have a call, an AUIPC plus jump-and-link: even if you NOP it out by first NOPing out the jump-and-link and then NOPing out the AUIPC, the iFetch could fetch the jump-and-link before the AUIPC. So it reads a NOP from the AUIPC slot but the old jump-and-link, and then you're toast. So Zjid will specify how the iFetching works, what we can overwrite without tearing instructions apart, and stuff like that. So we're hoping we get that in place this year. How long have we been going? Okay, that's fine. Yeah, that brings me to extensions.
We have a bunch of extensions. When I looked (maybe this is totally wrong), I found 60 ratified extensions adding instructions for RV64; that's some 450 instructions on top of the base. And I found 45 unratified, adding another 400 instructions. As an example, this fall I was looking at CRC32 a bit. OpenJDK has an implementation of it in Java, which works fine, but you probably want an intrinsic for it to make it faster. So you can make your table-lookup intrinsic with the base ISA, which is the standard CRC32 intrinsic. But you can also use carry-less multiplication for an even faster intrinsic. Then you have scalar carry-less multiplication in the Zbc extension, but you also have carry-less multiplication in vector. So there's the possibility of four implementations of the same CRC32 algorithm: one in Java, one for the base ISA, one for Zbc, one for vector, which is too much. Also, at least I am getting really annoyed with the architecture description you pass to your compiler; and this is just the first of four lines. If you have a server-class CPU, I'm not sure how long that can get. So, as Ludovic was talking about profiles, we're hoping that we get nice profiles. Right now RVA23 is perhaps the one that looks best. And for the JIT, you need to add an option for every one of these. But we have hwprobe, so we can get it automatically. But there is this chain: you get an extension, you add an option, then you get hwprobe support. So basically you need a 6.9 kernel or something to make everything work nicely; 6.8 maybe is the next one. So I recommend using 6.8, which is released in, I don't know, because otherwise you need to add all the options on the command line. This brings me to the next problematic thing. We have some major choices, like: does your CPU allow misaligned access? Do you have vector? What is your memory model? We allow turning things off. Yeah, so the JIT, since we do this cross-modifying code and stuff, is really sensitive to code layout.
So if we change anything in the code layout, we would like to test it. Since you have so many options that change the code layout from the JIT, we have so many combinations that we would like to test, but we only have basic boards and QEMU. That makes it really hard to guarantee that your combination will work fine, because I guess everyone is testing the combination intended for the CPU they're targeting. So I think there are a lot of combinations which are not tested much at all. We also have compressed instructions. Yeah, we have an option for it; you can turn it on and off. We have an assembler that just changes the instruction for you if you want. Since we're sensitive to code size, some parts are fixed-size, so just to make it easy for us, we turn off compressed in certain parts, because we want them to be at a certain alignment or a certain address. We see a 5-10% code size reduction. One thing we could do better: since, you know, compressed instructions only have a few bits for the registers, and we don't consider that, we just use registers. For example, we have the heap base. If you have compressed pointers for your objects, we have a base register for them, which means every time you load an object we need to materialize the full address. That one is in X27, which means we can never use compressed for that. So if we were to put the heap base in another register, like X14, then we could use compressed more. Next, which Ludovic touched on: memory models. You have your weak and your strong hardware model. In OpenJDK, we're often dealing with three more models. We have the HotSpot memory model, which is from the 90s, I think, so it predates C++11 and C11. Then you have your Java memory model. Then you have your C++ memory model. Since we have two hardware memory models, we get a lot of mapping between all of those, so we basically have six combinations here. And extensions also increase the complexity, because then you have things like Zacas, which introduces CAS, which means we need CAS mappings for the memory models as well.
So yeah, again, if we're going to test all combinations, it will be really costly. Yeah, sign extension. Maybe it's just me, but I'm not friends with it. So sign extension is when you have a word and you need to enlarge it. Oh yeah, I only have a few minutes, so that's good. You want to enlarge it to a double word, so you need to replicate the sign bit; we preserve the sign of the word when we treat it as a double word. And we do this because some of the instructions use the full register: branch and OR, for example. So this is all fine when you let the compiler do the work. But we have so much assembly, and we do, yeah, typeless passing: we have templates with inline assembly, so you get a type T and then you're supposed to put in your inline assembly. And we have type aliasing, meaning we have one type and we access it through a pointer to a different type. So when you write all this, you need to think about both the short representation of your word, but also about the word as eight bytes. So I get confused, and suddenly my branches go somewhere else, because I forgot sign extension. So I'm not a fan of it. Yeah. And the sign of NaNs: I don't have much to say beyond what Ludovic said. I had one example here. If you're writing Java code and you use this method, you can be surprised, because if you have a negative NaN and you ask for its sign, you don't know what the bit will be; it depends on the instructions and stuff. And the C++ version is even more complicated, because the compiler may choose to evaluate it at compile time, which means you get whatever the compiler thinks the sign bit should be; if you execute it at runtime, then it depends on the instructions. So if you see anyone using functions like that without considering NaN (not-a-number), there might be a bug. So sorry, one too many. So yeah, I personally like RVA23, but of course I want Zjid, so we can formalize the cross-modifying code.
And also some of the more atomic extensions: I think Zacas is just optional in RVA23; I would like it mandatory. And I would also like one more instruction to materialize a 64-bit immediate; that would help. Because of the load we're doing in the trampoline: even though we remove the trampoline, we're still doing a load, which means we can have cache misses, which means that the call can be really expensive. And all the additional loads we need to do, for the JIT itself or for the JIT code, cost memory bandwidth. So when you're competing with other platforms, which can materialize a large enough immediate and have it atomically patchable, it's hard to compete when we can't do that in those cases. But I guess the road to one instruction that materializes a 64-bit immediate will be long. Thank you. Yes. Two questions. First of all, is there a Linux interface to emit those fence.i's through an IPI? For the IPI, you can use the glibc __riscv_flush_icache. So there is a glibc function you can call which does the syscall for you and fixes it. Yeah, I can't remember if that changed or if we're using the glibc wrapper; so there's a glibc wrapper over the syscall, and you can just say: I want to flush the icache. I can't hear. Yeah. I haven't given it much thought. I'm not a big fan of compressed, so I don't mind what we're doing now. It's just that, from what I've seen, it's the smaller boards which gain performance from compressed. On the big out-of-order CPUs we're waiting for, we don't think there will be much difference, so we haven't spent time on it. I forgot to repeat the question. Yeah, sure. So: I'm not sure if you were able to measure the code size decrease, but were you actually able to measure any sort of performance difference? Using the VisionFive 2, I've seen some performance improvement, but that's an in-order, simpler CPU. So yes, on the VisionFive 2 I see some performance improvements when using compressed.
Yes. And you're using the VisionFive 2? I have one at home. We have many boards, but that's the one I have sitting next to my desk, so I often use it. So yeah. Well done: no corrections from the OpenJDK crowd.
A framework for RISC-V SBI verification and ISA extension validation
I'll get one more minute. Sure. You got sick just kind of out of the blue, and then, like, didn't have anything? I thought it was. It was so sudden. Sick sick, or sick as in...? Yeah, I've never known. Yeah, I've got the beer-induced one. He's the last speaker, and he's our hero that fills in for, you know, the one or two missing. Yeah, so, take it away. Yeah, thanks, Björn, for letting me fill in. I originally had wanted, like, a 15-minute session, just to kind of advertise this framework, because I'd like to encourage people to contribute to it. I ended up with a 30-minute, or however many minutes the session is, because of the cancellation. You'll have an hour. Don't worry. No, no, no, there's lunch, and I can do it four times, maybe. Anyway, quickly, just about who's standing in front of you talking: I work for Ventana. I work on the Linux kernel, also KVM, OpenSBI and QEMU, and I'm trying to build, you know, the software system that we need for RISC-V. I'm also participating in these RVI working groups, and RISE, which we heard about earlier today as well. Prior to RISC-V, I worked on AArch64 at Red Hat, also virtualization, so the Linux and KVM bits, QEMU as well. I've carried the virt stuff that I did previously over into the RISC-V world. I got involved with kvm-unit-tests, which existed before my time, because it's quite old. But I started, well, I wanted to use it for AArch64 specifically, and so I did some ports. I also ported it to PowerPC and then kind of left that for others to maintain; I don't think it's getting a lot of action, but it's there. And I'm bringing it to RISC-V, and that's what this talk is about: the fact that we now have this tool available to us. So the outline is just kvm-unit-tests: first, I'll give a quick overview of the framework generally.
And then, regarding RISC-V, the use cases I see that we could apply it to right away, and also as the framework evolves. And then the "and you" part is my kind of appeal for contribution. So, as I said, kvm-unit-tests is actually quite old. It's as old as KVM. Avi created it shortly after his first couple of KVM commits in order to start testing. So, to make sure it actually works. And over that time, though, we've been expanding its targets. So now we can actually test not just with QEMU as the user space, as it was originally, but with kvmtool, or you could probably put in rust-vmm or whatever you want in there, crosvm. I mean, with some effort, probably; it doesn't just drop in at the moment. But you can already test other hypervisors. People do that. And we can even test it on hardware now, because we've added, at least for x86 and arm64 at this point, the ability to boot over some sort of an EFI-capable bootloader. So then, what are these tests, actually, that I keep talking about, these KVM unit tests? They're actually like a little tiny guest kernel, because that's what Avi needed for testing KVM, right? He needed to have a guest OS that would boot and maybe exercise some stuff that the hypervisor needed to provide for it. So that's what they are, these little guest kernels. Originally they, you know, kind of booted in maybe hacky ways or whatever. But over time, we've actually tried to build the framework in a way that is easy to port and easy to maintain. And so we even have DT support in there, some limited ACPI support for this booting. Like I mentioned, we can boot with the EFI protocol, which helps us to be able to do the booting on hardware directly rather than through a hypervisor. And then for arm64 and RISC-V, I've also taken my notes from the Linux kernel's boot requirements. So, you know, particular registers need to be set in a particular way when you first jump into the kernel code.
And so we follow that protocol, and then it makes, you know, everything just kind of work for bootloaders that already know how to do that. Any bootloader that can boot Linux in this direct way can boot these unit tests. And so, yeah, you're in privileged mode, because it's like a little kernel, in kernel mode. So you can do all the things that you would do: manipulate the page tables, set up your own exception handlers, generate exceptions and make sure they do what you expected them to do, things like that. You know, you're privileged, so go nuts. So, despite the fact that we're actually writing kernel code, we don't have to make it complicated. We don't have to make it something that's hard to do, or at least feels hard to do at first look. So the framework tries to allow the unit tests to be written in a C-application type of way. So they kind of look and feel that way. You've got your main function, which is actually the entry point for the test. And then we have a bunch of libc APIs ported over, not a bunch, but enough for most tests. And you are, of course, welcome to add more as necessary, whatever kind of looks like it's needed. So all your expected ones: assert is there, which is, you know, of course one of the most important ones for a test framework. Also, with the scripting wrapped around these tests, when you execute them, at least over QEMU, then when you get an assert or any sort of an unhandled exception, you actually get a backtrace, on all the ports that support stack walking. So we have that, and then this is just a little snippet of code to show you that, you know, don't be afraid. It's just C, and very simple; it's just main. Even environment variables can be provided to the unit tests. For that, we do a little trick where we take a text file of environment variables. So, you know, your usual key=val, just a whole list of those. And we put them into an initrd, so they're in a RAM disk.
And we can just read them out of there; we can find it through the DT, all that stuff, just like we're supposed to. And then we can load those environment variables into memory, and you can use them like a normal C program. So that can also be nice for passing in your expected values and whatnot for unit tests. You can also pass in expected values on the command line, of course, which is a little bit easier to do. But, you know, if you have too many of them, it gets kind of ugly. So, of course, at least people who want to test on hardware are also free to manipulate their device tree in any way they want. So they could create a special node for test cases, sure, why not. And then the unit tests would just, you know, parse that node and get all their input. However you want to do it. So how do you run the tests? So originally, it was, you know, from the command line, just for running KVM guests. So that still, of course, works. You can just pass the test as the kernel, the -kernel parameter to QEMU. Depending on which KVM user space you're using, you'll do it in some similar way. There's also some bash wrapped around all of that stuff. It allows you to run all the tests automatically, so it can be built into CI very easily. And we do have it built into many different CIs already. Or you can run just a single group. And then, why bash? I mean, some people wonder why, because it gets kind of awkward to add more advanced functionality to the test harness having to write it in bash. It was historically in bash, is probably the main reason. But then we actually had a discussion a couple of times, like, should we use Python or whatever, Go, whatever the latest thing is these days. It would be a little bit easier for the harness. And we had some pushback from people who have been using this framework quite a lot.
And they like to have a very lightweight framework that they can put on an embedded, you know, BusyBox type thing. There's nothing there except for bash. And they didn't want to bring in libraries and everything else for something else. So, bash is not that painful. We don't have that much functionality. So I don't really have a problem with it. Another thing we can do is build standalone tests. So nothing changes, except `make standalone`. And it'll actually wrap a lot of that bash around the binary, after it converts the binary with base64, to be embedded all into one nice text file each, depending on how big your test is. And you can actually just email that or send it to people. So if you build a quick and dirty test, and I'll get to talking about quick and dirty tests a little later in the talk, if you do that, like, you know, a few lines just to prove your point that this is broken, then maybe you just want to package it up with this make standalone thing and mail it to somebody. They can run it and see for themselves. I don't think that's used a lot. That was one of the things I invented that I thought would be useful, but not too many people have been mailing these tests or whatever. So now we know what the framework is, and this is a RISC-V talk, so we finally get to RISC-V. We already have a use case for it. The tech-prs working group has more or less committed to using it for the SBI verification framework. So the SBI, for those of you that don't know, I guess most people in this room do, is the interface between either supervisor mode and M-mode, machine mode, or also between virtual supervisor mode and the hypervisor. And so we, the RISC-V community, are trying to keep that interface from going nuts in all sorts of different directions. We have a standard for it, the SBI spec.
And so, when we want new functionality, when the supervisor needs to ask for some service or some information from M-mode, or we want to emulate that M-mode for the guest, then we need to provide this interface, right, this SBI. And so as we add these functions to the spec, we explain in the spec how it's supposed to work, the parameters, etc., like usual. Then it would be nice to have a verification framework for that, so you can also say: okay, you've written a nice addition to our spec, a new SBI extension, please show us, you know, how it's supposed to work. And you could do that, and we do do that, with Linux proof-of-concept code. We always submit patches for Linux and also for OpenSBI or RustSBI that show that, you know, it works, right? We prove our extensions. But with the verification framework, we can actually avoid having to focus on any specific project, or people having to involve an entire Linux kernel for the test. They can just do this quick small thing here. And so the idea is to try to build all those function tests in there, and have regression tests for that as well, for everybody's SBI implementations. So we can test already; right now you can start writing tests for OpenSBI. It's quite easy to run over QEMU; you don't need hardware for that. And with QEMU, you can swap out OpenSBI and drop in RustSBI. That also works over QEMU. Probably other SBI implementations can be run from QEMU too. Of course KVM is an SBI implementation, because it emulates it, so you can already test that as well. That's one use case already, which could be started now. Then there's CPU validation, as people actually get CPUs to validate, and when we get the EFI support merged. So I haven't done that yet. I'll come to that too, with the current status.
But when we get that done, then you'll be able to just take these tests directly, boot them from U-Boot or UEFI, and you'll be able to do some validation tests. So ARM does that, I'm quite aware, because they've been involved with kvm-unit-tests for a long time now. They're doing their memory model litmus testing; they use kvm-unit-tests with the EFI support to go straight on hardware and run that. So microbenchmarks are another great use case for kvm-unit-tests. Because while you can always find a way to create some sort of a privileged-level test, where you write a kernel module in Linux, and I used to do that a lot, just like in the init of the module I would have my whole test case, and then I'd just modprobe it and now it runs my test, right, privileged. But that's kind of awkward to begin with; it's not a real test framework. And it also requires Linux to be booted up and working and everything. And it's not very good for a microbenchmark, because you've got Linux doing whatever Linux wants to do. And so you're not really isolating your instruction sequence. But with kvm-unit-tests, you know, the world is yours. The unit test is running there and nothing else. So it's actually quite good for that. When you get your timing numbers from that, they're pretty reasonable to trust. Question. Yeah. So in this diagram, where is the test? Ah, yeah. So the test is either this guest kernel or actually the host kernel. It's one of those two. So if it's bare metal, if you just launch it from the bootloader, it'll be the host. That support isn't in the RISC-V port yet. But you can already do the guest kernel version. Okay. So, yeah, the tests are easy to write, as we already talked about. And the quick and dirty ones are even easier. So, I do this a lot. Because I'm familiar with the test suite, I use it as a tool while I'm working on something else.
Like something for Linux or whatever. I use it just for my own testing purposes. And then it's kind of ugly, and it doesn't really look like something people would be interested in anyway. It's too, like, one-off. And so I just kind of toss it. Or maybe I keep it for myself to look at later, but it's not shared, which isn't really a very good open source approach. So I've actually been thinking about that: for these types of tests that don't really necessarily fit what we consider the main test suite, maybe we should have a separate branch for them. So we still collect the code. And I kind of did that already. I recently wanted to test TCG. So, I kind of forgot to mention that for CPU validation, we can of course also test our, you know, emulators and our other models to see if they're correct. So TCG is, you know, QEMU's emulation framework. I wanted to make sure that the MMU model that it had was able to handle the accessed and dirty bits correctly, because there are actually a couple of different ways to do it in the spec. And QEMU had picked one by default. And then a couple of extensions came along that actually allow you to decide which one you're going to use. And a new bit was added, which is actually going to require another SBI call. So we'll come back to the SBI verification for that. Anyway, it kind of, you know, balloons, as we know. And I wanted to make sure it was actually working the way it's supposed to right now. So I wrote a test case in kvm-unit-tests. And then I wasn't sure, okay, this is maybe not one that we're going to merge, because it's just a one-off test. But I've already decided it at least goes to a branch, so we keep track of these things. And then, you know, the other reason for posting them, even if they don't get merged in the end, or at least not to the main branch but to the side branch, is because when people do post tests, sometimes they reinvent something they need inside the test case to get the job done.
And that looks like something, oh, we should probably pull that into the common code, right? We can let the framework evolve better the more people contribute. And there's no one-and-done. Usually I write something, some quick and dirty test, and then like three weeks later, I'm like, oh yeah, I actually need that again, because something similar is broken or whatever. Yeah. I think I talked about everything on this slide. Those are some links. And, yeah. So one thing I was going to do, because I have way more time than I need, I was just going to show that test that I just got done talking about. So it's a little bit more complicated than that little snippet that I shoved in the slide. But you can see that it's still not that complicated, right? Oh, yeah, sorry, maybe someone can try to brighten the screen somehow. Yeah. I don't know if I can turn off the light. Just smash it with a hammer. Yeah, you know what, maybe I can go to a black background and just cat the file. It might be better. Is this better than before? Yeah, because a black background is better. Don't touch that. That sounds like a fire hazard there. Anyway, I'll just, you know, kind of slowly scroll through it. Just to show you that really you can build these tests with like 100 lines of code, and they achieve a pretty reasonably good goal, like making sure that an MMU behaves correctly in like three different modes. So, yeah, I don't know if there are any particular lines here I want to point out; I just wanted you to get a feel for what a test would look like if you guys decided to go sit down and write one. You don't have to, you know, learn a whole big framework with some bizarre-looking APIs. The APIs that we have are minimal to begin with, so you're going to write your own functions. But when you do need them, you know, they're pretty self-explanatory, and it's C, so you can just grep for anything you need to know.
And yeah, that's the bottom of the file already. It's only like three page-downs. So... So, does the actual return value get used? I mean, I noticed you're carefully returning a report summary. Yeah. But does anything actually look at the return value of this main? Yeah, so CIs will do that. So, like, this will dump a summary to the screen. So, if you're just running it yourself, which I guess I might as well go ahead and... So... Yeah, you know, I'm feeling brave. So, yeah, you can just run it. And then it'll dump... Yeah. It'll dump stuff like this out. And then CIs will, they know how to parse that, right? So, they'll be looking... And we have those, you know, report and report_pass type APIs to try to make sure you get a nice, consistent format so that it's parsable. You know, we don't use TAP. Maybe we should. We've done that in a different test suite that I'm involved in as well, KVM selftests, that's in the kernel. We're not there yet, but we're starting to migrate to TAP for that one. Yeah. This one, we have kind of our own thing going. We've had it a long time now. Anyway, so that's like one, and then there's like this... Yeah, there was another test. You probably saw it said skip. And it's skipping because I didn't give it an environment variable. Let's see. Yeah, that's the file. So, this is that text file I mentioned before. You can create just, you know, plain old text with all your environment variables. And then when you want to pass it to the thing... Oops. It passes... Like this. And we'll just run that one group of tests this time. The thing about a live demo is I have to type in front of people. And so now we're not skipping anymore. Now we're passing, because I gave it the mvendorid, which is zero for QEMU. And it matched. Working demos, working, passing, passing SBI test. Yeah. You showed the failing test also? Oh, yeah. Yeah.
I want to see that it's true. Yeah. Good challenge. Forgive what I called this one. There we go. So, yeah, this is that other one, the MMU testing. Oh, yeah. And so now, here it is: failing. It's skipping. Well, that's failing, but skipping. And that's because this CPU, the default CPU, is missing the extensions needed. So we can fix that, of course. Something like this: we can actually add the extensions. So, it's still not there, because... Oh, no, because that requires an extra step of adding an SBI implementation that allows you to turn on the hardware A/D bits, and we don't have that yet. That's actually, we need to add an SBI extension. I think we're going to call it FWFT, allowing us to tell SBI to flip bits in registers like the machine environment config enable bits. Because if you want to turn on this particular feature, you need to be at the machine mode level to be able to do that. So I can't do it from the S-mode level. And so I actually hacked OpenSBI to let me do it, to test this out. And I'm not going to go look for that in a live demo, but, yeah, I have that. It does work. Yeah. So what's in the run_tests.sh? So you wrote the C file, right? Again? So then you had this run_tests.sh. Yeah. So, did you write it as well, or is that the test? Okay, run_tests is just the thing in the test suite that kind of pulls everything together. So, if we look at this one, for example, this on the screen, the log here shows at the very top, which is at the bottom of the screen, this timeout, et cetera, et cetera. So that's actually the command line that run_tests figured out how to compose, based on some configuration files and stuff.
And then this is the output of that. There's this configuration file that you can provide for your groups of tests or for individual tests, allowing you to tell run_tests what to do to pull it all together. I mean, of course, you can also just do the command line manually. And I do do the manual QEMU command line when I want to, like, do something with GDB, or, you know, make sure I get the addresses dumped out and I can find them with objdump or something. So I don't always do everything through run_tests. Actually, very rarely. That's more for the CIs, after you've got the thing working. Which one? No, that's already there. That's static. Yeah, it's committed to the repo. Yeah. Nothing in the scripts is automatically generated, except for when you do the make standalone. And then you get, might as well show that because we're in demo mode now, you get this guy, which is generated. So this bash script was automatically generated. All this junk is the base64 of the actual test code that was written in C. Yeah. And then, you know, some of this stuff is just kind of extracted directly from other scripts that are used by run_tests, and it's just chucked in there. And now it's all one unit. Yeah, you could put anything in there. I mean, don't trust someone to send you a reproducer. Yeah, it could be, like, sure. Yeah. This is for developers passing things among trusted friends. Yeah. Make them sign it, or, yeah, sure. I mean, yeah, absolutely anything could be in there. Right. Like "enter your password". Please. Thank you. Those tests are very similar to what kselftests does. Are those tests integrated into kselftests, and if not, do we have such plans? Yeah. So the question is more or less, how does this relate to kselftests? Yeah.
So there's definitely overlap in what is tested. The frameworks are quite different in how they work. There's more overlap between this particular one and KVM selftests, which is in kselftests; that's one of the many subdirectories in there. KVM selftests has probably become the main place we add new tests for KVM. So actually, you may have noticed I did an entire presentation on kvm-unit-tests, and I think I said KVM only when I said the name of the framework, but I never actually talked about testing KVM. We do that still. We have CI that's specifically testing KVM using this framework, but now we usually use KVM selftests for the new ones, and some of these are even being ported to that framework. And I'm seeing that this one is going more towards the testing of hardware, or other hypervisors are still using it, and stuff like that. But, yeah, KVM-wise, and actually I talked to Paolo about that yesterday, on my third beer or whatever. I was like, you know, KVM selftests is the way of the future for KVM testing, and I'm not going to really talk about it too much tomorrow, when I talk about kvm-unit-tests. And he said, ah, but kvm-unit-tests are still easier to write, and he's right. Like, you can write a test case quicker, faster here. So if you're doing KVM testing and you want to do those quick and dirty tests I was talking about, you might jump to this one first. Because the other framework, well, it's growing library support quite fast, but you have a little more boilerplate code and everything you have to do, because when you write a test there, you're writing both the user space code and the guest code simultaneously. And here you only do the guest code.
So, initially, we can just simply write a test here, and if it's worth it, then we can move it to KVM selftests, with the bigger overhead. Yeah, yeah. And for your question on the other kselftests stuff: like, there's a RISC-V directory there too, right, where we test some instructions. That stuff is good. We need that too, but it's user space only, right? Yeah. So this is down at the kernel level. S-mode. Okay. Thank you. Any other questions? No? Let me, let me appropriately go to the last slide. There. Thank you. All right. That's it. See you next year.
The best `case` scenario
Yes, sorry. So let's talk about case, which is a keyword that hopefully most of you have used; if you haven't, it's okay, we're gonna go through it. And we're gonna figure out how we can use it, how it works, how we can use it better, and what the latest versions of Ruby have given us to play with this operator more. So yeah, that's more or less what I'm talking about. So, just in case, we're gonna go through what case is, what the different syntaxes are, how you usually use it, and then we're gonna look at how it's implemented, which is terrifying, and we're gonna have a small dive into how the Ruby VM works and the instructions and stuff like that. After that we're gonna go through several use cases; some of them are pretty basic, some of them I think are pretty cool from a Ruby standpoint. And finally we're gonna take a look at pattern matching, which has been coming to Ruby since 2.7 and is mainly operated right now using the case keyword. So let's start. What's a case? So, does anyone not know what a case is, or has anyone not used it? Cool. So that will go fast. So basically a case is more or less a big if/else; that's usually how people think about it. So you have your case, you have your different branches, and then you match each branch against your case. And depending on the branch that matches, you go down a different path. So in this case we can assume that, I don't know, status is something you get back from an API; you match it against different cases, and then if you have a success you proceed, otherwise you want to fail depending on what you have. If you want to, you can be even more compact by moving the stuff up a line and using then, and if you want it to be even more compact, you can even add more things to a branch. So if you want different conditions to go to the same branch, you can separate them with a comma. So that's basic case.
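The forms just described, put together in one sketch: the compact `then` style and comma-separated values sharing a branch (the status values here are invented for illustration):

```ruby
# Basic case: plain branches, the compact `then` form, and a comma
# ORing several values into the same branch.
def handle(status)
  case status
  when :success         then "proceed"
  when :error, :timeout then "fail"
  else "fail harder"
  end
end

puts handle(:success)  # => proceed
puts handle(:timeout)  # => fail
puts handle(:wat)      # => fail harder
```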
One interesting use case that I don't think I've ever seen before, I don't know if it's useful, but it's still cool to look at: you can write a case without anything at the top, just an empty case, and then it behaves exactly like an if/elsif. So you have to use the usual predicates, the same way you would in an if/else. So I'm honestly not sure that has much interest, but it's cool. So, how does case work? And in general, I kind of also wanted to take the opportunity to talk a bit about how anything works in Ruby, and how, when you're debugging something and trying to figure out how something works, you can go deeper into your code or someone else's code. So if, for example, you have a method that you've written or someone else has written and you don't know where it is, so let's say you're in a big code base and you have 20 methods called, I don't know, count or show, and you don't know which one is being resolved: in Ruby, everything's an object, as you might have heard before. So are methods. And you can, on any instance of anything, call .method with your method name, and then you have access to two methods that are pretty cool. One is called source_location, which will tell you in which file it is. So that's interesting when you don't know which method is being resolved. And another one is just .source, which will print out the source in your terminal. Just plainly. So that's interesting also. If you're looking for something more low-level, so a Ruby method, like Array#last or Integer#next, and you don't know how it works and you don't know where to go, you're kind of stuck. You're going to have to go read the fabulous manual of Ruby to figure out where it is. But in our case, we're kind of one level deeper, because we're not looking at a Ruby method, we're looking at a Ruby keyword. So if you go to the documentation, you're going to find how it behaves, but you're not really going to be able to see the source code per se.
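Two of the tricks above in one sketch: the caseless case, and `Method#source_location`. (`Method#source`, also mentioned, comes from the method_source gem rather than core Ruby, so only `source_location` is shown; the `describe` method is invented for illustration.)

```ruby
# The "caseless" case: nothing after `case`, so each `when` is an
# ordinary truthy test, exactly like an if/elsif chain.
def describe(n)
  case
  when n.zero?     then "zero"
  when n.negative? then "negative"
  when n > 100     then "big"
  else "small-ish"
  end
end

puts describe(0)    # => zero
puts describe(-4)   # => negative
puts describe(500)  # => big

# Methods are objects too: grab one and ask where it was defined.
file, _line = method(:describe).source_location
puts file  # the file this method lives in
```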
So in this case, one way that I've used to figure out how the internals of case work is to go look at the Ruby VM instructions. So, big-ish caveat for the next couple of slides: that's the very limit of what I'm trying to understand this year. I'm kind of in that phase of my Ruby journey where I want to understand how things work. So if I say something outrageous, stop me. So, from my understanding, the Ruby code that you write goes through a journey before it is compiled and interpreted. So your Ruby code first gets turned into tokens. So for example, you can imagine that your entire program gets turned into a big array of syntactically relevant stuff. So that could be def, for example, or an open parenthesis, or a space, or part of a string. So everything gets turned into a token. And then those tokens get organized into something called an AST, which is an abstract syntax tree, which is really hard to say. And basically what an AST is, is that big array, but formatted into something that is more understandable. So if anyone has ever played with RuboCop before, that's probably where you've seen something like that, because you have to play with the syntax tree when you want to write your own cops. So the tree is composed of a lot of nodes, and each node has a name. So you have a class node or a method node or a begin node. And then inside the node, you have all the relevant information for that specific class or method or begin block or anything. And then all of that tree gets turned into virtual machine instructions. So that's the part where what I'm going to talk about probably only works on CRuby. I'm not sure this applies to other implementations of Ruby, like TruffleRuby or JRuby. It probably works a bit differently.
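The first two stages of that journey can be poked at from irb. A small CRuby-specific sketch, assuming Ripper (stdlib) for the tokens and `RubyVM::AbstractSyntaxTree` for the tree:

```ruby
require "ripper"

# Stage 1: tokens. Ripper.lex returns [[line, col], type, string, state]
# tuples for every syntactically relevant piece, `def` included.
tokens = Ripper.lex("def foo; 1; end")
p tokens.first(3).map { |(_, type, str)| [type, str] }
# => [[:on_kw, "def"], [:on_sp, " "], [:on_ident, "foo"]]

# Stage 2: the AST. Every node has a type and children (CRuby-specific;
# the root is a SCOPE node wrapping the actual body).
root = RubyVM::AbstractSyntaxTree.parse("case x; when 1 then :a; end")
body = root.children.last
p body.type  # => :CASE
```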
So if we look at the case that we were looking at before: in the Ruby console, you have a class called RubyVM, which gives you access to any tool you might want, to turn your code into either the tokens or the tree or the instructions. You can end up with all of this, which we're going to try and go through. So first of all, in case you've never used it, the Ruby virtual machine, the one from CRuby, is a stack-based VM. So everything in the VM interacts with a stack. So you end up with a lot of instructions here that just interact with the stack. Like the putobject over there just puts an object on the stack; topn finds an object and copies it to the top of the stack. And you have a lot of things like that. So in our case, if we look in detail, we can see a few things. So first of all, here we're mainly preparing the stack, and here we have something, here we can find the status that we had over there. So this is basically calling status to fetch the value that we want to match against. And under this, you have a Ruby VM optimization called case dispatch. What this does is, in some cases, if you're using a simple case with simple objects inside of it, like strings or integers or symbols or stuff like that, it will create a hash where the keys are basically this and this, and the values are the number of the line in your VM instructions that you need to jump to. So what that means, at least the way I understand it, is if you have a lot of if, elsif, elsif, elsif, it will usually be faster to build a case. Because you're losing some time here to build your hash, but then whatever case you want to go to, it's just a hash access. Whereas if you're doing a bunch of if and elsif, you have to go through each of them to see: does this work, or does this work, or does this work, etc.
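The optimization being described can be seen by disassembling a small case yourself. A CRuby-specific sketch (the `status`/`proceed` names are from the talk's invented example; the instruction shows up as `opt_case_dispatch` in the disassembly, at least on recent CRuby versions):

```ruby
# Disassemble a case over simple literals and look for the hash-backed
# dispatch instruction. compile only parses; nothing here is executed,
# so the undefined method names are fine.
code = <<~RUBY
  case status
  when :success then proceed
  when :error   then fail_it
  else fail_harder
  end
RUBY

asm = RubyVM::InstructionSequence.compile(code).disasm
puts asm.lines.grep(/case_dispatch|checkmatch/)
```

For a case over non-literal values, `opt_case_dispatch` disappears and only the linear `checkmatch` walk remains, which is the fallback path described next.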
If we go a bit below, we can see what it would look like if we needed to go through each of the branches to see which one works. So here we have our success symbol, which was our first branch. And what this does is, it's going to compare it to the status using the triple equal method. And that's the cool part of case: that's technically what's doing the heavy lifting behind it. And if that triple equal works, then it's going to jump to instruction 28 below. If it doesn't work, then it's going to keep going. So, second branch is error. So we're going to take error, put it on the stack, compare it to status. And if it works, we go to 33. If none of those work, if you remember the case, then that means we're in our error case, or like our else, which is over there. So if none of those work, we keep going down our instructions and we end up here, call the fail_harder, and then leave, which is instruction 28. And then under that, you have the lines that you would have jumped to if anything had worked before. So the 28 here, which will call the proceed, and the 33, which will call the fail. So that's more or less the instruction pattern of a case. So that answers our question from before, right, of how a case works. And the simplest answer that I can give is: it works thanks to triple equal. That's what it's going to use to match everything against everything. So if we want to push case to the limit, the question that we want to answer now is: what implements triple equal? And in Ruby, that's a bunch of classes. And the interesting thing, and the main reason I wanted to do this presentation, is that depending on what you're calling triple equal on, it will behave differently. So the simplest example, that we've all used, is all the base classes. So strings, integers, floats, arrays, hashes, anything you want. And in this case, it checks for equality. So that's the thing we've seen before.
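For the base classes, `===` is plain equality: `when "200"` runs `"200" === status`. So the messy-API-status case described next boils down to a chain of `==` checks, with the comma ORing values together (the values here are invented):

```ruby
# Base-class === is equality, so this normalizes whatever shape of
# "success" a sloppy API might hand back.
def normalize(status)
  case status
  when 200, "200", :ok, "success", true then :success
  else :failure
  end
end

p normalize("200")  # => :success
p normalize(:ok)    # => :success
p normalize(404)    # => :failure
```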
You might have seen that code: you get a param that has a response, and then you don't know what the fuck the other person in the API has done, whether it's a string, or a 200, or a success, or a true, or true as a string, or anything. So you do your case and you match it against whatever and try to figure it out. So in this case, it's always going to check for equality. So here with the comma that we've seen before, it's one or the other or the other. And then you have arrays, you have hashes. Otherwise, yes, you can give up. Another thing that implements triple equal with another behavior is classes and modules. On classes and on modules, triple equal checks for, I don't really know how to say it in one word, checks for type, for ancestry. It's a bit like the is_a? method of Ruby. So when you have an object and you ask "is my dog an animal", it's not only going to check the class, it's going to check a bit above, to see if Animal is included in it if you're going the composition way, or if it inherits from Animal if you're going the inheritance way. And that's more or less what we can do here, for example, with errors. Say you have your code and you've defined a bunch of different types of errors. And you've tagged some of them, maybe, as ignorable. So if it returns any error of that type, then I want to ignore it. If it returns those two different errors, I want to return a not found. If someone forgot about safe navigation, I want to tell them. And then a lot of errors, for example in Rails, and I'm assuming in Ruby, not entirely sure, don't quote me on that, inherit from StandardError. And so those maybe you want to raise, but if you have something else, then that's probably lower level, maybe a PG error if you're dealing with a database, and then you want to do something else. So that's it for classes and modules.
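A sketch of that error-routing case; the custom error class names here are invented for illustration:

```ruby
class IgnorableError < StandardError; end
class RecordMissingError < StandardError; end

def handle(error)
  case error
  when IgnorableError     then :ignored               # Module#=== checks ancestry, like is_a?
  when RecordMissingError then :not_found
  when NoMethodError      then :check_safe_navigation
  when StandardError      then :reraise
  else                         :low_level             # an exception outside StandardError
  end
end

handle(IgnorableError.new)  # => :ignored
handle(RuntimeError.new)    # => :reraise
```

Note that order matters: the more specific classes have to come before the StandardError branch, or it would swallow them.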
Another type of classes that implement triple equal are ranges, which I'm assuming most of us have already used; they check for inclusion. So for example, if you have an integer at the top, then you can check that it's included in this range or this range. And it works with the endless ranges of Ruby. I mean, you might as well use an if/elsif and just check that it's greater or lower than, but it's good to have options; you never know. And one thing that I found that could be cool if you're working in networking: IPAddr works the exact same way. So you can define IP addresses with their masks and everything, and then have them act as ranges, and then check that your IP address belongs to one or the other. This one we've all probably used as well: regexes. This one checks for a match. It's the exact equivalent of matching your regex against a string. So that's a kind of real use case that I have from the company that I'm working for, where we manage a lot of messages between clients and providers. And so we want to check in those messages that they're not trying to bypass us, for example by sending an address and trying to meet somewhere, or that they're not sending sensitive information, or, sometimes, people can't keep their dick in their pants, so we have to be careful about that also. Stuff like this, right? So this one checks for a match. Probably one of the most interesting examples, but also the one that I had the most trouble coming up with a good example for, are procs and lambdas. On procs and on lambdas, triple equal calls the lambda and gives it the object that you're matching against. So for example, here we can define simple procs or lambdas that just delegate to another method. So for example, unknown_host will take an element and then check if the host is included in a list of something. Oh shit, yeah, I've done it again.
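Backing up a moment, the range, IPAddr, and regex behaviors can be sketched like this (the age buckets, network, and message pattern are invented for illustration):

```ruby
require "ipaddr"

# Range#=== checks inclusion; beginless/endless ranges work too (Ruby 2.7+).
def bucket(age)
  case age
  when (..17)   then :minor
  when (18..64) then :adult
  else               :senior
  end
end

# IPAddr#=== checks whether an address falls inside a masked network:
private_net = IPAddr.new("10.0.0.0/8")
in_private = private_net === IPAddr.new("10.1.2.3")  # => true

# Regexp#=== checks for a match, like the message-filtering example:
flagged = case "let's meet at 12 rue de la Loi"
          when /\d+\s+rue\b/i then :looks_like_an_address
          else                     :ok
          end
```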
In case this is new to you, this is just the new way of writing the old thing with the pipes where you name a block variable; this does the exact same thing. _1 is the first variable that you would enter there, _2 the second one, _3, et cetera. So let's say that we've defined a simple list of hosts. When we get, in this case probably a request, we could delegate to one of those to see if it's whitelisted or if something went wrong. And then, yes, we can take a request, let's say a webhook for example, and write our case on it and say: okay, when it's whitelisted, then I want to do something. If the host is unknown, I want to do something else. If the action is unknown, do something else again. And what this is going to do behind the curtain is call whitelisted and give it the webhook as first parameter. So it's again a more compact way, and it allows you to put that code somewhere else instead of having to copy-paste it into three ifs. And the last one: we're in Ruby, thankfully, so for every other class we've got duck typing. We can just implement the triple equal method and have it work for more or less anything that we want. So bear with me, because that's going to take a little bit of time. In this case, same thing, still sticking with my response example that we've been following the entire presentation. Here I can define in my Response class, or module, or whatever, different classes that implement triple equal and that do anything that I want. And then if I do this and I call them, it's going to do what we've seen before in the VM instructions, right? It's going to take the response, call triple equal with this, and then see if the answer is true or not. So with this, you can basically create as many matchers as you want, and especially on custom classes that can be pretty interesting.
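A sketch of both ideas, the proc-based matching and the duck-typed triple equal; KNOWN_HOSTS and the Response::Success matcher are invented names:

```ruby
KNOWN_HOSTS = ["hooks.example.com"].freeze

whitelisted = ->(req) { KNOWN_HOSTS.include?(req[:host]) }

webhook = { host: "hooks.example.com" }
route = case webhook
        when whitelisted then :process  # Proc#=== calls the proc with the webhook
        else                  :reject
        end

# Duck typing: any object implementing === can sit in a when clause.
module Response
  class Success
    def ===(other)
      other[:status] == "success"
    end
  end
end

resp = { status: "success" }
outcome = case resp
          when Response::Success.new then :proceed
          else                            :fail
          end

route    # => :process
outcome  # => :proceed
```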
One example that came to mind also is payments: if you're managing payments, then in your Payment class you can define different subclasses, which could be Success or Canceled or Processing, that just call your payment API and check if it worked. And so all that code is in its own place, and then you instantiate your object here and you can use case to easily delegate where you're going. Another example that we've kind of used is a wrapper for services. So basically you define new classes for your service, and your service answers with a class that's either a success or an error, and then you can use this to do some kind of early-days pattern matching. So, speaking of pattern matching, how does it work? Again, just in case, we're going to go quickly through what it is and how it works. The whole idea of pattern matching is that, as the name implies, you define a pattern, then you try and match it against something and see what sticks. So here my pattern is going to be a hash with a status key, a body key, inside of which I'll have a user with a name and an age, and whatever is in here, if I can match it, I want to store it in the variable. And once you have your pattern, you can try and match it against any collection of stuff. So in this case, it's going to work, because we had the same status and the form that we're trying to match against was the same, and what it's going to do is assign the name variable to whatever was there and the age variable to whatever was there. If you try to match it against something that looks very different, so this hash for example, it's not going to work: even though status and body are here, this value is not going to match against that one, right? So if you try and do this, then it doesn't work, so you're going to get an error.
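A minimal sketch of that match (Ruby 3.0+ rightward-match syntax; the payload is invented):

```ruby
response = { status: "success", body: { user: { name: "Ada", age: 36 } } }

# Same shape: the match succeeds and binds name and age.
response => { status: "success", body: { user: { name:, age: } } }
name  # => "Ada"
age   # => 36

# Different shape or value: the match raises.
mismatch = begin
  err = { status: "error" }
  err => { status: "success" }
  :matched
rescue NoMatchingPatternError
  :no_match
end
mismatch  # => :no_match
```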
In Ruby at least, this is going to raise an error that just tells you "I wasn't able to match it", and in Ruby that was implemented using case. So the way it works is: if you have a response, or literally anything, you create the different patterns that you're going to want to match it against. And one thing to note, to mark the difference, is that you no longer use case/when, you use case/in, because in is the keyword that's mainly used for pattern matching, even outside of cases. So in this case, if the response that I get has a status success, I'm going to take whatever is in the body and put it there, and otherwise if it's an error, I'm going to fail and put it over there. So again, it kind of does the same thing. The whole counterpoint to this presentation could be: I could do it with an if/elsif. You always can, but I do think this is a bit more verbose and makes it more clear what you're trying to do, because you can see the entire pattern. Whereas if you wanted to do an if, you would have to open response and say: if the status is success, then I want to look at the body. For this example it looks the same, but if you're dealing with big JSONs from APIs, where everything is nested four levels deep and you have response, body, value, and then you take the first element and then the address and then whatever, this starts to become more interesting. Another thing that we get with pattern matching that we can't do with case/when is access to guard clauses. What that allows us to do is: I want response to match with this only if I'm not in maintenance. So this gives us a bit more control over whether or not we want the pattern to match, because sometimes you might want patterns that are very similar, but you want to condition them on something different. Another thing that we can do with pattern matching: let's look at a more complex pattern.
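A sketch of a guard clause on the response example (the maintenance flag is invented):

```ruby
maintenance = false
response = { status: "success", body: { data: 42 } }

result = case response
         in { status: "success", body: } unless maintenance
           body                  # only matches when we're not in maintenance
         in { status: "error" }
           :failed
         else
           :unhandled
         end
result  # => { data: 42 }
```

With `maintenance = true`, the first pattern is skipped even though the shape matches, and the same response falls through to the else branch.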
We have access to a lot of new tools. So for example, what this thing here means (the pin operator, ^) is that I want to match this pattern where the ID is whatever I put on top. If I didn't put it, then it would act like the one we saw before and store into the variable id, but by doing this I can tell it: no, no, no, use the value that's already there and match the one that has 69 as an ID. I don't want anything else. And we also have access to splat operators, kind of: a simple splat for arrays, a double splat for hashes, the same as with method arguments. So what this allows me to do is: I want to take user, and if the user is in an array with some elements at the beginning, some elements at the end, and then somewhere in the middle an element with ID 69, I want to store the value of admin. So this is kind of equivalent to taking my entire array and doing a detect where ID is 69 and then printing admin. It kind of does the same thing, but in a more flexible way, because I can then keep putting more patterns underneath to filter out more stuff or try to find more elements. So, how does it work? At this point in the talk, I wanted to go through the same journey with pattern matching as I did with a simple case: try to open it up, look at the VM instructions, see how it works and try to figure out what's underneath. The problem is that pattern matching is kind of new. So in the Ruby VM, there are a lot of instructions to go through. So I ain't going to go through everything. But there are a few things that we can see here. So for example, here we have the same response; that's the beginning of our case. So this calls the thing that we're going to try and pattern match against, same as before. We're looking at pattern matching, so of course the thing called check match, we can kind of assume that it's going to match our pattern against something.
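A sketch of the pin plus find pattern (Ruby 3.0+; the user list is invented):

```ruby
id = 69
users = [
  { id: 1,  role: "user"  },
  { id: 69, role: "admin" },
  { id: 2,  role: "user"  },
]

role = case users
       in [*, { id: ^id, role: }, *]  # find pattern: somewhere in the array,
         role                         # an element whose id equals the pinned id
       else
         nil
       end
role  # => "admin"
```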
So the way, at least the way I understand it, is that all of this is going to build our pattern, and then it's going to match it and continue. And if we look at the way it builds the pattern, we can find one method that is interesting, which is this one: deconstruct_keys. And after looking at it a bit more and going to read the documentation, this is what Ruby uses, at least for now, to do pattern matching. So you have two methods. One is called deconstruct_keys, which is used when the pattern is a hash. And another one is called deconstruct, which is used when the pattern is an array. Makes sense? And so this does all of the deconstruction, and if the object that you're matching doesn't respond to the deconstruct_keys or the deconstruct method, then it's just going to give up and tell you to implement it yourself so that it works. And after that, it's more of the same thing, right? That's the second pattern that we have; it's still trying to deconstruct things. And then eventually, if it doesn't find anything, it's going to raise a no-match error. So the interesting thing then is: how do we implement it ourselves? If you have your class and you want to use pattern matching on it, then one thing that you can do is implement the deconstruct_keys method. So in this case, we have a location and we want to have a latitude and a longitude in the deconstruct_keys. And that allows us, every time we have a location, to use pattern matching on it, because it's going to deconstruct this, deconstruct this, and then see what matches. And an interesting thing also: inside of our pattern, we have access to everything that we've been talking about earlier. So in your pattern you can put classes, you can put regexes, you can put ranges, in this case.
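A sketch of that Location class, with attribute names following the description above:

```ruby
class Location
  def initialize(latitude, longitude)
    @latitude = latitude
    @longitude = longitude
  end

  # Called when a Location is matched against a hash pattern. The `keys`
  # argument lists which keys the pattern asks for (or nil for all of them).
  def deconstruct_keys(keys)
    { latitude: @latitude, longitude: @longitude }
  end
end

coords = case Location.new(50.85, 4.35)
         in { latitude: Float => lat, longitude: Float => lng }
           [lat, lng]
         end
coords  # => [50.85, 4.35]
```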
And the only thing I think we haven't seen before is this little => magic that takes whatever matches here and then stores it into a variable that we can then use for anything else. And I think that's it. I've tried to go through everything. I sped through that one, sorry, so we have some time. Question: You used a variable that was not declared before. Yeah, probably. Where? The latitude one? Did you declare latitude to be equal to nil before? No, you don't have to declare it before. Basically what this does is it takes whatever matches here, so that would technically be this, and then stores it into the latitude variable. You don't have to declare it before. Question: And what's the scope of that variable? It's scoped to whatever the case is in, right? So if your case is defined in a method, then you have access to it in the entire method. Question: Is this in current Ruby? Yeah. I think this might have been implemented in Ruby 3. The first occurrence of pattern matching, the one with the case/in, was experimental in 2.7 and then actually arrived in Ruby 3. And they've been trying to push it a bit more in subsequent versions. So now, for example, you don't necessarily need to have a case: if you want to use pattern matching, you can just write your variable, in, something, and use it as a predicate to see if it matches or not. Question: In your example where you're looking for an admin user in an array of users and you have those splat operations at the ends, does that work if your admin user is the first or the last? Yeah, yeah. Like it might not. Yeah, fair, definitely fair. What this will do is it will put nil in here and nil in the other variable, right? It's like there's nothing after, or there's nothing before.
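That predicate form, sketched (Ruby 3.0+):

```ruby
response = { status: "success", body: {} }

# Outside of a case, `in` acts as a predicate: true/false instead of raising.
ok  = (response in { status: "success" })
bad = (response in { status: "error" })

ok   # => true
bad  # => false
```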
Yeah, that's the thing that I was a bit iffy about. Basically the, shit, I have to go through all the animations, sorry, bear with me, it's going to scroll again. Okay, sure, whatever. The argument that it takes is for the case where you only want to deconstruct some keys. So if you have a big object and you only want to deconstruct latitude, for example, you could use it that way. That's what it's supposed to do. In the example, I didn't go through the trouble of implementing all of it, because the code would get too big, and yeah, that's why. Question: So it was deconstruct for arrays and deconstruct_keys for hashes? Yeah. And you can define deconstruct as well, if you've got a class that implements an interval or something. Probably, yeah, I think so. Question: Just to ask how stable you think the syntax is. Do you think it's going to stay the same? Huh. I think it's going to stay the same, because it's the exact same syntax that Elixir uses, for example. They've probably been inspired by other languages and used that. So I'm expecting it to stay the same. But then again, I don't know. I think right now I'm trying to push for it in very simple use cases. So usually, if we have to make an API call, that's probably the best foot in the door to get it working in your code base, because that's the thing that seems the most obvious, right? I get an answer and then I can not only fetch the status, but assign everything in the answer and then give it to another method. I'm not a frontend dev, so don't quote me on this at all, but it looks a bit like the object destructuring thing from JavaScript.
Where you can get an object and then assign all the variables out of it. In this use case, I think it's a good first step to implement it in a code base. I wouldn't go all out and start putting deconstruct_keys in every class. I really hope they put it in Rails at some point; I don't think that's in the plans right now. I think the main idea behind it, when they put pattern matching in Ruby at all in 2.7, it was kind of touch and go. People were discussing a lot about "do we want this in our code base", because pattern matching in the collective brain is usually more functional than object-oriented. But now that it's there, and it's past the experimental stage and is now stable, I think they're eventually going to do it. It'd be a shame not to, right? Question: Do you think some of this stuff is going to end up in the Ruby style guide, and be something where RuboCop goes and says: no, you don't want to do that, you want to use this instead? Probably not in the near future, because I think people are still very much trying to figure out what good style is. Even when I was preparing this, I couldn't find a lot of examples. So I kind of came up with what I think looks best. But I don't think, for now at least, there are a lot of established guidelines. We good? Cool. Nice.
Besides Web: a Worker story.
Okay, awesome. The mic is on, hopefully. All right, good afternoon everyone. So I'm going to talk to you about a worker story, which is something we did at work recently. And for once, it was not using Rails. That's awesome. Not using the web at all. That's what motivated me to tell you this story. So before we start, I would like to know: who here is a Rails developer? Yeah, awesome. Who would say that they are a Ruby, but not Rails, developer? Okay, awesome. That's great. Love it. I didn't expect that. Awesome. All right, first of all, who am I? Because if you don't know who I am, you might not rely on whatever I'm going to say. So I've been a Ruby and mostly Rails developer for 10 years. I've been working with Kevin for almost that whole period. More recently, I have become a lead dev, then a manager, then a CTO. So I have a lot of new responsibilities now, which also gives me a new perspective on a lot of programming topics; you actually get a new perspective when you start making decisions about people and processes and stuff like that. And finally, I've been a teacher for more than six years. I've given lectures at EPL and Le Wagon. Hopefully, we'll do that again. I have a deep-rooted love for teaching and sharing knowledge, and this is also why I'm here today. So, as I was saying, the point of this talk is talking about Ruby, but not about Rails, not about the web. And this was a first time for me, a new experience. And it's strange to see how much changes when you start doing that, how much you realize Rails was giving to you once you don't have it anymore. I have some notes. By the way, all my slides are going to be minimalistic; I'm not going to show you a single line of code. I'm also going to forget a lot of stuff, which is why everything I intend to tell you is written in notes available directly in the slides. So hopefully you will get everything I intend to say, because I'm going to forget part of it.
So the main message of this talk is: it's doable. It sounds strange that this is my message, but as most Rails developers sometimes, when we think about a plain Ruby program, we're not even sure we can do it. We're not even sure how we would approach it. So the main message is: yes, it's doable. There's a lot of tools, there's a lot of process, there's a lot of help along the way. And you can very likely get most of your tools and knowledge used in a normal, non-web Ruby application. The second news is you can also get most of your Rails knowledge useful in a Ruby application if you get things right. So the story I'm going to tell is about a worker. What is a worker in our case? It's like a microservice. The specificity, why do we call it a worker? Because it's not a web microservice. It's a microservice which is consuming messages from a queue, and very likely it's going to process files: it's going to get files from a bucket, process them locally, put them on another bucket. We are using the word worker because we have lots of them. That's the simple definition: we have lots of them. So I'm going to talk about one of them, but it could be any of them. So the story starts with a loop. The whole story starts with a loop, because when I started this, I opened my editor and I saw something which I hadn't seen since school: an empty directory. It's very strange. As a Rails developer, I'm really used to rails new, and then you get everything. You get a folder tree, a structure. You get the config directory, you get the app directory. There are drawers everywhere for where you're expected to put things. In this case, I just created a new folder and it was empty. I'm a firm believer in emergent design. So I started immediately: new file, worker.rb, make a loop, while true, read, perform, delete message. I'm done. It was nice.
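That first draft of the loop, sketched with a stand-in queue (Ruby's stdlib Thread::Queue; the message shape and the stop sentinel are invented so the sketch terminates):

```ruby
# worker.rb, first draft: read, perform, delete, forever.
queue = Queue.new            # stand-in for the real message queue
queue << { file: "a.csv" }
queue << :stop

processed = []
loop do
  message = queue.pop          # read
  break if message == :stop
  processed << message[:file]  # perform (placeholder for real processing)
  # here the real worker would delete the message from the queue
end
processed  # => ["a.csv"]
```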
I knew it was not the end, but it was capturing whatever I knew about my process. It was a single level of abstraction. So I knew it was a good start, but it wasn't. It wasn't a good start, because I was already forgetting my main tool when doing Rails apps, which was going to be my main tool when doing any app: tests. Anybody who knows me knows that I'm a firm believer in tests. And it's a policy. It's not a religion, but it's a policy. This is how I write code. I do believe in it, but your mileage may vary. For me, it was the beginning. And it's funny, because I knew I was going to write the loop of my program, but I was also starting another loop: the loop of my process. And this is what tests are for me. Test first does not mean you do test, then you do code, then you're done. Test first means your first step in the journey is a test. Then code, then test, then code, then test, then code. That's what it means to me to do tests. But I did it wrong. I started with code. So I tried again. I deleted my file. I created a spec directory. I created a spec file explaining what I knew about it. And I was happier, because the test is the file that depicts my best understanding of what I currently believe is success. And I need that, because I'm going to write code right afterward. And once you're deep in the code, you're super focused. You forget about the landscape. You don't know what comes next. You might have a story, you might have specification requirements, you name it. But I do believe that a story or specification is like the coordinates of where you're supposed to land. The whole activity of development, of programming, is like playing golf in the fog by night. You know where you are at the beginning. You sort of know where you want to land. But after your first shot, you're going to be lost.
It doesn't even matter anymore where you're supposed to land, because you've taken your first shot and you don't even know where you are anymore. I'm using tests as torches in the night. So I read my specs. I write some tests. This is my belief; I'm going to follow that path. And then I take my first shot. Hopefully I'm going to reach my first torch in the night. When I have reached that one, I'm going to go to my second torch, again and again. But my loop is that my test is only my best understanding of my success. So my test is going to evolve. I'm going to move my torches and I'm going to move my ball. And this is how they make sense together. Back to the story. I wrote my test, was happy with my understanding, ran it, and it failed. It was a catastrophe. And why did it fail? Well, because it couldn't find RSpec. Because I didn't bundle it. Because it couldn't find Bundler. That is how empty the whole story was: I didn't even have Bundler. Okay, so bundling is always easy: bringing in my dependencies, starting my Gemfile. I need to run my spec. Run it again. Well, it still fails, but for a better reason. And that's the whole point of TDD, right? You have to fail, but for a better reason than the previous failure. So now it's failing because it doesn't know what a queue is, what the receive method on the queue is, what a message is, what a processor is, what perform even means. Well, that makes me happy, because now I can actually write more tests about what I believe a queue is at this stage, what I believe a processor is, what I believe the receive method should do. And this was really the start of both my loops. I got my main loop back, but I got my working loop as well. I got a lot of tests. I knew that trying to make them go green would just generate more tests. I got my actual work loop. Right. So: test, code, test, code, test, code. I was in the middle of it.
And every single one of the code files was starting with probably five to ten require or require_relative lines. And I wasn't happy with that. First of all, because it is boilerplate, it's noise, and I don't like noise. Also because I want my code files to be about the responsibility they're supposed to hold. Knowing which files contain the dependencies that this file depends upon is not the responsibility of each file; a file shouldn't know where I store the other responsibilities. That was wrong. And this is not something we have to do with Rails. I realized that we actually get something super nice from Rails: put any file in any subdirectory of the app folder, and you get it. It's like magic. Once you have to write all your requires by hand, it feels wrong. So I Googled. I got a few options. And the best one, which is actually the one currently adopted by Rails, was using Zeitwerk. Hopefully I'm pronouncing it right; it's written in my speaker notes. And that helped me autoload the constants I was looking for by looking them up in my lib directory. Default config; I'm happy, as far as I know this is what I need. But reading the rest of the Zeitwerk documentation, I also realized that it enables you to use short names. So if you are in the same namespace, you can just mention a constant by its short name. Well, obviously, I want that. I'm doing that in Rails, so I want that again. It's also handling multithreaded code loading. I have no idea if I'm going to need that, but I certainly don't want to handle it myself. It sounds like something I really don't want to handle myself. And it also handles code reloading, which is not something I'm going to use, because of TDD. But again, this is my approach; I know that most people don't do that. And code reloading is a very important part of code loading. So Zeitwerk was my first take, my first really great companion that I found along the way. The second one was dry-container.
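What that setup looks like, as a sketch (assuming the zeitwerk gem is in the Gemfile; the directory and constant names are illustrative):

```ruby
require "zeitwerk"

# With this, lib/my_worker/processor.rb defining MyWorker::Processor is
# autoloaded on first use, with no require lines in the code files.
loader = Zeitwerk::Loader.new
loader.push_dir("lib")
# loader.enable_reloading  # opt-in code reloading, must come before setup
loader.setup
```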
Now, a small disclaimer: I knew from the start that I was going to use the dry-rb gems, because I wanted to. And as Kevin said, it's also a little bit about finding joy. So I wanted to heavily rely on the dry gems, but I wanted to wait until the use case was there. Because I did not only want to skip the requires, I wanted to not know the classes. I wanted to not call new in the middle of my code. My code is about business logic; most of the code is about business logic. I wanted to separate the logic about creating objects from the logic about "I need something". And most of the time, when you're in a Rails controller, you don't even care where the request object comes from. You're just like: okay, I want a request object, just make it happen. If you're in a view, you don't care where the view context comes from. You just have it. You just want it. And it's really comfortable to write code focusing on using the stuff you need, not on how you get it. So this is what dry-container brings. I've been using dry-system, which is like dry-container for handling all of that, and dry-auto_inject. And dry-auto_inject basically works hand in hand with dry-container and allows you to call your services, your dependencies, by their small name, their first name. You give a name to an object, and then you can basically say: okay, I want this object. I don't want this class, I don't want to instantiate that class, I want specifically that object, and I'm going to use it. I don't even care what its class is; I want that object by name. Interestingly, this had almost no effect on the tests. Even though it's a very different approach, I still had most of my tests instantiate objects by themselves. Why? Because unit tests actually give a lot of fake dependencies. That's the point of a unit test, right? You want to test a single unit. So I was still building my subject in tests manually.
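A hand-rolled sketch of the idea (the real gems are dry-container and dry-auto_inject; the Container and Worker here are invented stand-ins):

```ruby
# A tiny registry: objects are registered once under a name, and business
# code asks for them by name instead of calling `new` in the middle of logic.
class Container
  REGISTRY = {}

  def self.register(name, &block)
    REGISTRY[name] = block
  end

  def self.resolve(name)
    REGISTRY.fetch(name).call
  end
end

Container.register("settings") { { "bucket" => "uploads" } }

class Worker
  # The dependency arrives by name; unit tests can pass a fake instead.
  def initialize(settings: Container.resolve("settings"))
    @settings = settings
  end

  def bucket
    @settings["bucket"]
  end
end

Worker.new.bucket                                    # => "uploads"
Worker.new(settings: { "bucket" => "fake" }).bucket  # => "fake"
```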
And for the larger, broader tests, I actually wanted to use the container set up correctly, because I wanted to test that things were correctly wired together. So even though with dry-container you can stub and fake and change whatever you want, I didn't stub it, because I was either using it and testing it, or not using it at all in my test. And... sorry. Yep. Yeah, I'm still in time. Dry-system also brings something else, which is quite interesting: a settings object. And I realized very soon that the settings object was the object that I was injecting everywhere. Almost every part of my system needed to access settings, so I was injecting it everywhere. It was awesome. And the settings object provides some really interesting value. First of all, it allows any of the settings to be overridden by an environment variable, which is quite important. If you know about the twelve-factor app, this is one of the aspects you want: for your config to be overridable by the environment that your program runs in. So that was the first part. And the second part is that you can coerce, you can define the type of your settings. Because if you work with environment variables, everything is a string. But when you work in your system, not everything is a string. We do have a lot of strings, but we have dates, we have integers, we have a lot of types. And usually what we do is we just parse them. Dry-types allows you to create your own types, and to name them, for starters. Naming things is probably the most important stuff we do in our work, I believe. You can name your types and get your settings in the proper types, which brings me to my next slide about dry-types. So dry-types creates a contract. It says: okay, this value, this setting, has to be a phone number. And I'm going to explain exactly what a phone number is.
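Those two properties, environment override plus coercion, can be sketched by hand like this (the setting names and coercions are invented; dry-system and dry-types express this declaratively):

```ruby
# Each setting has a coercion from the (always string) ENV representation.
COERCIONS = {
  "MAX_RETRIES" => ->(s) { Integer(s) },
  "TIMEOUT"     => ->(s) { Float(s) },
}

def load_settings(defaults, env)
  # Environment wins over defaults (twelve-factor); strings get coerced.
  defaults.merge(env.slice(*COERCIONS.keys)).to_h do |key, value|
    [key, value.is_a?(String) ? COERCIONS.fetch(key).call(value) : value]
  end
end

settings = load_settings({ "MAX_RETRIES" => 3, "TIMEOUT" => 1.5 },
                         { "MAX_RETRIES" => "5" })  # e.g. ENV.to_h
settings  # => { "MAX_RETRIES" => 5, "TIMEOUT" => 1.5 }
```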
And I'm also going to coerce, like, a string into a phone number, which means at the end of the day I either have an error or I have a phone number, which is exactly the object I want. And it makes a big difference. I don't know if any of you have ever created a class like PhoneNumber, like Age, like BucketName. If you read the literature about object-oriented design correctly, we are supposed to do that. We are sort of supposed to do that, like subclass String when we want to make a FirstName. To be honest with you, I've never done that in my life. I've always used String, and it's not a first name, it's a string. I know it's a first name, I know I'm not going to use all the methods of String, but the variable is named first_name, and that's enough. Using types allows us to actually have proper, more meaningful types, without creating full-blown classes for everything. Well, settings is one thing. But this contract can really be used for something else. It can be used for app input. When you are working on a web application, the app input is the request. This is where most of our payload comes from. In our case, the app input was messages from a queue, but the concept was very similar: as soon as we got a message, we treated it in a very similar fashion to how we would have treated a request. When working with app input on the web, there's a very well-known pattern for handling that input, for validating it, for coercing it into everything you want. These are form objects. We basically reused the same idea. I realize that I'm doing my slides in the wrong order, but you don't care, because you don't have the order. That's okay. We used a kind of form object in the form of a dry contract. It comes from dry-validation, the gem we have been using. dry-validation is really about two pillars. The first one is about typing; eventually, it leverages dry-types.
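The phone-number example above, a named type that either coerces or errors, can be sketched in plain Ruby (the `PhoneNumber` struct and the format rule are invented, and much simpler than any real phone validation):

```ruby
# A "named type": a small callable that either returns a proper
# PhoneNumber value or raises, so downstream code never sees a raw string.
PhoneNumber = Struct.new(:country, :digits)

PhoneType = lambda do |value|
  m = /\A\+(\d{1,3})(\d{6,12})\z/.match(value.to_s.delete(" -"))
  raise ArgumentError, "not a phone number: #{value.inspect}" unless m
  PhoneNumber.new(m[1], m[2])
end

PhoneType.call("+32 470 12 34 56")  # => a PhoneNumber struct
# PhoneType.call("hello")           # would raise ArgumentError
```

So the calling code gets exactly the guarantee described: either an error, or the object it actually wants, without a full-blown class hierarchy.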
It ensures that you get the keys of your payload that you expect, that you get the values that you expect, that basically your data is of the type you expect. That's the schema, that's the structure. Once you have the proper types, you still have business logic to handle. This is the second pillar of dry-validation. A typical example would be if you have to handle a deadline. Imagine that somewhere in your payload there's a deadline. The first pillar would ensure that the deadline is actually a date, because what you get is a string. Hopefully it's an ISO 8601 string, but it could be anything else. You want to coerce that into a date, you want to ensure that you have a date. If it's not coercible into a date, you want a first error. But now that you have a date, you also need to validate that this date is in the future. This is what the second pillar is: you can create rules, business rules. That means that once your payload goes through the dry-validation mechanism, you actually get a very valid, very reliable payload, from a typing perspective but also from a business perspective. Once we have that payload, what do we want to do with it? We actually want to process it. For that, we are using a pattern named Interactor. At least, we used to use a gem named Interactor. You can think of an interactor a little bit like an operation in Trailblazer. I don't know if anybody has used Trailblazer previously. No? Okay. All right, I'm going to go back. The idea of an interactor is that it is the entry point to your business layer. Because the entry points to most web applications are the controllers. This is how... I'm not talking about the routes; let's consider that the entry point is the controller. But that's not true, because sometimes your entry point is your test. Sometimes your entry point is a rake task. Sometimes your entry point is an Active Job. Sometimes your entry point is a channel.
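The two pillars of the deadline example can be sketched in plain Ruby (no dry-validation here; `DeadlineContract` and its result shape are invented): first the typing pillar coerces the string into a Date, then the business rule runs only on the already-typed value:

```ruby
require "date"

class DeadlineContract
  def call(payload)
    errors = []
    # Pillar 1: typing. The deadline must be coercible into a Date.
    deadline = begin
      Date.iso8601(payload.fetch("deadline", ""))
    rescue ArgumentError
      errors << "deadline must be an ISO 8601 date"
      nil
    end
    # Pillar 2: business rule. Only runs once the type is right.
    errors << "deadline must be in the future" if deadline && deadline <= Date.today
    errors.empty? ? { ok: deadline } : { errors: errors }
  end
end

DeadlineContract.new.call("deadline" => "2100-01-01")  # a reliable, typed payload
DeadlineContract.new.call("deadline" => "yesterday")   # a typing error
DeadlineContract.new.call("deadline" => "2001-01-01")  # a business-rule error
```

In the real gem the schema and the rules are declared separately, but the ordering guarantee is the same: rules never see untyped data.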
So you actually get a lot of entry points into your app. But at the business level, you don't really care if you want to delete a user because of a GraphQL request, a REST request, or an Active Job. You want to delete a user. It's the same business unit. And this is how we encapsulate things: using an interactor. One interactor is responsible for one business unit. And, very fortunately, dry-rb has a solution for us. It's named dry-transaction. So their name for it is a transaction. It allows you to create a series of steps. It relies on dry-monads, because each step gives you a result. If the result is a success, then the next step is going to happen. If the result is a failure, then the next step is not going to be run; you're going to keep your failure. This is known as railway-oriented programming. Nothing related to Rails: it's just that you either stay on your success track, like a train track, or at each step you have a junction to your failure track. Well, the thing is, we didn't use dry-transaction. I wanted to let you know, because I would really recommend that you use it. I wanted to use it, but we also have a team of several developers who are used to our interactors. And it sounded like a better idea to use what everybody knew than to try to reinvent the wheel. We had something, it's working well, everybody knows it well. So this is my manager voice talking: if it ain't broke, don't fix it. But if you're starting from scratch, give a chance to dry-transaction and dry-monads. At this point in the talk, I had hoped to try my own definition of a monad, what is a monad, which would probably take the next two hours. So let's skip it. So the end of this slide is about why we want to do all that validation early. And this was also something a bit new. First of all, failing early is a good idea. But that was not reason enough, because doing the business validation at each step would arguably have made more sense.
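Railway-oriented programming can be shown in a few lines of plain Ruby, in the spirit of dry-transaction and dry-monads but without the gems (the step names are invented): each step returns a Success or a Failure, and a Failure rides through every remaining step untouched.

```ruby
Success = Struct.new(:value) do
  def then_try
    yield(value)  # on the success track: run the next step
  end
end

Failure = Struct.new(:error) do
  def then_try
    self          # on the failure track: skip every remaining step
  end
end

check_permission = ->(user) { user[:admin] ? Success.new(user) : Failure.new("forbidden") }
delete_record    = ->(user) { Success.new(user.merge(deleted: true)) }
send_email       = ->(user) { Success.new("mailed #{user[:name]}") }

check_permission.call({ name: "ada", admin: true })
  .then_try(&delete_record)
  .then_try(&send_email)    # => Success("mailed ada")

check_permission.call({ name: "bob", admin: false })
  .then_try(&delete_record) # skipped
  .then_try(&send_email)    # skipped: the Failure("forbidden") rides through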
It's just easier to keep the business steps together. It makes more sense, if you want to check some permission, then delete a record, then send an email, that you do everything related to sending the email at the sending-email step. It doesn't really make sense to already check that stuff at the start. But the thing is, in Rails we are very much used to a highly rollback-able environment, because most of what we do (well, sending email doesn't count) is manipulate the database. And it is a huge comfort being able to say MyRecord.transaction do, blah, blah, blah: if anything goes wrong, just roll back, and done, nothing has happened. When you're doing a microservice, at least the kind we are doing, nothing is rollback-able. Everything you do, whether you send an API request to something, delete a file, download a file, create a file, there's no rollback for that. And this is why it was so important to check as much as we could right from the start. All right, next step. Next challenge. The next challenge was an interesting one, as every challenge is, because it was about design, and design opinion. And there's no truth, no strong truth, in design opinion. So what was the challenge exactly? The challenge was that we realized we were not using dry-container properly. It felt like we were supposed to use it in a new way. Why was that? The reason is that we are very used to object-oriented design, object-oriented programming, which means we put together state and behavior in small objects, and they are responsible for doing their stuff. And dry-system, dry-container, was pushing us to use stateless objects, because that's what you need if you want to inject something everywhere: it had better be stateless. But the code we wanted to write, because we have a lot of experience with that, was stateful. We don't want a command wrapper.
We want a command execution specifically about this option. We want to ask for a specific invocation. We don't want the full program. So it was very important to be able to write the code that we wanted to write, but it was also important to use the tools properly. And initially what we did is we had that big interactor, our big entry point, get injected with a ton of stuff from the container. It was getting all the services that it would eventually use, and that interactor was instantiating all the small, short-lifecycle objects that it was going to use, giving them their state (maybe the current date, the current user, the current payload) and all the dependencies those objects needed. Maybe there's a command service, maybe there's an API client, so the interactor was instantiating all of that, which means the interactor knew about almost everything. There's a name for that: god object. And it's a bad sign. So we knew we were doing something wrong. We had a small discussion, and we realized that the literature, again, had a solution for that. There's a pattern made for that. The pattern is the factory. So what we eventually did is that we created new services, factories, very shallow services. Each factory was injected with the services that it needed, and the interactor was simply injected with the factories. The interactor was just asking the factory: well, give me a command invocation specifically about this file, about this API, about this payload. And it's funny, because it was so difficult at first to realize that we needed that, but at the same time the solution was so obvious. It also raised an interesting comparison with a former colleague of mine, who told me he was, like, a functional programmer. He, I'm not going to say despised, but he despised object-oriented programming. Well, I said it. And he told me: you know, an object is just a set of partially applied functions.
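The factory refactoring described here can be sketched in plain Ruby (all the names, `CommandFactory`, `CommandInvocation`, are invented): the factory is stateless and receives the long-lived dependencies once, then builds short-lived, stateful objects on demand, so the interactor only needs to know the factory.

```ruby
# A one-use, stateful object: it holds both a dependency and per-call state.
class CommandInvocation
  def initialize(api_client:, payload:)
    @api_client = api_client
    @payload = payload
  end

  def run
    @api_client.call(@payload)
  end
end

# A stateless, reusable factory: it holds only dependencies, never per-call
# state, which is why it is safe to register it in the container and
# inject it everywhere.
class CommandFactory
  def initialize(api_client:)
    @api_client = api_client
  end

  def build(payload)
    # First partial application (dependencies) happened at factory creation;
    # second partial application (state) happens here, per invocation.
    CommandInvocation.new(api_client: @api_client, payload: payload)
  end
end

factory = CommandFactory.new(api_client: ->(p) { "sent #{p}" })
factory.build("file-1").run  # each build is a fresh one-use object
factory.build("file-2").run
```

The comments mirror the partial-application analogy that comes up next: dependencies first, state second.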
He was very disdainful about it, like: oh, it's just a set of partially applied functions, we have, like, objects at home. Well, it's not the same. But to be honest, introducing those factories gave me that feeling, because we had those functions, and we were partially applying all the dependencies (that's the first partial application), and then we were partially applying the state. It also opened our minds about what is stateless and what is stateful. Usually we think state is all your instance variables. It's not really true; your dependencies might still leave you stateless. Your state is really what makes an object throwaway. If it's a reusable object, it's stateless. If it's a one-use object, it's stateful. That's sort of our new definition. And factories help us create one-use objects, because factories are all stateless objects. Well, I felt bad creating this slide without mentioning a single dry gem, so I also want to bring one here: the dry-initializer gem. And to be honest, this is my favorite, and it's so small. The thing is, it is so small that it's crazy that it is my favorite gem. It creates constructors. It just creates an initialize method. But why does it matter? Because if you are very strict about it, all your initializers very probably look the same: you pass them arguments, and then you store them into instance variables. Nothing more, because doing business in an initializer is a bad idea. So you get the same initializer time and again, and it makes no sense, and it creates noise. And if you follow most style guides, it has to be at the top of the file, so it also takes a very important part of the focus, because the top of the file is very important. So dry-initializer just does that. You can write one line for each dependency or piece of state that you want. You can give it a type. You don't have to, but if you have a dry-types type, you might want to.
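A toy version of what dry-initializer generates can be written in a few lines of plain Ruby (`MiniInitializer` and `Report` are invented; the real gem does much more, including types, coercion, and required options): one declarative line per dependency replaces the hand-written constructor and attr_reader boilerplate.

```ruby
module MiniInitializer
  # Each `option` call records a name plus default, defines a reader,
  # and (re)defines initialize to store every declared option.
  def option(name, default: nil)
    options = (@options ||= {})
    options[name] = default
    attr_reader name
    define_method(:initialize) do |**kwargs|
      options.each do |opt, dft|
        instance_variable_set("@#{opt}", kwargs.fetch(opt, dft))
      end
    end
  end
end

class Report
  extend MiniInitializer
  option :user
  option :format, default: :pdf
end

r = Report.new(user: "ada")
r.user    # the injected value
r.format  # the default
```

The payoff is exactly the one described next: the top of the file becomes a short, declarative list instead of a noisy constructor.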
You can give it a default value, and you automatically get an initializer that accepts them, and you automatically get an attr_reader for each of the dependencies. If you don't want the reader, you don't have to have it, but by default you get it. And that's it: you transform something very long and noisy into a series of lines. We used to have attr_reader anyway (most of our classes have one line of attr_reader anyway), so it changed nothing in terms of noise. It changed everything in terms of clarity and intention, and anyone reading a file now gets something directly by reading those lines. And yeah, I'm still on time. Well, we were done with the application code. Of course we had additional challenges, but eventually, using those tools and approaches, we reached the end of the application, and we were done, right? Well, no. We still had to package it. We still had to deploy it, because even though we had actually solved the problems, we had nothing running. This is again a time when we realized how rich the Rails ecosystem is, because for deployment you either get services like Heroku and similar, or you use Capistrano, which does everything for you: you write one Capfile and everything is magic. When we had to deploy, we were like: yeah, we have files with code, but we still have no application. So we got some help from partners for that. We use Docker Compose locally for creating containers. We use Kubernetes remotely for deploying them. We use Helm for actually doing the deployment. And this led us to realize that we still had problems, because we had no observability. We had very difficult access to the log files. There was still a lot of stuff we didn't have. So what we did is we introduced Yabeda, from Evil Martians. I don't know if anybody is from Evil Martians here, but if you are and if you're watching us: thank you, you're awesome, Evil Martians. So we used Yabeda, which is an observability framework.
It allows you to declare what you want to observe, to create metrics, without having to care where you intend to put those metrics or what you intend to do with them. Then, in another part of Yabeda, you can say what you actually want to do with them. You can separate the two, so your business logic is not riddled with technical details about monitoring. This observability allowed us to expose some metrics, which in turn enabled us to set up autoscaling and to measure health. These are typically things you get for free in Rails if you're using New Relic or Datadog, but we had to do it by hand. And we finally reached our latest challenge, because we are not experts in Helm or Kubernetes. We are actually very much noobs at that. So we had partners helping us. But those partners are also responsible for running our app and ensuring it works properly. So the agreement we had with them is that they handle their own repo with everything they do for us, and we have our own repo with our code base. And the problem we realized, and we still haven't solved, is that part of the application is actually in the infrastructure. This is something we are not used to in Rails. Typically, the queue we use has a dead-letter queue: if you try to receive a message and it fails, you release it, you retry to receive it, it fails again. After some number of attempts, you put that message into the dead-letter queue, because you don't want to waste more time trying to handle it. Another aspect: buckets have lifecycles. If a file is forgotten there, after 24 hours you want to delete that file. You don't want to pay fees for that file for the rest of your life. And this is application logic. Even though it sits in the infrastructure, it is application logic. And this bothers me, because with application logic, anyone who clones a repo should be able to see everything, to know everything. They don't have to be a master at everything. They don't have to change everything.
But cloning a single repo should explain everything there is to know about this app. So at the moment we still have those two repos: one focusing on the infrastructure, one focusing on the code base. Hopefully we will solve that very soon. But with that done, we actually had the app deployed, monitored, scaled, and we learned quite a lot. We actually made a blueprint out of it, so we are creating several other workers right out of that. And we feel much more confident using Ruby for something other than web applications. So thank you, everyone, for your time. Thank you. Any questions? We have two minutes for questions, hopefully. You've talked a lot about... I mean, first, you never talked about Rails, but you actually miss it a lot. It's pretty funny that it was not about Rails, but actually... anyway, you talked a lot about types. Is that something you want to bring to the rest of the ecosystem? Yeah, that's a very good question. So the question is: I talked a lot about types, do I want to bring that into Rails? Actually, the interactor is something we do in Rails already, which means we are using dry-validation already, which means we are using dry-types already. To be fully honest, we don't use it enough. We sort of use it when we realize that we should have used it before. So it's not good enough, but it is something we are using, and types have been very helpful in the past already. And there's a lot of other tools that we discovered here, because we had to, and I very much hope that we are going to use them. But also, my first slide said that I'm no CTO, I'm no manager anymore, which means I don't get to make those calls anymore. And it's very important to me that the ones who write the app are responsible for writing it, maintaining it, running it. I can influence, I can give my opinion, but I don't make those calls anymore. Yes? You said that you use dry-monads.
Can you tell me more about your experience? Because I used it quite extensively in the past, before they introduced the do notation. And it was very sticky to the code, as in, it made Ruby not look like Ruby, like something else. So, if something has changed there, how's your experience? All right, so the question is: do I use dry-monads, what do I think of the do notation, and how Ruby-esque does it feel? Is that right? Yes. Okay. So I am not using dry-monads, except for, like, toy projects. We are not using dry-monads in this; our own take is using our own interactors. So whatever I'm going to say is out of my experience on toy projects. I learned about monads initially in Haskell. This is still very painful to me, ten years later. So my take on monads is that, most of the time, it's not the right tool. And the learning curve for understanding what a monad is is so high that once you've earned the right to understand what it is, you want to put it everywhere. A little bit like metaprogramming. So this is my take on monads: I wouldn't force them onto anyone who is not very comfortable using them. I do believe that it is a very elegant solution, but I also believe that sometimes a bunch of if-else makes the team happier than using the best tool for the occasion. And I don't have any opinion about the do notation and how Ruby-esque it feels. All right, thank you.
Writing your own Rust linter
Can we have your attention? We'd like to begin with the next talk. We have Guillaume. He's going to explain to us how to write your own Rust linter, as you can see on the lovely slides. Luca, have we got the audio unmuted and everything? Perfect. Wonderful. Okay, take it away. Hi, everyone. I will try to speak loudly so everyone can hear. So, like he mentioned, today I will explain to you how to write your own Rust linter. First, a little presentation. I'm Guillaume Gomez. If you come every year, I give a talk, so by now you should more or less remember me, I think. I'm a member of a few teams of the Rust project, and I'm an engineer at Huawei. So first, let's explain what a linter is, in case some people don't know yet. A linter is a tool that is generally an addition to the compiler of a language. And here in Rust, I suppose everyone has heard about Clippy; at least I hope so. The goal is to detect some very basic logic errors, to suggest improvements for any method you might use, anything you could use better. The goal is to make your code better, in short. So now, how does a Rust linter actually work? We are entering directly into the subject. Let's say it's an extension of the Rust compiler. The Rust compiler has an API, a very unstable one, so very frequently we have to update the linter to be able to keep working with the Rust compiler. And that's exactly how Clippy works. When Clippy is running, it's actually running a lot of parts of the compiler to get things like the AST. For people who don't know what the AST is, it's a tree representation of your code. So if you have the struct keyword, it knows it's a keyword and that it's a struct. That gives you higher-level information than raw text. But it's not only that, because if you only had the AST information, you could only make suggestions like: yeah, you use these generics, but not in a good way, so you could do it like that, et cetera.
So the goal is to go beyond that and to get access to more information, like the borrow checker and everything. So if there is a trait you're using, but you could use another trait which does the same thing but shorter, we can now suggest it, because we have this information from the compiler. But because of that, we have to update the linter often, or never update which version of the compiler we are using. So why does it need to be a rustc extension? It's quite simple to explain: unless you want to reimplement all the parsing, the borrow checking, and pretty much everything, you'd better use what already exists and ask them nicely to make their API public so you can use it. And that's exactly how things went with Clippy, and that's exactly how I went as well. So, I mentioned a few limitations already. It can only work on crates compiled with the same rustc version. You don't see it with Clippy because it's tied to your compiler: when you install Clippy, it's tied to your current compiler version, so it just works. But it's something to keep in mind, as you will see later. Like I mentioned, the rustc API is not stable, so very often you have to update your linter code to keep up. It's tied to a specific rustc version, and I'm not talking about a stable release but literally a specific commit, which is a bit annoying. And because of all this, it's annoying to wrap in a cargo command, because you need to use a very specific rustc version. Again, we'll come back to that later. So, I will voluntarily not mention all the lint passes. I will only speak about the two main ones: the early and the late passes. The early passes give you access to the AST. So you are able to see the syntax and work a bit on it, but you don't have type information or anything. You can only know that this is a struct, what its name is, and that it has generics, but you don't know which traits it implements or anything.
You just have very basic information. And then you have the late pass, which goes a lot further: you have access to borrow checker information, you have access to everything. What is this type implementing? Does it implement this trait? What is its layout? Everything. So, in this case, we will talk about how to write a linter, but with rustc_tools. The goal of this crate is to wrap the rustc API into something easier to set up, because there is a lot to set up. And to add it, it's just that, like you would add any other crate. For now it's version 0.3; later on it will be updated. And now we start to enter into the fun. So, to actually make it work, you need to add a little line in your Cargo.toml to say: okay, this is not just any crate, it depends on the compiler's private crates. So you need to do some very funny things. We'll come back to this one, but there are things you thought were long gone, like having to write extern crate to import a crate: you actually need to import the compiler's crates with extern crate, otherwise it doesn't work. It's not provided by default. The other thing is we need to create a rust-toolchain file. It's literally its name. If you've never used it: if you have a rust-toolchain file in your folder, cargo will only use the version provided inside this file, in this case the version of the compiler we're using. This is all in the documentation of rustc_tools; basically you just need to copy and paste the file into your local folder. So in here we say that as components we want rustc-dev, which means the crates from the compiler. We want rustfmt, because we are not savages: we want to actually format our code. And llvm-tools-preview, to be able to actually compile; otherwise you don't have a backend, which is also problematic. And now let's get into the code. To declare a lint, it's mostly macros. As you can see, on top we use internal rustc crates: lint and session.
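Going back to the rust-toolchain file mentioned a moment ago, it might look roughly like this; the nightly date below is a placeholder (the real value must match the exact compiler commit the linter was built against, as given in the crate's documentation):

```toml
[toolchain]
# Pin the exact compiler: inside this folder, cargo will use only this
# version. The date here is a made-up example.
channel = "nightly-2024-01-01"
components = ["rustc-dev", "rustfmt", "llvm-tools-preview"]
```

The three components are the ones named in the talk: the compiler's own crates, the formatter, and the LLVM backend tools.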
The lint crate provides some types linked to handling lints, and session allows us to give information to the Rust compiler about the things we want it to run. So here, with the declare_tool_lint macro, we create a lint called WARN_GENERICS. In capital letters; I can't do that. It's warn-by-default, and we add a message: in case you want information about it, it says "warns if any item has generics". It's an early lint pass, so it means we only have access to the AST information. I voluntarily picked this one because, to be honest, the code is much, much shorter and simpler, and for a 15-minute talk that's better. The other thing we need to do is to implement some very basic traits provided by the compiler, which we don't need to care about, so they provide a macro for that: declare_lint_pass, which in our case lets us declare a struct called WarnGenerics, and we link it to the WARN_GENERICS lint. And after that, at the end, we have the very empty implementation of the EarlyLintPass trait for our type. This visitor trait, if some of you don't know the visitor pattern, let's say: the visitor pattern allows you to implement whatever hooks you need, for example visit function. Whenever the visitor encounters a function, it will call this method, and it will be ours. If we don't care about the rest, they are already implemented; we don't need to care about them. Very convenient. In our case, we only want items that could have generics, so very likely functions and enums and everything like that. So it should be pretty easy, normally. So now we implement the lint. As I was saying: check_item. We don't have anything else to do. It provides a context, the context of the compiler at this stage, an EarlyContext, and we have the actual item. And then it's pretty simple. We have methods provided by the compiler and everything.
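The visitor pattern that the lint passes build on can be shown in plain Rust, away from the rustc internals (the toy `Item` AST, `Visitor` trait, and `WarnGenerics` pass below are all invented for illustration): the trait provides do-nothing default methods, and a pass only overrides the hooks it cares about.

```rust
// A toy AST: just enough structure to visit.
enum Item {
    Function { name: String, generics: Vec<String> },
    Struct { name: String },
}

trait Visitor {
    // Default implementations do nothing, so a pass only overrides
    // the hooks it is interested in.
    fn visit_function(&mut self, _name: &str, _generics: &[String]) {}
    fn visit_struct(&mut self, _name: &str) {}

    // The walker dispatches each node to the matching hook.
    fn walk(&mut self, items: &[Item]) {
        for item in items {
            match item {
                Item::Function { name, generics } => self.visit_function(name, generics),
                Item::Struct { name } => self.visit_struct(name),
            }
        }
    }
}

// A toy lint pass: it only cares about functions, and only when they
// have generics, mirroring the WARN_GENERICS example from the talk.
struct WarnGenerics {
    warnings: Vec<String>,
}

impl Visitor for WarnGenerics {
    fn visit_function(&mut self, name: &str, generics: &[String]) {
        if !generics.is_empty() {
            self.warnings.push(format!("no generics here: {name}"));
        }
    }
}

fn main() {
    let items = vec![
        Item::Function { name: "id".into(), generics: vec!["T".into()] },
        Item::Struct { name: "Unit".into() },
    ];
    let mut lint = WarnGenerics { warnings: vec![] };
    lint.walk(&items);
    println!("{:?}", lint.warnings);
}
```

In rustc the walking and dispatching are done for you; your pass just overrides the relevant `check_*` method, exactly as `check_item` is overridden in the talk.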
So we check, and I hope everyone knows the if-let syntax, that we have Some generics. We check that the generics are not empty, because otherwise there is no point. If we have generics and everything, then we say: okay, we found generics, we don't want generics, because reasons, and let's emit our lint. So: first, the lint name. Second, the span. The span is how the Rust compiler maps an item back to your actual source code. It's basically a beginning and an end, and you don't have to care about what it's pointing to. You just say: okay, the thing I want to lint about starts here and ends here; you underline it, you do whatever you do, and I don't care. And we have our message saying "no generics here", because we don't want generics. And the last thing is in case you want to add more information: for example, we could add a help message, and we can do a lot more. In case some of you don't know what it is, the syntax with the vertical bars is a closure, a closure taking a diagnostic type as argument. Now, the interesting part: how can we run this lint? As you can see, not much code, because rustc_tools does pretty much everything. First, we get the cargo args, because it's a cargo command; we will run cargo tools. We don't want the first two arguments, because "cargo" and "tools" are not something we are interested in. We pass the rest of the arguments, if any, into the rustc_tools cargo integration function, which internally calls cargo and builds everything with its own version, because it's not necessarily the same. And once everything is built, it generates the command line that you actually need to pass to rustc to be able to run our linter, which we do with with_lints. So this time, args is what cargo provided us, so we can now generate and run our lint. We just give it access, because it's already done by rustc_tools.
And inside this with_lints call, we need to actually tell the compiler: okay, I created a lint. It's called... I named it badly. It's WARN_GENERICS. And that's it. We have everything; we can now lint, and the compiler will do everything when we leave the with_lints function. So now, it's always nicer to be able to run it as a cargo tool. So you just run cargo install, with --path if it's local, otherwise not. And I named it in this case tools-inner; you will understand why later. So we just run it. And it doesn't work, because we are not using the same version of the compiler. Congrats. So, in this case, what's important to note is that you very much need to use the same version of metadata as the files generated by the compiler, to be able to use them with the lint. rustc doesn't understand itself if it's not exactly the same version. Even if it's just one commit of difference: no, I don't know him, don't care. No problem. So we can actually get around this limitation by providing the version, like this. So if we do... I thought I had the error output. So if we do, we actually have the tool running. But to be fair, we can't really ask our users to do that themselves; it's pretty bad user experience. So we get around that with this very long file, as you can see, which in this case is called cargo-tools. And this one will literally run the command that we saw here, itself. And that's it. It does just that: we just wrap our linter, and it just runs. So now we install it, we run it, and again I don't have the output, it's very shaming, but believe me, it works. So yeah. Like I said, I voluntarily didn't show a late lint pass, with access to the type information and everything, but I wrote a blog post explaining that much more in depth. Inside it, you have an example with unwrap, if I remember correctly, saying: yeah, don't use unwrap, use something else.
And you see how we actually get the real type information, because when you call unwrap, you need to check that unwrap is actually called on a Result or an Option. But for that, you need to actually get the type-check information, because if it's, for example, Self with a capital letter, double colon, unwrap, and then you pass your type, you actually need to infer the type. And for that, you need type-check information. You will see a lot of things that seem very easy but are quite not so easy. For example, if you want to know which type an impl block is being implemented on, funnily enough, it's quite difficult. You can get the trait very easily, but the type it's being implemented on, not so much. And thank you for your attention. More information on my blog, and you have my email and social media and everything. And thank you for your attention. So we have about two minutes for questions, if anyone has them. Yes, coming right to the back. Hello, thanks for this presentation. No, we can't hear at all. Okay. Hello again, thanks for this presentation. A few years ago, I wrote a refinement type system for Rust, as a linter. I had the courage to maintain it for about one or two versions of Rust. A few months ago, I tried to pick it up again, and everything was broken, bit-rotted, in tatters; everything had changed. Do you know if there are any plans to make things a bit less messy? Because right now it's really, really, really painful to maintain a linter. No, it's just pain, enjoy. It's a shame. No, in fact, it's actually better now, because we have fewer functions to worry about. For example, a lot of APIs that existed before only for rustdoc, because rustdoc is a compiler extension, are being less and less used, because we said: okay, we now stop accepting completely broken code. And soon enough, we'll very likely be using the same API as the lints. So normally it shouldn't be breaking as much. I don't know. How is this related to Clippy?
I can't hear you at all. Ah. Basically, it works the same way, but it exists because not all lints can be implemented in Clippy: if you have specific needs for your project, because you need higher security levels or you don't want certain code patterns or whatever, you can't expect them to be implemented in Clippy. So you implement them yourself, and that's very much why rustc_tools exists: so you can actually do it without having to set everything up yourself. Perfect. Thank you so much.
Hardware pointer checks in a Rust application near you?
Alright, we have a real FOSDEM hero standing in for Lewis, Pierre-Emmanuel again. We have two more heroes at the back who have also obviously fixed the audio. Thank you very much as well. Take it away please. Hello again. I'm still not Lewis, and I'm still not the original speaker. The talk will be even worse than the first one. Let's talk about hardware pointer checks on the CHERI architecture. Before we get started, we'll cover what we'll be talking about: memory safety, capabilities, the CHERI design, Digital Security by Design, as well as the CyberHive Connect project. We'll then talk about the motivation, CHERI and Rust, as well as the implementation and the different challenges and problems found during this work. So, memory safety. Accessing memory through a pointer, what could go wrong? You probably know the answer if you're doing some Rust, but is Rust even safe here? The problem with Rust is that once you tag code as unsafe, or you're in an unsafe context, the hardware will not back you up. It will simply let you access the hardware; if you're lucky, you have a kernel which will give you a page fault, but that's all. The hardware will not protect you against use-after-free, out-of-bounds accesses, everything. But you may already know that Rust helps us here. I mean, safe Rust is cool. There might still be something that could go wrong. So, what are capabilities? Capabilities are a kind of metadata that we embed at the assembly level with pointers. This means every pointer will have a big field of metadata: whether it can be written, read, or even just used, and how it can be used. And the second part of the pointer will be the address itself. So we can encode in this metadata bounds, permissions, validation state, all those kinds of things. And that helps us catch code that behaves badly even when the compiler thinks it is valid. So, let's talk about CHERI. CHERI is a project from Cambridge University. 
CHERI isn't an architecture itself. CHERI is more of a specification: a set of specifications for a hardware extension. It allows the creation of a capability-based system, and the specification covers everything required to make capability-based code work. So, I was talking about this metadata. Here you can see on this slide the encoding of metadata in the CHERI specification. We've got the permissions, the type, as well as the bounds of the address, in order to check any out-of-bounds access, array indexing, things like this. And you've got the 64-bit address behind it. Okay. One note on pure-cap and hybrid mode. CHERI provides two modes. Pure-cap basically means every pointer has metadata: every pointer is 128 bits. And hybrid mode is there to ensure compatibility with older, or not just older, but capability-less systems. Okay. So, here you've got an example of an instruction with capabilities. It takes an address and raises an exception if permissions are not correct or something is wrong, for example like on the previous slide. Okay. So, we've got a bounds-set instruction here. This means we can set bounds on a pointer to an array, and if we try to access this array out of bounds, the machine will trap and give us an exception. Digital Security by Design, what is it? It's a United Kingdom government initiative that wants to expand the use of CHERI out of academia into industry. They fund multiple projects to demonstrate the application of CHERI and make it work in the real world, in industry. Initially, it revolved only around Morello. You may not know Morello: Morello is a CHERI extension of the ARM architecture. Recently, they have focused more on other architectures such as RISC-V, for example. CyberHive Connect. CyberHive Connect is a security-critical application written in Rust. It wants to implement end-to-end encryption over a mesh network. So, yeah, here you've got an example. This application is a security-critical application. 
And it works with a mesh network and end-to-end encryption. So, this means obviously there should not be vulnerabilities. Okay. So, why CHERI and Rust? Rust already provides different protections, but some protections cannot be provided by Rust statically. For example, there are runtime enforcements provided by Rust, but those slow down execution. You may have seen out-of-bounds checks when you index an array, that kind of thing. That kind of code is slow, but if you replace it with the CHERI extension, it can become faster: on a CHERI extension, an out-of-bounds array access will simply trap. You don't have to handle the check yourself; you just have to handle the trap. Also, when you need to connect an application to Rust code, for example through the FFI, the foreign function interface, you may be safer, because the CHERI extension will be there to back you up and provide you correct pointers. And you can be sure that the pointer you'll be using in Rust doesn't come from nowhere, or isn't a pointer at all, or whatever. So, yeah, unsafe can become, in some way, safer. Here's an example. We've got an array, we convert it to a pointer, we make a string, we try to read a line and parse a number, and then, at the end, we try to add the index to the pointer. And as you may have seen, we are using unsafe code, so the Rust compiler won't catch any of this, because we told it not to. So here, CHERI might help us: CHERI will raise an exception when we try to go out of bounds in the array. Lewis provided two new targets for the Rust compiler, a Morello pure-cap target and a Morello FreeBSD pure-cap target. As you may have seen, both are pure-cap. This means they are not compatible with hybrid mode, and these implementations are not compatible with standard pointers. Which is to say, all pointers have capabilities enabled. 
So, here, we have a new pointer type with capabilities coming into the Rust fork, and all those files are available in the repository right here. There were different implementation challenges. We had to provide a new pointer type with capabilities. There is something that created debate a few months, slash, years ago: the usize type. What should usize in Rust represent? Should it cover the entire addressable space? Should it be able to contain a whole pointer? That kind of thing. We chose to represent only the address part of the pointer within usize. Layout and address space differ for pointers with capabilities; more on that later. And we had to generate some CHERI-specific intrinsics for the LLVM IR. Again, as I said, usize is not a pointer. Okay. There should have been a demo, but I don't have one, so, well, enjoy the screenshots. Okay, so here we get a segmentation fault when we make an out-of-bounds access in our array, even if we don't hit an unmapped page, for example. So, that's cool. Let me skip the slide. Yeah. Sorry. Okay, future work. So, what will Lewis concentrate on next? He will add more CHERI targets, hopefully, and possibly some hybrid-mode targets. We want the Rust test suite to pass: for now, only 50% of the tests in the suite pass. Then refactor the code, document the code, and rebase on a newer version of Rust, because right now it's on Rust 1.67. So, yeah. And Lewis would like to begin upstreaming his work. Well, thank you. And sorry again for this whole talk. If you've got questions, I may be able to answer them, but to be fair, probably not. Thank you. What other targets are you looking at, then? Are there other targets besides Morello which actually implement CHERI today? I'm sorry, I didn't hear. Are there actually targets which implement CHERI today besides the ARM Morello thing you showed? I don't know. 
I mean, RISC-V talked about some RISC-V extension, but the gentleman behind might be able to answer. So thank you. I'm one of Pierre-Emmanuel's colleagues. There are a number of RISC-V implementations out there. Codasip demonstrated one at the RISC-V Summit, and Microsoft, and I believe lowRISC also have ones as well. So RISC-V is actually running ahead of ARM, if anything. But are the RISC-V implementations so far virtualized, or are there any boards which support CHERI? I'm sorry? Regarding the RISC-V implementations, are there so far any boards which support CHERI, like RISC-V CHERI, or is it mostly a virtualized QEMU environment? I suspect these have only ever been made by the development teams as demonstrators on FPGAs, but Codasip certainly intend to be able to ship stuff to their customers, and I think before too long there'll be hardware available. You have a slide about GDB. Do you have GDB support for someone who prints one of these pointers, like the semantics of the, you know, the extra capability bits? If we take a look at Lewis's work, in fact, the capabilities are stored in address space 200, if that makes sense. So there is some kind of support, but I believe it's more of a hack than the real thing. I'm not sure; as I said, I really don't know much. So just to follow up: I believe there is reasonable support for GDB and CHERI on CheriBSD, and it displays all the things you need within GDB. Any more questions? If not, then let's thank our speaker again.
Friend or Foe Inside? Exploring In-Process Isolation to Maintain Memory Safety for Unsafe Rust
All right, let's settle down. We have Merve Gülmez. She's going to talk about friend or foe: exploring in-process isolation to maintain memory safety for unsafe Rust. Thank you very much. Take it away. Hello, everyone. I am happy to be here. Let's get started. I hope this one is working now. As you see in previous presentations, there is an uptake of Rust in big projects, for example Rust for Linux, or Mozilla, or, currently happening, Rust in the Windows OS. For example, Mozilla is now 11% Rust, and the rest different languages, for example C and C++. Of course, all of these real-world developments require mixed-language applications. Also today, the previous talk talked about unsafe Rust. Rust actually hides two languages: one of them is safe Rust and the other one is unsafe Rust. Unsafe Rust doesn't enforce memory safety guarantees, and why do we need it? Sometimes we want low-level control, or implementation details, or sometimes we need it for optimization. In the CHERI talk, they showed a demo of this: unsafe Rust can completely violate a Rust application's memory safety. It can dereference raw pointers, or it can allow us to call unsafe functions via the foreign function interface. Academic work shows that more than 72% of Rust crates depend on at least one crate using unsafe FFI. Now we have two things, safe Rust and unsafe Rust. Unsafe Rust says: trust me, I know what I am doing. Should we trust it, or should we do something and put up a shield? And the gap is here, actually. As I always mention, mixed-language applications undermine the memory safety guarantees of safe Rust. As a result, isolation is really needed. And I am a PhD student, I am a researcher. There is a lot of academic work to address this issue, for example ERIM, TRust, Sandcrust, and so on. But what is the difference between these different academic publications? 
They either say that we should use process-based isolation or in-process isolation. When you have process-based isolation, firstly you have integrity: each process has its own virtual memory space. And the other nice thing is you have resilience: each process is its own failure boundary, and if one process crashes, the others are not affected. A good example of that is multiprocess software architecture. On the other side, we have in-process isolation. That means you have one address space, and inside this one address space, how can we isolate one part of it? For example, you want to protect just a key, or you want to protect one part of the application. Of course, in-process isolation can significantly reduce the runtime cost, because context switching is cheaper compared to traditional process isolation. And I put the early approaches here. The small box, with something inside it, means in-process isolation, and the others are sandboxes providing process-based isolation. I would like to highlight something: just one of them offers crash resistance, but it is process-based isolation. We have SDRaD here, but SDRaD doesn't support Rust, it only supports C applications. And I did some measurements, and according to these measurements, the overhead of process-based isolation is actually 10 times higher than in-process isolation. But the gap here, actually, is: can we have the best of both worlds? That means, can we have integrity and failure boundaries similar to process-based isolation, while also having overhead similar to in-process isolation? And my goals are, firstly, to maintain the Rust application's safety, also to increase the software resilience of Rust applications, and also to provide ease of use in development. In my earlier work, we provided such an approach for C applications. 
This is secure rewind and discard, an approach for recovering a vulnerable application after an attack is detected. And how do we achieve that? First, we compartmentalize the application into distinct domains, and we want to make sure that a memory defect within a domain can only affect that domain's memory. This approach relies on hardware-assisted software fault isolation, using memory protection keys for userspace. And how do I detect faults? I use different pre-existing detection mechanisms, for example stack canaries and domain violations. As a result of this work, we published a C library, the SDRaD library. If you want to check it out, you can scan the QR code. Now I would like to explain the high-level idea. We have a function F, and it is unsafe. If we put a sandbox around it, we want to have some memory safety guarantees and we want to have some isolation. So let's get started. We have a parent domain, and we want to run this function F in another domain. Another domain means stack and heap isolation. To run my function in the nested domain D, firstly I need to push the arguments, enter domain D, and pull the arguments from the parent domain. Then invoke F: I am executing the application, executing the function F. And the question is: did something bad happen or not? But we now have a guarantee that if the nested domain has a memory vulnerability, it will not affect the parent domain's memory. The parent domain is still secure. And I am saying that you don't need to crash your application, actually; you can just continue the execution. How? Probably all of you know that Rust has an API around panics. We run the function F in the nested domain and afterwards we check the result. 
If the result says that something bad happened, for example I can detect a stack memory overflow or a domain violation, it will return an error. If everything is okay, we don't need to do anything; it will just return Ok. The idea is that we are using this panic API, but we are actually adding a new feature: panic now also has memory safety guarantees. It means that when a panic happens, you can still continue your execution. That is the rewind and discard I explained. And if nothing happened, if we didn't detect any memory safety violation, we just push the mutable arguments and return value back and return to the parent domain. For this whole high-level process, the SDRaD C API already offers this pink box; the point of this blue box is how we can track the Rust types of the arguments, how we can push them into another domain, and how we can pull them back again. Probably all of you know that we have a lot of serialization crates. And what is serialization? It means we encode the data in a format, like I just put it in a jar, and afterwards we deserialize: when we jump to the nested domain, we deserialize. As a case study, I worked with two Rust crates, bincode and abomonation. Bincode transforms data into a common binary representation that allows passing data between different platforms. I mentioned that Sandcrust is a process-based isolation mechanism, and it uses bincode. But actually, we realized that bincode is redundant for our case, because unlike Sandcrust, with SDRaD-FFI we are interested in a single platform. So I explored abomonation: it is based on the Rust object memory layout representation, but it is specific to a single platform. It doesn't store any metadata or type information; it can deserialize in place without further processing. And we realized that abomonation is efficient and suitable for our purpose. And we did some benchmarking. 
Snappy is a fast compression C library from Google, designed for high-speed compression and decompression of data, and it is also presented as an FFI example in the Rust book. What did we compare? We compress and uncompress randomly generated data of different sizes, and we measure the execution time of each operation for different serialization crates, bincode and abomonation. I show it as a demo here: when I did all this, I just used a sandbox macro. The sandbox macro ensures that this compress function runs completely in a different domain and will not affect the parent domain. And uncompress is the same. Here we tested with different numbers of bytes, and this is the execution time. What are our lessons learned? Of course, if the number of bytes is small, the in-process isolation approach clearly outperforms process-based isolation, because with in-process isolation you don't have so much overhead. But the interesting part comes later: we realized that even for modest-sized arguments, the context switch is not what matters anymore; the cost is dominated by the data serialization method you use. And lesson learned three: the data serialization method can significantly impact performance, and it is critical to optimize it for different use cases. If you are working on or developing serialization crates, we can talk about how we can improve them or fit them to our use cases. In summary, we introduced secure rewind and discard with isolated domains for Rust FFI. We have two goals. Firstly, we want to protect the integrity of a Rust application from memory safety violations in unsafe code. The main point I would like to highlight is that we are increasing the Rust application's availability, because we still have an option: if the unsafe portion of our application hits a memory safety violation, we can roll back. We have an option there. And I provide the Rust FFI crates as open source, if you would like to try them. 
And what is our takeaway? The in-process isolation approach clearly outperforms process-based isolation. But the other important thing is that the data serialization method can significantly impact performance. Thank you; do you have a question? Can you quickly explain how these domains actually work? How do you enter a domain, and how do you define what part of memory is inside the domain and what is outside? Of course, it is actually handled by my C library from before; I wrote it. For Rust, if you just use the sandbox macro, it will handle it automatically. But if you go into details: for each domain I create a new stack and a new heap memory area. Earlier there was a talk about allocators; you can specify an allocator for a specific domain. Entering a new stack, what does it mean? Just change the stack pointer and continue execution there. So you do a stack switch and jump to an entry point? Yes. But the important point, to do this rewind and discard, is that you should first save your execution context in a secure way. This is how we can recover. That is kind of like setjmp and longjmp style? Yes, setjmp and longjmp, but in a secure way. Now we have a guarantee that... Then you use some hardware mechanism to make sure that certain memory is only accessible from certain domains? Yes, exactly. That is true. That is completely true. This is an Intel feature, memory protection keys; we are using that one. It is lightweight. That is why. Because you don't need system calls? Yes, exactly. You don't need a round trip into the kernel? Yes, exactly. Okay. First, thanks for the great talk. When deciding which piece of memory you put in the new domain: is the global state shared between different domains, or do you copy all the global state? The current version just supports heap and stack memory. For global shared state, you should copy and pass it; it will not happen automatically in your application, you have to change it. 
But as future work, I would like to support this: how we can actually sync global shared state between different domains. That could be very costly. Sharing and copying the global state could be very costly. Yes, exactly. For example, here also, even though I have improved the isolation, exchanging arguments between domains creates a lot of overhead. Yes, this is the bottleneck now: how can we improve this part, how can we pass the function arguments from one domain to another. This is the actual cost, actually. Second question. You copy back all the mutable arguments. Do you do that even if they are not changed, or do you do that all the time? I just push them: if they are mutable, I push the arguments. But you don't check if they have been changed by the function? If they are mutable, then you copy them back. Yes, exactly. So it is a static check and not a runtime check. Thanks. Thanks for your nice question. Awesome. Sorry, unfortunately that was all we had time for. Can we give another round of applause? Thank you to Merve. Thank you. Thank you.
The Four Horsemen of Bad Rust Code
Let me do a quick survey. Who has a JavaScript background? Okay, maybe like 10%. Who has a C background? C++? Holy hell, it's like 80% for the people on stream. Who has a Python background? What are you, polyglots? What's going on? 70% or so. Any other languages? Just scream out. I heard something, but I can't really remember. Does anyone own this book? I found this book in my attic, and it was kind of peculiar because it had some arcane incantations in it and it looked like magic, but it certainly had something to do with Rust. And I was really excited; I was really enticed by this book. This is why I want to talk about that book. It was pretty old. There was one section in there which I really liked, and it was called the Four Horsemen of Bad Rust Code. This is what this talk is about. Before we get into what the Four Horsemen are, I would like to introduce myself. I'm Matthias. I live in Düsseldorf in Germany. I've been doing Rust since around 2015. I do Rust for a living as a consultant. I did a Rust YouTube channel a long, long time ago called Hello Rust. Only 10 episodes, but well, what can you do? And lately I started a podcast called Rust in Production. If you like what I say in this talk, maybe you also want to subscribe to the podcast later on. That's it for the advertisement; back to the Four Horsemen. I thought about this title a lot. Why would you talk about bad Rust code? From my experience as a Rust consultant, I see patterns evolving over time. I see people doing the same things in Rust that they do in other languages. They repeat the same mistakes, and I saw that no one really talked about those problems. That is an issue when you come from a different language and you try to learn the rustic way, the idiomatic way, to write Rust code. This is what this talk is about. Let me present to you the antagonists. While I do that, try to picture yourself. 
Imagine who you are and what you think your role would be in this talk. The first horseman is this. Actually, let me show all of them. The first one is ignorance. What is ignorance? Magical little term; we will get to that on the next slide. And we have excessive abstraction, premature optimization, and omission. Of course, you could add your own personal Rust horseman. These are just very subjective, but these are the things that I see in the real world. Now that we have introduced the antagonists, let's go through their anti-patterns and what they are famous for, one by one, starting with ignorance. The horseman behind this pattern is someone that uses stringly-typed APIs. You have seen it before: someone uses a string where they could have used an enum, or they don't really embrace pattern matching. That makes APIs brittle. You are in a situation where, if you refactor something, you run the risk of forgetting that you changed something, or maybe you make a typo and then your string is incorrect, so it doesn't represent what you want to represent. They also freely mutate variables. They go and say: yeah, this is state and I can change it. Rust has the mut keyword for this, but they use it liberally across the entire code base, which makes reasoning on a local scope very, very hard. They also use bad or no error handling; we will get to that in a second. They use unwraps a lot, and they don't really think about the error conditions of the application. They also have a lack of architecture in their applications, and they use a general prototyping style of writing Rust code. Where do they come from? Usually these are people who were administrators before, or they write shell scripts, or they come from scripting languages. And this is what they know. Nothing wrong with that, but they haven't fully embraced what Rust is capable of offering. How do you discover that you belong to this group? 
Well, if you do things like this: you have highly imperative code. You go through the code and you tell the program, hey, do this, do that, do this, do that, instead of using, for example, a declarative way of describing what the state should be. They also use magic return values like minus one or an empty string to represent a certain special value instead of using errors. Everything is a string. Unwrap is used freely. You clone all the things and you use the mut keyword. Why is cloning a bad thing? I don't think it is. But the problem with clone is that maybe you don't buy into the Rust model of ownership and borrowing. That means you bring what you learned in the past from other languages to Rust, and at some point you run into issues with your architecture which you cannot easily resolve anymore. This is why clone is kind of a stop sign. It's not a warning sign, but it should make you think for a moment. It's an indicator of structural problems in your code, if you like. Okay. With that out of the way, let's make it a little more practical. How could we put this into practice and improve our code step by step? Imagine you wanted to calculate prices for different cities for a bunch of hotels that you have in these cities. For example, imagine this was a map. This is an actual map, by the way. Africa does not look like this. And also, Jerusalem is not the center of the world. I mean, we can debate about that, but certainly geographically there are some issues with this map. Imagine your input looked something like this. It's a CSV file. You get a hotel name, a city, a date, a room type, and a price. You go through this file line by line and you try to parse it into something that looks like that. For Brussels, you have a minimum hotel price of 40 bucks, a mean price of 80, and a maximum price of 150. Fun fact: I arrived yesterday not having a hotel room, because I thought I had booked a hotel, but it was for last year. 
So I was in the upper range here. Thanks, Walshbeng, by the way, for sharing your room with me. Otherwise, it would have been a nightmare. If you wanted to parse the input file and create a result like this, all you have to do is write this code. That's the entire code. Nothing really big going on here. There are some peculiarities, but this is usually what someone would write who would say Rust is not their first language. Maybe they just tried to port whatever they had in another language to Rust. This is code that I see them writing. What you do is: you read the CSV file, then you create a hash map of cities, then you iterate over each hotel, you try to parse the data by splitting each line, you extract fields from it, you parse the price, and then you update the city. Updating the city happens somewhere near the lower end. At the end of it, you print the mean, the max, and the minimum. That's it. That's the entire code. You know, it's working. Technically, you could run this code and it will produce the result that you expect: prices for different cities. We're done, right? Unless we think about the bigger picture and the demons and the monsters that are out there in the ocean, and they can haunt us and bite us. There are dangerous beasts out there, killer animals. I think what you want to do is improve that code a little bit. How can we make this code a little more idiomatic? This is the same code. Now, let's look at some parts that I personally wouldn't want to have. Consider this block. There are some things going on, but overall, it's a very manual way, a very imperative way of going through the list of hotels. We literally have a couple of if conditions here: if price is smaller than city data zero and so on, we update the price, yada, yada, yada. There are patterns that make that a little nicer to read in Rust. This is the same code, something very similar, but we managed to shrink it down a little bit. 
In comparison to what we had before, we get the city data and then we use some sort of tuple extraction to get the mean, the minimum, and the max. That makes things a little easier: we can suddenly talk about mean instead of city data zero, for example. But that's not the major problem with this code. There are unwraps in here, too. Well, for a first prototype that might work fine, but later on, maybe you don't want to have that. What if you cannot open the hotels CSV file? What if you cannot parse a price? In this case, the entire program just stops. It's a question of design, but I would say if there's a single line that is invalid, you probably don't want to stop the execution right away. Another problem is that we index into the memory right away. Who tells us that a line has that many entries, five entries? It might have three. It might have zero. Who knows? But if we index into something that doesn't exist, the program will panic, and that is kind of a bad thing. The underscores mean that the variables are not used, so we can remove them. We have a little bit of a cleaner structure, and a simple way to check that a line is valid would be to just have this manual check in there. I know it's not very sophisticated, but it helps us along the way. Now we check if the hotel data length is five, and if it is not, we just skip the entry. Let's look at parsing for a second. How do we want to handle parsing? I said that maybe we don't want to stop the execution when we run into an issue, and we can do that in Rust by matching on the parse result. A very simple way to do that would be to say match price.parse, and if we have an Ok value, we take it, and if we have an error, we don't really care about the error: we just print an error on standard error and then we continue with the rest of the parsing. Looking at the input, one thing we can do as well is apply a similar pattern and introduce a result type. Now we use a boxed error for the result type. 
This is because you don't need any external library to have a result type with an error type which can be literally anything. So it can be a string, anything that implements the Error trait. In this case, it's a very simple way to improve your Rust code. It's a good first step. What we do instead now is we say read_to_string and then we map the error, in case we have an error, to something that a user could understand and act on. Then, yeah, the code is already a little cleaner. We handled a few error cases already, and this is something that might pass a first iteration of a review cycle. Now of course there are certain other issues with this code. For example, CSV handling. CSV is tricky. Proper handling of delimiters is very hard. For example, you might have an entry which has semicolons like on the left side here, or you have something that has quotes around a semicolon, and you probably want to handle that. So a simple string split does not suffice. Same with encodings. On what platform are we operating? Do we know the encoding right away? Does the CSV file contain headlines or no headlines? And there are many, many caveats like that. If you're interested, there's a talk called Stop Using CSV. I don't say you should stop using CSV, but I say you should start watching this talk because it's really good. Right. How can we introduce types? I talked about types a lot, and Rust is great with types. We should use more of them. Here's a simple way. I already talked about the result type, and in the first line we just create an alias for our result: we say it's a result over a generic T, where the error type is Box<dyn std::error::Error>. And then we can use that result in our code to make it a little easier to read. As well, we introduce a Hotel struct, and we have a couple of fields, just strings and floating points at this point. But this helps us make the code a little more idiomatic already. We will combine those things on the next slides.
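Put together, the pieces from the last couple of slides might look like this sketch — the field names and messages are illustrative, not the talk's exact code:

```rust
use std::error::Error;
use std::fs;

// The alias described above: any Ok type T, boxed dynamic error.
type Result<T> = std::result::Result<T, Box<dyn Error>>;

// Sketch of the Hotel struct; the real field set isn't fully shown in the talk.
#[derive(Debug)]
struct Hotel {
    name: String,
    city: String,
    price: f64,
}

// map_err turns the raw io::Error into something a user can act on.
fn read_input(path: &str) -> Result<String> {
    fs::read_to_string(path).map_err(|e| format!("cannot open {path}: {e}").into())
}

// Matching on the parse result instead of unwrapping: log and skip
// a bad line rather than stopping the whole run.
fn parse_price(raw: &str) -> Option<f64> {
    match raw.parse::<f64>() {
        Ok(p) => Some(p),
        Err(e) => {
            eprintln!("skipping bad price {raw:?}: {e}");
            None
        }
    }
}

fn main() {
    let hotel = Hotel { name: "Grand".into(), city: "Brussels".into(), price: 120.0 };
    println!("{hotel:?}, parsed: {:?}", parse_price("12.5"));
    if let Err(e) = read_input("hotels.csv") {
        eprintln!("{e}");
    }
}
```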
But first let's look at the CSV parsing. There's a CSV crate. I advise you to use it. It's pretty solid. And what you can do is you create a builder, and a builder pattern allows you to modify a struct and add members or modify members dynamically. And in this case we decide that our CSV file has no headers and the delimiter is a semicolon. And the way you can use it is like this. You now say for hotel in hotels.deserialize(). No more string splitting. And now we match on the hotel, because this returns a result, and we need to make sure that the hotel that we parse is in fact correct. And after this step we don't have to deal with edge cases anymore, because we know that the struct is valid. That means it has the required amount of fields, and prices are also floats. Which is great, it makes the code much more readable already. And it was very simple to do so. Now I want to quickly talk about this part. There's a cities hash map. It has a string, which is the city name, and then it has three floats, which are the mean, the min and the max price. I don't think this is particularly idiomatic. The way it was used before was something like this, and we kind of managed to work our way around it. But a better way, I would say, would be to introduce a type for this as well. Because if we're talking about prices, and pricing seems to be something that is very central to what we do in this application, maybe we should have a notion of a price. It's very simple to do that. You just introduce a Price type. Now you might be confused why we suddenly don't have a mean anymore, but instead we have a sum and a count. And the reason being that when we parse the files we update the sum, and later on, at the end, we can calculate the mean. Which has some mathematical properties which are favorable, because now we don't run into rounding issues anymore. This is an aggregation that we can do whenever we want, to get kind of a mean on the fly.
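One possible shape for that Price type — a sketch, not the slide's exact code — storing a sum and a count and deriving the mean on demand:

```rust
use std::fmt;

// Sketch of the Price aggregate discussed above.
struct Price {
    min: f64,
    max: f64,
    sum: f64,
    count: u64,
}

impl Default for Price {
    fn default() -> Self {
        // min starts at f64::MAX so any real price overwrites it;
        // max, sum and count start at zero (assumes non-negative prices)
        Price { min: f64::MAX, max: 0.0, sum: 0.0, count: 0 }
    }
}

impl Price {
    fn add(&mut self, price: f64) {
        self.min = self.min.min(price);
        self.max = self.max.max(price);
        self.sum += price;
        self.count += 1;
    }
    // the mean is computed on the fly, avoiding incremental rounding issues
    fn mean(&self) -> f64 {
        self.sum / self.count as f64
    }
}

impl fmt::Display for Price {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:.1}/{:.1}/{:.1}", self.min, self.mean(), self.max)
    }
}

fn main() {
    let mut p = Price::default();
    p.add(10.0);
    p.add(30.0);
    println!("{p}"); // min/mean/max → 10.0/20.0/30.0
}
```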
And at the same time we have a default. Now the default is not really idiomatic too, I would say. But the great part about it is that we can later reuse it and make our code a little more readable. In this case we set the min price to the maximum float. But then whenever we introduce a new price, it will overwrite the maximum, because I guess by definition it's smaller than the maximum, or smaller or equal. And same for the max, and the sum and count are kind of set to zero to begin with. And just before we bring it all together, here's one more thing that we should do, which is have a notion of a display for Price. In this case we implement the Display trait, and we say, yeah, if ever you want to print a price, this is the structure that you should use: the min, the mean and the max. And then this way we can make our code way more readable. Now you can see that instead of using a tuple of floats here, we use a Price. And when we update the prices we can talk about this object. We can tell the object, hey, update your min, for example. Here we say price.min.min(new_price), and we automatically get the min price as well. We update those price fields, and yeah, we can even introduce a price.add method. I don't show it here, but technically, why not. We can add a new hotel price. Prices could be added over time. Now that depends on, I guess, your taste, your flavor of Rust. This is the entire code. It's a little longer, but you saw all the parts. And now you have something that I would say is in a workable state. It's not great, but we did one thing. We considered Rust. We fought the ignorance. We started to embrace the Rust type system. We started to lean into ownership and borrowing, which are fundamental concepts in Rust. We lean into design patterns and we learn how to improve our architecture. And I would also say, if you want to improve this part, try to learn a different programming paradigm. Rust is not the only language. Try Roc, or try a functional language like Haskell.
It might make you a better Rust programmer too. This is how you fight ignorance. Now if you see that none of these horsemen fits you, by the way, just think of your colleagues, how you would want to introduce them to Rust, because this is the code you have to review and also probably maintain in the future. So it's time well invested. If you want to learn more about idiomatic Rust specifically, there is a website. I just put it there. It's an open source repository. It has some resources. This is a rendered version of it. You can sort by difficulty, so that's your experience, and then you can sort by interactivity, if you want to have a workshop or not. For example, there are free resources on there and paid resources too. Right, let's go on and look at the next horseman. Excessive abstraction. Everyone in this audience knows someone like that. They try to over-engineer solutions because Rust leans into that. It allows you to do that. It's a nice language to write abstractions. Everyone likes to do that. But then you add layers of indirection that maybe people don't necessarily understand if they come from a different background. They use traits excessively, and generics, and lifetimes, and all of these concepts are great in isolation. The combination of which makes the programs hard to read and understand for newcomers. Now if you find yourself in this camp, try to fight this as well. Common symptoms of this are things like this, where you have a FileBuilder which takes a T: AsRef<str> and a lifetime 'a, and this makes sure that you can pass any type and that it has no allocations that are not visible, because of the lifetimes. So this might be fast, and it might also, to some extent, be idiomatic, but it is something that your colleagues also have to understand. Another thing is: I might use this again, let's make it generic. Or: traits everywhere. And how do you get to that mindset? It's very simple. After you wrote your CSV parser, it's natural that you want other parsers too.
Of course you want JSON. Of course you want to read and write into a database. You start thinking that you'll need all of those formats at some point, and this is the part that is important: at some point. And then you end up with something like this. It's a trait definition for a hotel reader, and it has a single method called read, and it takes a self, that's why it's a method, but it also takes an R which implements Read. That means you can pass anything that implements the Read trait, and it returns a Box of an Iterator with Item = Result<Hotel> and a lifetime 'a. No allocations except for the Box, but the iterator itself, yielding a Result of Hotel, is a very idiomatic way to say that parsing errors are considered, and it's very applicable for all of the reader types that you could possibly want. Let's say you wanted to use that trait and implement it for our hotel reader. Now suddenly we blow up the code to something that is harder to understand, or if it is easy for you to understand, please reconsider whether your abstractions are too much. Maybe you ain't gonna need it. Right. So we have a CSV hotel reader and it owns a ReaderBuilder, and inside of our new method we initialize the CSV hotel reader, and we implement HotelReader down here. The single method called read, and we say self.reader_builder — this is the code that we saw before, we just put it here, this is our CSV parser, the initialization of it — and then we return reader.into_deserialize over Hotel, and this is where we map the errors. Right. Does it look great? I don't know, depends. Someone's nodding. We need to talk. But it's certainly nice to use, I guess. Now we can say for hotel in hotels.read_file(). Should hotels know about files? Maybe not. But it's great if you go one step further and you implement Iterator on it, and now you can say for hotel in hotels. Alright, we're getting somewhere. From a user's perspective that is really great. But remember, we're talking about application code.
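For reference, the over-generic trait shape described above might look roughly like this sketch; the toy line-based implementation is purely illustrative, not the CSV-crate-backed version from the talk:

```rust
use std::error::Error;
use std::io::Read;

type Result<T> = std::result::Result<T, Box<dyn Error>>;

#[derive(Debug)]
struct Hotel {
    city: String,
    price: f64,
}

// Generic over any Read, returning a boxed iterator of parse results
// tied to lifetime 'a — flexible, but a lot to take in for newcomers.
trait HotelReader {
    fn read<'a, R: Read + 'a>(&self, input: R) -> Box<dyn Iterator<Item = Result<Hotel>> + 'a>;
}

// Toy implementation: one "city;price" pair per line.
struct LineReader;

impl HotelReader for LineReader {
    fn read<'a, R: Read + 'a>(&self, mut input: R) -> Box<dyn Iterator<Item = Result<Hotel>> + 'a> {
        let mut buf = String::new();
        if let Err(e) = input.read_to_string(&mut buf) {
            let first: Result<Hotel> = Err(e.into());
            return Box::new(std::iter::once(first));
        }
        let parsed: Vec<Result<Hotel>> = buf
            .lines()
            .map(|line| -> Result<Hotel> {
                let (city, price) = line.split_once(';').ok_or("missing ';'")?;
                Ok(Hotel { city: city.to_string(), price: price.parse()? })
            })
            .collect();
        Box::new(parsed.into_iter())
    }
}

fn main() {
    let reader = LineReader;
    for hotel in reader.read(std::io::Cursor::new("Brussels;120.0\nParis;oops")) {
        println!("{hotel:?}");
    }
}
```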
This is probably code that you earn money with. It's not a library function that is used by thousands of people. It's your simple CSV parser, and now we just blew it up into something that is harder to understand. Do you really need this? Well, I don't think so. I don't know what this person on the bull does, but it certainly looks confusing to me, and this is what people think when they see the top signature. I know, kind of, you wanted to optimize it a bit, but at what cost? Right, whenever you sit here and you think, oh, I should implement JSON support, and you don't do it for fun, start thinking whether you really need those abstractions, because they can haunt you. Most of the time you have no need of them. I don't know what sort of animal this is. Is it a lion cat or something? But it's kind of strapped to a cannon and it doesn't look too happy to me. I don't want this. Probably you're not going to need it. As a side note, another thing you probably shouldn't do too often is macros. There are crates out there that excessively use macros. What do I mean by macros? macro_rules! macros, but also derive macros, and these are great, but they come at a cost, and the cost could be compile times. Just yesterday I talked to Daniel Kerkman, who, I don't know, is he here? He's not here. But thanks for the tip. He has a situation at work where compile times just blow up because of macros, and for you it might be easy to write, but for other people it might be hard to use. Maybe you want to prefer traits over macros if you can. That was the second horseman, fighting excessive abstraction. How can it be done? If you find yourself in this situation, keep it simple. Avoid unnecessary complexity. Just think that the person that will maintain the code is not a mass murderer but your best friend. Do you treat friends like this? Watch newcomers use your code. That can be humbling. Ensure that abstractions add value. Yes, you can add a layer of abstraction. Does it add value? That's up to you.
Decide, and don't add abstractions that you might need in the future. Add them when you need them. Right. Two off the list, we have two more to go. Next one is premature optimization. This is for a lot of people in here, because you are C and C++ programmers. I'm looking at you right now, because 90% of you raised your hand. I see a lot of people from C and C++ come to Rust with this mindset, with these patterns. What are the patterns? They optimize before it's necessary. This is importantly different from adding too many layers of abstraction. Optimization in this case means profiling is not done, but instead you kind of try to outsmart the compiler, and you think about performance optimizations way too early, before you even need it. Did I even tell you how big that CSV file was in the beginning? How many entries does it have? You don't know. Maybe you should not optimize for it right away. They use complex data structures where simple ones would suffice. For example, we saw the hash map with the three tuple elements. These are things that you kind of have to unravel, and then it ends up being a mess, not very idiomatic and arguably not even faster. And they also have a tendency to neglect benchmarks. Some red flags, quotes you might have heard: "Without a lifetime this needs to be cloned." Ignore that. If you know that you have a performance problem, then you can think about lifetimes. It's fine to clone. "Let me help the compiler here." "The Box is so much overhead." "I use BTreeMap because it's faster than HashMap." "No need to measure, I've got years of experience." They love the terms zero-cost abstraction or zero-copy. Actually, it should be zero-cost here. And they hate allocations. Whenever they look at an allocation they feel terrified, and they bend over backwards to make that program faster. So whether this is the developer or the compiler, and vice versa, is up to you. I've been in both situations.
They turn a completely simple hotel struct with a couple of string fields, which are owned, yes, they live on the heap, into something that lives on the stack and has a lifetime. And every time you use a hotel you have to carry on the weight of the lifetime. Well, does it matter for this one particular case? Probably not. But then you look at other places of the code base and you see that they kind of reverted your changes. They took away what you introduced, your hard-won knowledge about the abstractions. Now we start to index into our data structure again. We use string split again. We go backwards. We've been there before. It is super fragile. Again, we are going backwards. Now let me play a little game here. Since there are so many C and C++ programmers in here, I expect you to answer this. What is the bottleneck? This is a very famous medieval game, Who Wants to Be a Millionaire. What is the bottleneck? Is it A, CSV parsing, the deserialization of our entries? Is it B, string object creation after we deserialized it and put it into a hotel struct? Is it C, floating point operations when we parse the price? Or is it D, hash map access? Who's for A? Some shy hands? Don't be shy. Who's for B? Okay. Nice. Who's for C? No one. And who's for D? The hash map. Nice. The correct answer is: you forgot to run with --release. How do you find the actual performance improvements? There's just one correct answer, and it is measure. Profile. Use the tools. cargo-flamegraph, cool thing, you will see that in a second. Use benchmarks. There's Criterion. Is Nick still in the room? No. His benchmarking tool, Divan, pretty great. Use it. Okay. I will give you one example. Let's look at a flame graph of our initial program. The one that a junior developer could write in two hours. What is the bottleneck? There is no bottleneck. This is the setup of the flame graph itself. This is the profiler setup. The code itself is negligible. Negligible, I guess.
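As an aside, even a deliberately crude std-only timing check beats guessing. A sketch (the talk's actual tools are cargo-flamegraph, Criterion and Divan; and the numbers only mean anything when built with --release):

```rust
use std::time::Instant;

// Hypothetical workload standing in for the CSV parsing — measure it,
// don't reason about it from first principles.
fn workload() -> f64 {
    (0..1_000_000).map(|i| (i as f64).sqrt()).sum()
}

fn main() {
    let start = Instant::now();
    let result = workload();
    println!("result = {result:.1}, took {:?}", start.elapsed());
}
```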
And why is that? Again, because I didn't tell you how big the file was. Do you think I can come up with thousands of hotel entries? No. So I added 100 entries. There is no bottleneck here. Okay. You might say, but okay, what if the file grows? Let's add a million entries. Okay. Oh, this is still 120 records. So let's add more. This is a million. You probably ain't going to read it. Let's increase it to 10 million. And indeed, deserialization of the struct takes most of our time. Okay. If we look a little closer, it says serde deserialize, deserialize struct. Okay. We have some memory movement going on. Let's take a baseline. That is our baseline. This is what it takes: 34 seconds. Okay. Now, let's say we kind of want to prove our C and C++ developer wrong. Does this extra abstraction that we added for the hotel struct really add that much overhead? No. It's the same. It's like 34 seconds still. Oh, actually, this is the part where we remove the unnecessary fields. But we can go further. We can say, yeah, here we have a little safer version. We don't index, but we use nth(1). And we have 32 seconds. Now, our bottleneck is append string. String appending. Okay. I think there's something that we can fix. Well, okay. Maybe this is not really that readable. But what we do is we split now by a string, and instead of doing an allocation where we append to our string over and over again, we use this pattern matching here. And this reduces the runtime by 30% already, because we save on allocations. Now, if we try to profile this code again, where's the bottleneck now? read_until. Okay. What is that about? We have a lot of memory movement going on. And now we reach a point where the disk becomes the bottleneck. We can use an mmap for this. Now, remember, we are talking about performance, and maybe you should not do those optimizations, but prove a C or C++ programmer and their intuition wrong. And then you see that the bottleneck might be solved elsewhere.
Now we are at 30 seconds by changing like four or five lines from the entire program, not the entire thing. We can keep using our abstractions. That's the main point. Here we use an mmap. That's a memory map in the kernel. We save on allocations. 30 seconds. Okay. What if we wanted to do more? It's hard to read, but now we reach the point where in fact the hash map is the bottleneck. And one more step to improve the code would be to split it up into multiple chunks. You can use rayon. You can now finally use a better hash map, like AHashMap. And we are down to 3.5 seconds. And we did that not by guessing, but by profiling. Now if we run a profile, it looks different again. Very different. These are the individual chunks that we managed to split up. We went from 40 seconds to three or four seconds in a couple of slides and with few changes. And the point is: don't guess, measure. This is the worst part that C developers bring into Rust. They think everything is a performance overhead. And if this challenge, by the way, looked very similar to the One Billion Row Challenge, that's because it was inspired by it. And it is very similar. Read it up. It's kind of fun. We did something similar for hotel data. But the more important point here is how can we fight premature optimization? Measure, don't guess. Focus on algorithms and data structures, not micro-optimizations. More often than not, if you change from a vector to a hash map, this will be way, way more efficient than if you remove your little struct and add lifetimes everywhere. You can get carried away pretty quickly, and Rust encourages you to do so, but it also has the tooling to fight it. Be more pragmatic. Focus on readability and maintainability first and foremost. Use profiling tools to make informed decisions. You've covered all of that. Your code is idiomatic. It is fast. You didn't overdo it. What is missing? Well, the entire rest. Do you have tests? Do you have documentation?
Is your API too large? Does your code lack modularity and encapsulation? These are things that I see from people that are like the lone wolf coders. They know all about Rust, but what they are not really good at is the rest: explaining the differences to their code maintainers, and writing documentation. Not about the how, but the what. What does your program do? Some things they say: "It compiles, my work is done here." "The code is documentation." "Let's just make it all pub." "I'll refactor that later," which never happens. Let's look at that code again. This is our first version, junior programmer, three hours. Okay. How do we test that? It's kind of impossible, because this is one big binary, one main. How would we test that? Well, I guess the question is what do we want to test? Well, first off, I would say let's add a test for parsing. The entire thing can be a very simple, crude test. But if we refactor it such that we have a function that parses cities, now we can start to introduce a path here and do the parsing. And this is where the parsing logic is, by the way. We split it up into a main and the parse_cities function. Great. This is our first test. Very crude, but we get to a point where suddenly we can test our changes. We create a temporary directory, we have a path, and then we write into a file, and that's it. The parsing is done. Great. If we wanted to make it a little better, instead of passing in a path, we pass in something that implements Read. Now we don't need to create files like here. Instead, we can have our input as a binary blob. And these are simple things. Add some documentation, add some tests. It's not that hard. And in order to fight omission, what you need to do is write more documentation, write unit tests, use tools like Clippy and cargo-udeps, set up CI/CD so that you can handle your changes, create releases, use release-please — Marco, greetings go out to you — and keep a changelog of what you changed. Right.
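The `impl Read` refactor described above might look like this sketch — the "city;price" line layout is illustrative, not the talk's exact format:

```rust
use std::collections::HashMap;
use std::io::{BufRead, BufReader, Read};

// The parser takes `impl Read` instead of a file path, so a test can
// feed it an in-memory buffer instead of a temporary file.
fn parse_cities(input: impl Read) -> HashMap<String, Vec<f64>> {
    let mut cities: HashMap<String, Vec<f64>> = HashMap::new();
    for line in BufReader::new(input).lines().map_while(|l| l.ok()) {
        if let Some((city, price)) = line.split_once(';') {
            if let Ok(p) = price.parse::<f64>() {
                cities.entry(city.to_string()).or_default().push(p);
            }
        }
    }
    cities
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parses_in_memory_input() {
        // no temporary files needed: a Cursor is our "binary blob"
        let input = std::io::Cursor::new("Brussels;10.0\nBrussels;20.0\nbad line\n");
        let cities = parse_cities(input);
        assert_eq!(cities["Brussels"], vec![10.0, 20.0]);
    }
}

fn main() {
    let cities = parse_cities(std::io::Cursor::new("Brussels;10.0\n"));
    println!("{cities:?}");
}
```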
We're getting towards the end. We have seen the anti-patterns. You know them now. I hope that you will be able to, you know, see them in your code. If you want to learn more, there are some other talks that were given here at FOSDEM and other places. You might want to check them out. Maybe I can put the slides somewhere. And that is all I have to say. Thank you. Thank you.
WASM 101: porting a Sega Game Gear emulator to the browser
So Anis Astier is going to tell us about Wasm 101, which is very nice. Thank you very much. Thank you. Thank you. A quick presentation. My name is Anis. This is not my first talk, but this is my first time here in the Rust Dev Room. You can find my social media here. Follow me if you want. I've been learning Rust for five years, on and off. I wanted a bigger project to learn a bit more about Rust, and I said, why not write an emulator? So I started this project. This is a Game Gear emulator. The Game Gear is this small device. I don't know if you've ever heard of the Game Gear. So yeah, it's a Sega handheld from the 1990s. So this is the name of my emulator: Gears, as you can see. It's written in Rust. It depends only on the standard library. It has a native UI. This is how it looks. It works. After I developed this native UI, I thought maybe I should port it to the web. To do that, I would need to use WebAssembly. So quick show of hands. Who here has never heard of WebAssembly? It's interesting. Who here has heard of WebAssembly but never used it? And who here has heard of WebAssembly and developed things with it? Oh, many people. Okay, quite interesting. So WebAssembly is kind of a new platform. You can think of it as a new platform, a new target to port code to. It defines a text and a bytecode format. It's a take on Java's "compile once, run anywhere", whatever your system. It works in the browser, where it's as secure as JavaScript; it's sandboxed. It also has many other use cases. You can run it on servers. You can use it in FaaS. It has many use cases. So I want to port my emulator. So there's this first level, which is: how do I build my code? How do I compile it? So let's go through this journey. How do you compile WebAssembly? I assume you know about Rust, but if you don't, usually you install Rust with this tool called rustup. You need to add a new target with rustup.
Then you also need this tool called wasm-bindgen, which will bridge your WebAssembly code with the JavaScript world and generate some things. So you use rustup, you build your code with the new target, and then you use wasm-bindgen to generate a directory with JavaScript. You serve that with an HTTP server, and that's how it works. You don't have to use wasm-bindgen directly. You can use tools that integrate wasm-bindgen and call it. There are many such tools; I have selected a few. wasm-server-runner, it comes from the Bevy community. You have cargo-run-wasm. You have Trunk, which is even higher level, and wasm-pack, which is from the Rust Wasm project. I won't go into the details. You can find the commands on how to run them here. I did a quick comparison of those tools, from, let's say, the lowest level tools to the highest levels. wasm-bindgen, everyone uses it. It's like the reference tool. Then you have a bit higher level tools, and then more opinionated tools like wasm-pack and Trunk. wasm-pack will generally be used to generate libraries that you can use from the JavaScript world, for example with NPM. Trunk will integrate even more things, like compressing your HTML assets and things like that. Now you know how to build. How do you run the code? You usually write a binary: you have a main function, and the entry point of your main is how it works. Or you can build a library, and usually you annotate your entry point with the wasm_bindgen(start) macro, and you say, okay, this function is my entry point, you start executing from here. We know how to compile. Let's continue porting our application and go to the second level of porting the emulator. For this emulator I've written, called Gears, for the desktop UI I only selected dependencies that work with WebAssembly, so the whole dependency tree was Wasm-capable. They work with the web platform. I have pixels, winit, cpal, and gilrs, which is for gamepads. We'll go deeper into that. They all support WebAssembly.
How hard can it be? It should be very simple. Well, it depends. For pixels and winit: pixels is a library to make a frame buffer, basically a frame buffer library that's GPU accelerated. So you can write pixels at coordinates, and then it will render that with wgpu. Pixels uses wgpu, another crate, to do the rendering. In order to work on the web, you need to enable the WebGL feature of wgpu. In the future, it will also use WebGPU, but that's another subject. The initialization of pixels is also different, because it uses winit, and winit needs to be initialized differently if you want to render your UI in a canvas in the browser. Last but not least, the initialization of wgpu is async. In my emulator, I had never used Rust async, so I needed to add that. I used wasm-bindgen futures to bridge the async world from Rust to JavaScript promises. To port the audio part, I used the cpal crate, which also works on the web. This is the reference crate to play audio. It needs a crate feature enabled as well. There were also some challenges, because on native the audio starts directly, but in a browser you can't start playing audio directly. That's actually a good thing, because it means you can't play audio in anyone's browser without interaction. So you need to have interaction: the user wanted to do this action. Another issue I had with the standard library is I used MPSC channels, and they don't work on the web platform. So I wrote a quick channel myself, because it was simple to do. There are other channels, in crates, that work on the web platform, but I preferred to implement something with no other dependencies. For time: usually, for synchronization in an emulator, you need to know about the current time. Just like for the channels, in the standard library the time APIs are not available on the web platform. So there are crates that do the bridge. I used the instant crate. You can also use web-time, which also works.
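That crate switch is typically a pair of cfg'd imports — a sketch, assuming the `instant` crate is declared in Cargo.toml for the wasm build; on native targets the cfg'd-out import is never compiled, so std is all you need:

```rust
// On wasm32, use the instant crate's Instant; elsewhere, the std one.
#[cfg(target_arch = "wasm32")]
use instant::Instant;
#[cfg(not(target_arch = "wasm32"))]
use std::time::Instant;

fn main() {
    let start = Instant::now();
    // ... emulate one frame here ...
    println!("frame took {:?}", start.elapsed());
}
```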
This is the code: the use statement, if it's for the wasm32 target, uses instant; if not, it uses the standard library import. For gilrs, which was very nice, there was no action needed in order to support working in the browser. Everything worked out of the box, except the gamepad API on browsers is not as mature as on native. So there is some rebinding to do. There are good reasons for that. For example, browsers don't want you to be able to fingerprint someone with the gamepad API, but then it means the bindings are not mature enough. Not the bindings, but the key bindings, which is something else. And then, during porting, I also had bugs that were entirely my fault. I used usize a bit too much, mostly because I like to index slices, and that's what you need to index slices. wasm32, as it says in the name, is a 32-bit platform. So I had overflows when multiplications and additions grew bigger than 32 bits. All these were caught because in my cargo project I had overflow checks enabled by default. And yeah, it worked well. I just replaced usize with u64 where needed. And that's it. So let's take a quick break and let's go through a demo of what it looks like. So, just for FOSDEM, I brought you a game, which is this one. I will lend it to you for a few minutes. It's a FOSDEM exclusive. I recommend you play this demo, not necessarily on mobile — it will work, but you won't be able to control it — so maybe more on a desktop browser, or anything that has a keyboard or gamepad controller. So I'll give you a bit more time to load it. It might not work for you if you don't have WebGL enabled in your browser, but otherwise it should. If you have Firefox or Chrome, here's how it looks. So I've loaded the page, I press play, and basically the emulator starts. If you have audio, it will play audio. And yeah, this is what you should see. Okay, yeah, it works. I can play it.
Who here successfully ran the demo? Just a quick show of hands, who managed to run it? Okay, thanks. Okay, let's continue. So we have this porting. Okay, it mostly worked. I showed you. It worked. There were a few tricks I picked up along the way. They're not mandatory, but let's see what we have here. First thing: if you're used to debugging like me with println, you print to the terminal, it probably won't work as-is in the browser, so you want to use the web console. There's this console_log crate which does the binding to the console. If you use the log crate, it's really well integrated, with the log levels and things like that. I also recommend that you use the console_error_panic_hook crate. This one helps show when your program crashes; for example, I showed you the overflow checks, it can panic. It will show you, basically, the panic in the console. That's how you register a console panic hook. Another trick I picked up along the way is the cargo config. For this demo I showed you, there's a bit of a problem with some interactions. Some APIs I use directly from Rust, through the web-sys crate, which allows accessing those APIs for this demo. In order to be able to access those APIs, which are considered unstable, you need to add an environment variable when you build, which is a bit annoying to add every time. You can add this in the Rust flags directly in your .cargo/config.toml. This way you can build with cargo build, and it will work. Another trick, if you are used to having VS Code or integrated development environments, you probably are using rust-analyzer. If you have code that works on multiple platforms, like me, for native and WebAssembly, you probably want to tell rust-analyzer to build for the two different architectures. This way you have completion on the WebAssembly part. This is done as well in the .cargo/config.toml, by specifying multiple build targets. When you build, you will have multiple build targets.
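Those two .cargo/config.toml tricks might look roughly like this sketch; the native target triple is illustrative (use your host's), and `--cfg=web_sys_unstable_apis` is the flag the web-sys crate expects for its unstable APIs:

```toml
# .cargo/config.toml — sketch combining both tricks

[build]
# so you don't have to export RUSTFLAGS by hand on every build
rustflags = ["--cfg=web_sys_unstable_apis"]
# build both the native host and the web target, so rust-analyzer
# also gives completions for the wasm32 code paths
target = ["x86_64-unknown-linux-gnu", "wasm32-unknown-unknown"]
```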
There are some drawbacks to that. It won't work in a workspace member; it must be at the root of your workspace. It also means that when you use cargo run, since you have multiple targets, cargo run will say: no, you have to pick one target in order to run. Which makes sense, but it can be a bit annoying. So let's move on to what I thought of this experience of porting this emulator; what's my feedback? I would say that in general it's very easy to port standalone code to WebAssembly if you're using Rust. I did not change anything in my app's architecture, and the total port took a few hours spread over a few days. As I told you, I wrote custom code for initialization and for the DOM interaction, which is the demo you've seen. To go a bit further, and this is what I won't cover in this talk: if you want to build a web UI, you probably want to use Yew or Leptos, because I don't recommend accessing the DOM APIs directly. It's very ugly and not really ergonomic; I did it so you don't have to try. Those library developers do a great job at that. I didn't try building a complete UI; as you saw, nothing is configurable, etc. I'm thinking of building a UI with Slint or egui, but I'm not really satisfied with the current status of their font rendering; I know it's something that's being worked on. Likewise, minimizing the WebAssembly binary size is not web-specific; there are many Rust tutorials you can find on size minimization. And I didn't do any performance measurements. I can tell you that it works, and it also works on native, but I don't have any special feedback on that. That's it for my presentation. Thank you. We have a question. Yes, I have a question. When you build websites today, they have to be responsive: you use media queries in CSS style sheets to adapt to different kinds of resolutions, so that on mobile, tablet, or desktop it still looks nice.
Can you also do this in WebAssembly, so that if I run the game in portrait or landscape mode, or on a bigger screen, it takes care of the resolution? Will it also scale the graphics accordingly? There are multiple aspects to that. If you're building a web UI, you probably do that with CSS. If you use Leptos or Yew, you will be able to generate HTML, whether on the server or on the client; then it's basically the same thing as web development, you have CSS and you style this HTML directly. This demo is an emulator, which is a bit specific, especially because it's a full-screen application: it basically takes the whole width of your screen, and that's it. That's how it works on mobile and tablets and desktops. But note that you can combine those approaches, and you can also do things in JavaScript or CSS. You can find tutorials in the Rust and WebAssembly book; you can look at the rustwasm guide and the rustwasm project, which is at this URL, to find information on how to bridge the two worlds. If you decide to use a crate, as I recommend, like Yew or Leptos, they also have a lot of documentation on how to do that. I understand. Maybe a more general question: why did you choose Rust? Did you also consider programming in C++? Are there any advantages of using Rust compared to C++? That's a great question. It was actually covered in other talks, but I like using Rust because it's a very nice language: it has nice ergonomics, it's fast and native, it has more safety guarantees than C++, and a great ecosystem. Thank you. You're welcome. Any other questions? I'm curious what your main loop looks like. Do you spend all the time polling for events? Do you get called back from the browser? Does the browser hang if you never sleep? That's a good question. I did not modify my main loop, mostly because I use winit; I use a winit event loop. This is specific to the winit crate: nothing was modified in the main loop.
It spins; I don't remember exactly how often, but basically once per frame, and then the display gets refreshed. Yeah, that's it. And that's all the time we have. Thank you.
Thunderbird: How to Exchange Rot For Rust
So, if I could have your attention. When we got this talk, I didn't know Rust and Thunderbird had a connection, so this is pretty exciting and pretty cool. Sean and Brendan are going to talk about how to exchange rot for Rust. Thank you very much. Hi, I'm Sean Burke. I am a senior software engineer at MZLA, which is the company that maintains Thunderbird, and this is my colleague Brendan Abolivier, who is a software engineer at MZLA as well. So we're here to talk about how to exchange rot for Rust. Our colleague Ikey Doherty couldn't join us, but I feel I need to shout him out, because we would not be giving this presentation without him, and I also have to applaud his pun in the title, because the project that forms the basis for this talk is Microsoft Exchange support in Thunderbird. We're working on adding support for the Exchange Web Services protocol. This is the first Rust component written specifically for Thunderbird; our code is based on Firefox, so there's Rust in there, but nothing specific to Thunderbird. And it's also the first mail protocol to be added to Thunderbird in Thunderbird's lifetime, which is a slightly strange statement, but I will explain that a little bit here. When we started this project, nobody actually knew how to add a new protocol to Thunderbird, and that gets into the "rot" part of the title a little bit. So first off, a little bit of history of Thunderbird. Thunderbird grew out of Netscape Communicator originally, as did Firefox, so a lot of the code in Thunderbird predates Thunderbird itself; the 0.1 release was July 2003, so this is a fairly old code base already. Starting around 2012, Mozilla started to hand over Thunderbird to the community, because it felt that Thunderbird wasn't self-sustaining under the Mozilla umbrella. That situation persisted until around 2017, when Thunderbird rejoined the Mozilla Foundation. So what does that actually mean for Thunderbird?
We had a pretty big gap in paid maintainership, and a community can only do so much. Thunderbird is a very large project; there's a lot of work to do just keeping up with building, and making sure that it follows Firefox's changes, since we're based on Firefox. That gap meant there was a long time where you couldn't expect the community to have a holistic view of the architecture of a huge project like Thunderbird; you can only ask so much of their time. And so changes were made without a view to how they would affect the architecture, or how the architecture played into things. There was also a loss of institutional knowledge, because the people who'd been employed to work on Thunderbird were no longer there, and there was nobody to take over for them. In a lot of places in Thunderbird, there hasn't really been any kind of architectural maintenance in over 20 years. That also means that large portions of the code base are written in C++; C++ has changed quite a bit over the years, and Thunderbird has not kept up. So this is a pretty significant challenge, but it also presents us with a pretty significant opportunity, and that opportunity is Rust. So we'll talk a little bit about why we decided to use Rust. This is a room full of people interested in Rust; I'm sure most of you are pretty aware of the major benefits. We're a large application maintained by a small team, and we take input from anybody who sends someone an email, so memory safety is pretty critical: we do not want security bugs giving anybody access to somebody's computer. Performance is also pretty big. We use a lot of JavaScript in our code, but for low-level stuff, JavaScript is going to have some performance issues. And then the modularity that Rust has built in gives us access to a pretty large ecosystem; there are a lot of people doing mail-related stuff in Rust, and we can benefit from that.
The next reason is that we are based on Firefox code, and Firefox already has Rust in it. So the build system is set up to integrate with cargo; we share CI infrastructure, which already has provision for Rust; and Firefox has something called XPCOM, a framework for communicating between the different languages that Firefox uses, which already has Rust support. Introducing a new language also gives us permission to rethink some of the aging ideas in Thunderbird. It allows us to move away from some of the more delicate code paths that have been changed ad hoc and special-cased throughout the code, where changing things is a little bit scary because you don't know what you're going to break. And I mentioned the loss of institutional knowledge: we need to rebuild that, which means a lot of documentation, and personally I love the documentation tooling that Rust provides; I think that helps a lot in moving forward. But as with any project like this, it's not just "okay, we're going to use Rust, cool, we're done, we're good to go": we had some problems getting started. Part of that is just that we have a large existing code base, which means we have existing patterns: a lot of idiosyncratic async stuff going on that doesn't integrate nicely with idiomatic Rust, and lots of features and capabilities already in the Firefox and Thunderbird code base which don't have any sort of Rust bindings, or sometimes somewhat painful Rust bindings. I mentioned XPCOM as a benefit, but it also became a bit of a drawback, particularly in terms of developer experience. Over the years, Firefox has excised a lot of the XPCOM interfaces, just because they can be a little bulky and a little painful to use sometimes, even in C++ and JavaScript. That work never happened in Thunderbird: we have many more uses, and heavier uses, of XPCOM than Firefox.
And so what works well for them in terms of developer experience doesn't work for us; it's really painful for us to use XPCOM at this point. I also mentioned the build system as a positive, but in a big way that became a drawback for us. Because Firefox has a C++ entry point and no single point of entry for Rust, there's a hack in place that builds a single Rust workspace and shoves it into the Firefox build. The problem for us is that we're built as a subtree of Firefox, rather than having Firefox as a subtree of our code, which is a little bit unusual, and Cargo doesn't like it when you try to have a workspace inside of a workspace. We're not in the same repository as Firefox, so we can't change their Cargo.lock and we can't change their dependencies. We solved this by basically taking their whole dependency tree, merging it with our own, building from within our code, and using a script to keep that up to date, hoping things don't break. So far, so good. With that, I'm going to pass it off to Brendan. So now we can use Rust in Thunderbird, we can build Rust in Thunderbird, we can run some Rust code in Thunderbird, thanks to that work to integrate it into the build system. What do we do with it now? To answer that question, it's good to think back to where we're coming from and what we're trying to achieve. Our end goal with this work is to be able to support Microsoft Exchange in Thunderbird. More specifically, we want to support something called EWS, which stands for Exchange Web Services; that's Microsoft's proprietary protocol for interacting with Exchange. That protocol is based on XML over HTTP, or HTTPS to be more precise. And that means we're missing a few key pieces of code infrastructure in order to make this a possibility.
First, we want to be able to send HTTP traffic, and preferably we want to send it through something called Necko. Necko is the networking component of Thunderbird, and since we already have a well-functioning networking stack, it would be a bit sad to completely bypass it. We want to be able to interact with Necko, and to do it in a way that is familiar and easy to use for Rust developers. Once we have the capability to send those requests, we also want to be able to fill them with the contents that we need, in this case XML. We need to figure out how to serialize and deserialize XML in a way that scales to a lot of data structures; to give an idea of the scale, EWS specifies about 100 different operations and about 1,700 different data structures. So let's start at the bottom of the stack, which is sending HTTP requests. Because we want to interact with a specific component within Thunderbird, we want to use XPCOM, which I mentioned; the acronym stands for Cross-Platform Component Object Model, and its job is basically to allow inter-component interaction by defining platform-neutral interfaces. That way we can cross the language boundary, which is good for us, because we want to write Rust code that interacts with Necko, which is in C++. So let's use that. Except that using XPCOM from Rust directly doesn't look very Rust-like: it's mostly designed around C++ APIs, so it doesn't have a lot of the features that we can find in Rust, and it means there's a lot of boilerplate. This is the code to send a single GET request and print the result to standard output. We need to define a bunch of callbacks, we need to define a bunch of different objects, and because we're crossing a language boundary, at the very bottom we need to wrap the actual call in an unsafe block.
None of that is very ideal, and we obviously don't want anyone who wants to use Necko from Rust to have to do that every single time they want to interact with the network. So let's split this into two sub-issues that we're going to solve. The first one is that we want to support Rust's native async/await syntax. The way we did this is we added a new internal crate to Thunderbird, which is actually the first Rust code to be added to the Thunderbird code base. The role of that crate is to translate asynchronous operations in XPCOM into Rust's native async. The way it does that is it defines a custom stream listener, which is that big struct that we saw earlier with a bunch of callbacks. What that stream listener does is buffer any incoming data and call wake on a waker when the request finishes. Then we can wrap that in another struct, which is in charge of triggering the asynchronous operation in XPCOM. That struct implements the Future trait, so it can check the state of the buffer every once in a while and return the result when the request finishes. In the future, we'll probably also implement the Stream trait, in order to be able to process incoming data incrementally, but we don't need that immediately, so we just went with Future for now. Now that we have this native async/await support, we want to build on top of it to have a way to write idiomatic Rust code that sends HTTP traffic. We do that with yet another internal crate, which provides a more idiomatic, reqwest-like HTTP client. It's not a one-to-one replica of reqwest, but reqwest was the main inspiration for this work.
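The buffering-listener-plus-future mechanism described above can be sketched in plain std Rust. Everything here (the names, the shape of the callbacks) is illustrative, not Thunderbird's actual internal API; the real listener is driven by XPCOM callbacks rather than direct method calls:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

// State shared between the callback-driven listener and the awaiting future.
struct Shared {
    buf: Vec<u8>,
    done: bool,
    waker: Option<Waker>,
}

#[derive(Clone)]
struct Listener(Arc<Mutex<Shared>>);

impl Listener {
    fn new() -> Self {
        Listener(Arc::new(Mutex::new(Shared { buf: Vec::new(), done: false, waker: None })))
    }

    // Called by the callback-style I/O layer as data arrives: just buffer it.
    fn on_data(&self, chunk: &[u8]) {
        self.0.lock().unwrap().buf.extend_from_slice(chunk);
    }

    // Called once the request completes: mark done and wake the task.
    fn on_stop(&self) {
        let mut s = self.0.lock().unwrap();
        s.done = true;
        if let Some(w) = s.waker.take() {
            w.wake();
        }
    }
}

// The future side: polls the shared buffer, parking its waker until done.
struct ResponseFuture(Listener);

impl Future for ResponseFuture {
    type Output = Vec<u8>;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let mut s = self.0.0.lock().unwrap();
        if s.done {
            Poll::Ready(std::mem::take(&mut s.buf))
        } else {
            s.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}
```

The key point is the handshake: the listener side only ever buffers and wakes, while the future side only ever inspects state and registers a waker, which is exactly the translation from callback-style async to Rust's poll-based async that the talk describes.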
Under the hood, that crate is in charge of creating and configuring all the necessary XPCOM objects and wrapping them into our future, and it also provides more Rust-idiomatic error handling, because standard XPCOM does its error handling with plain error status codes, which isn't the best we can do in Rust. So that's all nice, but what does it look like? Let's do a demo; we're going to do a live demo, because we don't like to live safely. So here is some code that lives on my local checkout of Thunderbird. It has a bunch of plumbing to hook it into XPCOM for the next step of the demo, but the important bit is what we can see here: with a client from our HTTP crate, we can create a POST request, set a custom header on it, set a custom body, send it, and natively await on it, and then we can process the response, or the error, as the case may be. We're going to run this code in a local Thunderbird, which apparently crashed while I was preparing the demo. Let me just do... So this is the Thunderbird DevTools. It might look familiar, because it's the same DevTools that Firefox uses; we use it to work on the front end of Thunderbird and to access some internals of Thunderbird when we need to. So we're going to instantiate that XPCOM plumbing I was mentioning. It's basically just a dummy interface that has one method to do the thing, which in our case is sending an HTTP request. We can see that we successfully sent a request through Necko; we know that because it appeared in the network tab, which means it went through the Thunderbird networking stack. If we inspect the request, we can see that it did include our custom header, it correctly attached the right content type, and it also correctly set the right body on the request. And to confirm that once more, the server... that's just a simple, stupid server that I quickly wrote in Python. Sorry for using Python.
...which just takes that custom body and that custom header and prints something. Right, so that works. Now, where do we go from here? We have requests that we can send, and we can process the response to a request, but what do we actually put in that request? As I mentioned, we want to put XML in it, to be able to communicate with Exchange servers. So we started with a kind of exploration, a lay of the land of the state of deserializing and serializing XML in Rust, and we quickly identified that most crates we could find had some existing issues: either they don't provide a good way of handling attributes and namespaces in XML, and/or they're very boilerplatey. That's fine for deserialization, because we don't necessarily need to process every single attribute and/or namespace from the response. But for serialization, it's not really something we can accept, because obviously, if you omit a required attribute or something like that, the Exchange server is not going to be able to understand the request. And we not only want but need a low amount of boilerplate in our code, because EWS defines a lot of operations and data structures; as I said, dozens of operations and more than 1,000 data structures, so any amount of boilerplate is just going to make the code ten times more difficult to maintain. So we decided to create a new crate. This time it's not tied to any Thunderbird internals, so it just lives on GitHub. In this crate, we basically leverage procedural macros in Rust to generate, at compile time, implementations of a trait that we also define. Almost everyone in this room will just say: yeah, that's just a derive macro. But I'm fairly new to Rust, and when I saw that, I thought it was pretty cool, so I wanted to mention it. We don't want to reinvent the wheel.
So we built it on top of quick-xml, which provides some pretty nice tools for writing and formatting XML, and we tried to design it with the low-boilerplate approach that we need. So what does this one look like? This is a kind of dummy data structure that I defined, and as you can see, I was thoroughly uninspired with the naming, but it showcases some of the features of this crate. We can set namespaces, either default or custom ones; we can set namespace prefixes; we can instruct a field to be serialized as an attribute; we can flatten some structures. Then all we need to do is actually populate our data structure, serialize it, and in our case just print it to see what it looks like. And if I run this, it generates valid XML that matches the data structure we defined here. So that's a lot of useful code infrastructure that we now have for our Microsoft Exchange implementation. Where do we go from there? Obviously, the next step is to implement the damn thing: implement protocol support for EWS in Rust, and hook it into the Thunderbird UI to expose it to our users. We also want, if there's enough interest, to generalize the XML struct crate, the one on this slide, because at the moment it's fairly designed around the use case of EWS in terms of configuration and defaults and things like that; if there's enough interest, it might be something we look into in the future. Another next step is that we might also start working with the people from the XPCOM team and the Firefox developers to try to improve the situation around Rust bindings for XPCOM, and make them just, well, nicer to use for Rust developers. So that's where we are and where we're going, and thank you for listening. Thank you. I think we have quite some time for questions, if you have them. Yeah. Well, as I make my way over there, one question I had.
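As a rough illustration of what such a derive macro expands to (a trait implementation that emits a namespaced element with attribute fields), here is a hand-written sketch. Everything in it, the trait name, the struct, the namespace URI, is invented for illustration and is not the real crate's API or output; the real crate emits through quick-xml rather than string formatting:

```rust
// Illustrative only: roughly what a derived XML serializer generates.
trait XmlSerialize {
    fn serialize_xml(&self) -> String;
}

// A struct whose fields are marked (via derive attributes, in the real
// crate) to be serialized as XML attributes on a prefixed element.
struct FolderId {
    id: String,         // -> the Id attribute
    change_key: String, // -> the ChangeKey attribute
}

impl XmlSerialize for FolderId {
    fn serialize_xml(&self) -> String {
        // Namespace prefix `t:` bound to an (invented) namespace URI.
        format!(
            r#"<t:FolderId xmlns:t="http://example.com/types" Id="{}" ChangeKey="{}"/>"#,
            self.id, self.change_key
        )
    }
}
```

The value of the derive macro is that, across EWS's 1,000-plus data structures, none of this repetitive per-struct code has to be written or maintained by hand.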
If the protocol support is in Rust, do you think it's possible that it could be shared with other email clients? Yeah, this is one of the things we're trying to keep in mind. One good example: you might have heard that a few years ago we welcomed the K-9 email client on Android into the Thunderbird family, and if we're building new protocol support for the desktop application, we would like, in the future, to potentially include that support in K-9 slash Thunderbird for Android. So this is definitely one extra reason we decided to go with Rust: the ease of reusing Rust code across multiple platforms. And we are going to make the EWS crate public as well. I'm going to repeat that because I have a mic: we're going to make the EWS crate public. And also, we're trying to build it in a way that is fairly agnostic to the actual desktop application.
Know Your Ingredients: Security Starts With the SBOM
Okay, great. Good. All right, so welcome everybody. Thanks everyone for joining the hottest room at FOSDEM. My name is Steven Chin. I'm VP of Developer Relations at JFrog, and I'm going to talk about a lot of different projects which help secure the open-source supply chain, about why we need security, and about a bunch of different security incidents, both historical ones and newer ones which you probably haven't heard about, with a lot of new research going on. And hopefully we can all help to improve the open-source supply chain together. So I think a great analogy... can you guys in the back hear me? Okay, good. I think a great analogy for the software supply chain, and how we think about it, is to compare it to our food supply chain. We know that the way you get great cooking is by starting with great, fresh ingredients: having things which you know are safe, which come through the food supply chain without people interfering with them in the middle or failing to follow good hygiene practices. When you have an issue with your supply chain, you end up with spoiled ingredients and kitchen disasters. Anyone here seen the Gordon Ramsay Kitchen Nightmares show? Okay, a lot of good fun. And these are not the free-range chickens you're looking for. We're hoping we can get better quality and better security out of our software supply chain, so we can build enterprise applications which are hopefully very difficult for attackers to exploit. And this is how the USDA looks at the food supply chain, creating a healthy supply chain, but it's somewhat analogous to software. You have production, the farms, which is analogous to the people producing software. You have distribution and processing, so it goes through a bunch of different toll gates and different people in the process.
Eventually it ends up in a restaurant or a retail location, and then you have home users or restaurants or other folks cooking the food. If at any point in this process you have issues with quality, if you have infections, if you have bacteria entering into it, then that results in potential issues on the consumer side. So when we're looking at the software supply chain, we need to look at it through a different lens, and I think a good lens is SLSA, which is one of the OpenSSF standards. It really focuses on getting attestations for the different parts of the build that your software has gone through, figuring out at each of these gates: is the source control secure? Have you done the right things with code reviews? Have you gone through the right processes with builds? When you have all of this information about the build, then you can figure out whether you are actually secure. And a key ingredient to knowing this is the case, and this is why we're all here in the software bill of materials room, is that you need that final index of your ingredients, which can show you everything from end to end. SLSA and the SBOM standards, both SPDX and CycloneDX, go really well together, because this way you have the attestation of what's happened in your build and your artifacts, and then you can put that together into a single document which shows you everything that verifies the components, along with the potential issues you might have with them. And if you're not following these good practices in how you build software, how you get provenance of your software, and how you attest to it, then you end up with issues like, for example, the Log4Shell incident. I think this is by now infamous.
It sparked a whole second round of government security concerns over open-source software, and really the challenge for big organizations trying to address the Log4Shell incident in production was: are my production systems affected? It depended on the version of Log4j you were using; it depended on whether you were using just log4j-core or the full set of libraries. The answer for most organizations was: well, I don't know if I'm affected in production, so I'm just going to patch everything. That's very expensive and very difficult to do, and when you have libraries like this, used so widely across the entire ecosystem, it's quite challenging as well. And I think what really started a lot of the government concern around the software supply chain was an earlier incident, the one that sparked the Biden administration's legislation around this, which was the SolarWinds incident. A very different sort of incident, because this one was a true software supply chain attack, in the sense that they specifically attacked the build system. SolarWinds was using TeamCity, and the attackers got in right before the certification, the signing of the artifacts, happened. So to the downstream people SolarWinds was supplying, the software looked like it was signed and certified by the company and not malicious, but in fact the attackers had done a very good job of infecting it before it was signed. We'd like to prevent these sorts of attacks from happening, because they cause a lot of damage: they can give malicious entities access to information, they can cause privacy issues for consumers, and they cost a lot of money. According to IBM's data breach report, in 2023 the average cost of a breach was over USD 4.45 million, a 15% increase over three years. So this is a huge issue, and it continues to get bigger for us as a software industry. Okay, so let's talk about some additional incidents.
So, which one of these is your package? When we're talking about delivering libraries and dependencies, the majority of software relies on open-source components, and it relies on leveraging them because we don't want to write the same code again, and it's actually more secure to leverage open-source libraries that have been peer-reviewed, that have been patched, that are keeping up with the latest standards. But what if you can compromise the systems in the middle which are supplying this information? The dependency confusion attack basically relies on the fact that a lot of companies, organizations, and open-source projects use some sort of package manager as a middleman: they'll set up repositories which pull from upstream or from local corporate repos. If you can get the internal names of the corporate packages (this is an example from Yelp), then what you can do is upload those names to npm or other public repositories, essentially spoofing these libraries. So as a developer, or as a CI/CD system, you're going through a potentially vulnerable package manager, and rather than getting awesome-corporate-lib 1.2, the latest version of your company's internal library, it goes and sees: aha, there's a newer version in a public repository, I'm going to serve that up instead. And as you know, bad things happen when kittens get access to nukes; we don't want this to happen in our supply chain. Fortunately, all of the commercial package managers, including my company's Artifactory, are now patched for this: by default, they will not go out to a public repository if the package exists in a local repository, which blocks that attack upstream. But Alex Birsan, who did this exploit, was very creative. He took an attack which was theoretical at the time; nobody had actually exploited it.
He attacked Google, Facebook, Apple, a whole bunch of companies, simultaneously claimed about a dozen bug bounties, and ended up getting $130,000 USD for his effort. So you see, maybe instead of helping secure the supply chain, there's a more lucrative path. But researchers like him are also helping to expose the potential issues in the supply chain in a way that doesn't introduce threats, right? This is white-hat hacking, and we need people like this to find the exploits. It also helps guide us on what we need to do for new standards in SPDX, and for implementing things like VEX, to make it easier to figure out what the vulnerability scope is. So I think these sorts of attackers are actually helping the ecosystem a lot. Now, another food example. If you have a recipe that calls for a specific type of rice, for example if you're making a risotto, you wouldn't want to use a mixed-grain rice; you need a specific type of rice. And this is something else attackers make a lot of use of in the supply chain. Another common type of attack is called typosquatting, and a variant of it is leaving off namespaces. As an example, our research team found an attacker who released to npm a whole bunch of libraries mimicking Azure packages, where they just left off the Azure scope prefix. So if you were a lazy developer and just typed in the package you wanted, leaving off the namespace, you would get a malicious library instead of the actual library you wanted. A very clever attack. And the way they did it inside npm is that they had a random account generator which created a unique account for each of the libraries they uploaded, so it also wasn't easy to systematically say: oh, this is a bad entity, I'm going to block them. They managed to spread out the attack.
They did it on 280 different packages across the @azure, @azure-tests, @azure-tools, and @cadl-lang scopes, and then they could install any software they wanted on the victim's computer; basically, it was set up for potentially exfiltrating data from personal machines. Our security research team found this, we reported it to npm, they took all the packages down, and then we publicly disclosed it. Later on, a security research firm claimed that they were just testing out npm, so this was a company testing the waters, and there wasn't actually a malicious payload in any of the packages yet. But it had a lot of potential for that, and the security research firm wasn't exactly upfront about what they were testing either. Okay. And then, of course, if you're serving food, you want the ingredients to be very fresh, right? You can't make gourmet food if you start with a pile of rotten ingredients. When we're looking at the software supply chain, a good analogy for this is the somewhat infamous picture of a stack of more and more things, with very small, fragile components nested inside the supply chain, where if you pulled out the banana, suddenly your whole supply chain falls apart. And I think a great classic example of this is the left-pad incident. Basically, there was a package published on npm, by the author of the kik package, for doing left padding. Not a lot of code, so it's not something that's hard to write, but as developers we are very, very lazy: if you can possibly save a line of code by including a dependency, of course you would do that. Then the kik name was claimed by a company which wanted to own it. npm sided with the company, and the publisher of kik got upset about this and pulled all of his packages down, including left-pad.
Later on, Cameron published an identical version of left-pad to solve this problem. But this is the source code which caused this huge incident, and for such a trivial piece of code, it's not worth including a library dependency, a potential vulnerability. So this is, again, a huge threat. Now, one of the ways you can find out what all your dependencies are, and figure this out in a visual way, is using GUAC. This is a project which just got added to the OpenSSF suite. What it does is give you a visualization of all of your dependencies, lets you see exactly what you're using and how you're importing it, with some nice visualization on top. And I think using things like this helps you figure out what your risk is, what the potential scope of your application is, and how vulnerable you are as a project. So everyone knows Coca-Cola and its very secret recipe, right? The secret recipe is locked in a vault, very secure; nobody actually knows exactly what is in Coca-Cola, that's their trade secret. I think we pretty much all know what's in it now, but there's this aura of mystery about the recipe and the history behind it. So how do we as software developers, or as open source projects, keep our secrets? And the reality is, we do a very bad job of it. This is all of the exposed secrets in different central repositories, which we found by scanning npm, PyPI, RubyGems, crates.io, and Docker Hub. Obviously Docker Hub is the biggest repository, with large containers which contain a lot of other software. There was just a humongous number of secrets exposed: 5.78 million. But even the software repositories like npm had 1.16 million, and PyPI had 0.43 million. So there's a lot of accidental exposure of secrets in open source repositories. This is yet another attack vector by which attackers get into open source projects, and it allows them to attack the CI/CD infrastructure and cloud accounts which the projects are using.
And there are even often accidental leaks of corporate secrets inside of open source repositories, because as a developer you're working in the daytime on your corporate projects, and then evenings and weekends you're working on open source projects, and there's a certain amount of crossover there as well. So here are the top mistakes to avoid in your own project. The first is not using automation to check for secrets exposure. Using something like TruffleHog, or a commercial scanner like Xray, lets you scan your packages before you check them in, to make sure you don't have exposed secrets. This is how we found these: we basically ran our tooling on top of the central repositories to find exposed secrets. The second is generating tokens with broad permissions that never expire. You always want tokens scoped as small as possible in terms of what they can do, and with expirations in a reasonably short time frame, so you're rotating keys at the right times. The third is no access moderation for the secret. Putting it inside some sort of service like HashiCorp Vault or Docker secrets will help protect your secrets and tokens. The fourth is trying to fix a leak by unpublishing the token. This is a really, really common mistake: you can't simply check in a new revision which deletes the token, because Git has a long history and it's going to remember it. Now, if you followed point two and you have very short-lived tokens with a very small scope, that limits the damage, because by the time somebody finds it, it's likely not useful anymore. But again, a big mistake; you actually have to go and rotate the token to fully mitigate the issue. And of course, there's exposing unnecessary assets publicly. We saw a lot of cases where test libraries and other code which was not the main library code exposed secrets that were visible to infrastructure.
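Mistake one above, not scanning automatically, is the easiest to fix. This is a deliberately minimal sketch of what a secret scanner does; the two regexes are illustrative token shapes and nowhere near the coverage of TruffleHog or a commercial scanner, and the key in the example is a made-up string:

```python
# Minimal secret scanner: match well-known token prefixes in text.
import re

PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[0-9A-Za-z]{36}\b"),
}

def scan_text(text):
    """Return the names of secret patterns found in the given text."""
    return sorted(name for name, pat in PATTERNS.items() if pat.search(text))

findings = scan_text("aws_key = 'AKIAABCDEFGHIJKLMNOP'")  # fake key
```

Running something like this in a pre-commit hook or CI step is what catches the token before it ever lands in Git history, which is the only point at which unpublishing still works.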
And in some cases, it looked like the test code or the other sidecars beside the main code base were not even meant to be published; they were more internal code. Okay, so to safely use open source, we also need standards. If you've ever gone to a restaurant, this is really common; in New York City they have letter grading on restaurants, a kind of review of the source. And I think a great way of doing this for open source software is the OpenSSF Scorecard project. Basically, this gives you nice tooling for Git and a command line. It'll analyze your project and give you a score, and it's kind of up to you to interpret the score for the different things it analyzes. But it tells you about code vulnerabilities, maintenance, continuous testing, build risk assessment, source risk assessment; a wide set of different things about your project. It helps you figure out how much risk is in your project, but also, more importantly, how much risk is in upstream projects. Because if you depend on projects which are vulnerable, then your project itself is vulnerable. Okay. Given that we're in 2024 and clearly the machines have been taking over, this wouldn't be complete if we didn't talk about what's happening with the security of machine learning models and some of the code we're leveraging to make better use of AI infrastructure. And unfortunately it's not looking that good for us so far. The ML models which we all use and publish to public repositories like Hugging Face are highly vulnerable, and we're already seeing a bunch of attacks against these public repositories, with malicious actors injecting payloads into them.
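One way such model payloads work is through serialized-object execution. The sketch below uses Python's pickle, which several ML model formats build on, rather than the H5 format discussed next; the payload here is a harmless stdlib call, where a real attack would name something like `os.system` and a shell command:

```python
# Why loading untrusted serialized models is dangerous: __reduce__ lets a
# pickled object name any importable callable to run at load time.
import os.path
import pickle

class Payload:
    def __reduce__(self):
        # pickle will import os.path and call basename("/tmp/pwned") when the
        # blob is loaded; a real attack would use os.system instead.
        return (os.path.basename, ("/tmp/pwned",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # the callable runs just by loading the "model"
print(result)  # → pwned
```

Nothing in the loading code opted into running anything; deserializing the file was enough, which is exactly the property attackers exploit in model repositories.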
And it's not very hard to do: the H5 format, which Hugging Face models commonly use, actually gives you the ability to put inside of it what is basically executable code that sits alongside your model. Attackers have figured this out, so from the moment you load the model, they can run code on your system. So as a developer, there's already the possibility that simply using models from Hugging Face and other public repositories could expose your development environment to risks. And this is an example of the Base64-encoded payload; you can run whatever you want inside the model. Another attack for injecting malicious packages is exploiting generative AI. If you're using technologies like ChatGPT and other generative AI tools, what they'll often do is suggest packages that you should use as part of your code. And AI algorithms are prone to hallucinations. Hallucinations are actually quite predictable: a lot of the standard code queries people ask for will include perfectly valid dependencies, but they'll also include fake dependencies, packages which don't exist in npm, PyPI, etc. So hackers have already figured out that by uploading malicious packages under the names of the libraries the generative AI is suggesting, you can effectively cause people using ChatGPT to execute malicious code. Another potential exploit, and now even the AI is introducing vulnerabilities into your code. Here are some examples of perfectly reasonable queries, for example requesting an endpoint that returns file contents. This generated code is vulnerable: if you now pass a couple of ../ sequences, you're going to end up in other directories, and you're going to get access to files you shouldn't.
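A correct defense for the traversal bug just described is to resolve the requested path and verify it stays inside the allowed directory. This is a minimal sketch, not the code ChatGPT produced in the talk; the base directory is a made-up example, and it assumes POSIX-style paths:

```python
# Path traversal defense: resolve the real path and check containment,
# which also defeats ../ sequences that were URL-encoded before decoding.
import os

BASE_DIR = "/srv/app/files"  # hypothetical directory the endpoint may serve

def safe_path(user_input, base=BASE_DIR):
    """Return the resolved path, or None if it escapes the base directory."""
    candidate = os.path.realpath(os.path.join(base, user_input))
    root = os.path.realpath(base)
    if os.path.commonpath([candidate, root]) != root:
        return None  # request tried to walk out of the base directory
    return candidate
```

The key point is checking the *resolved* path (after `realpath` collapses `..` and symlinks) rather than pattern-matching the raw input, which is where naive filters fail.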
And now if we ask ChatGPT again, okay, give us a secure endpoint that returns a file for user input and prevents directory traversal, it gives us a more complicated example, but this is still exposed to URL-encoded exploits. So as developers, we can't really trust the current generation of code-suggestion algorithms to give us secure code. And the attackers know this, and this now makes a very easy class of security vulnerabilities which are likely to get injected into open source projects and other work, simply by the fact that it's being recommended. And here's something we're going to be publishing soon, so you're getting this before the official publication. Basically, what we did is we went into Hugging Face, Kaggle and some of the other public repositories, and ran our malicious-package detection to figure out what the current exposure of developers in the ecosystem is. We found over 60 models which contain malicious behavior. We analyzed the payloads; some of them were not truly malicious, but some of them were, and they basically allowed the attackers to run code in local environments. I believe we're scheduled in another week or so to publish the results on the JFrog research blog, but we're of course doing the right disclosures to Hugging Face and Kaggle so they can take the models down before people actually get exploited. And I think building awareness of these sorts of attacks helps the entire open source security ecosystem, because we're the ones, both in this room building software bill of materials standards and in the general open source security space, who have to figure out solutions so these sorts of attacks don't become the next SolarWinds. Okay, so you can find a little bit more about the research I've been talking about with the JFrog research team at our research blog. This isn't our commercial blog; just the research guys publish here.
So it's all the fun stuff. And hopefully together we can create a more secure software supply chain. So thank you very much for having me in the software bill of materials room today. Okay, if you guys don't mind, I want to do a quick selfie with the audience. So what's a good security sign? Log4j, Log4j! Okay, let's give a thumbs up for Log4j. Cool. Alright, thanks everybody for joining. And I think we have five minutes for questions if folks want to ask questions, or if you need a breather, because this room is very hot, feel free to leave the room as well. Any work on combining SBOMs with stored secrets and verification, things like that? I think that's a good question. I don't know if there's any work going on now about including secrets findings as part of SBOMs, but maybe that's a good addition for the standards. Yeah, thank you. Yeah, so the question is what kinds of vulnerabilities Xray handles. I would say we're clearly in the application security department, AppSec. We find malicious dependencies. We do secrets detection, like I mentioned. We can actually build SPDX and CycloneDX files with both regular vulnerability info and also the new VEX standard. We don't currently do anything with runtime security, although that's coming. Our package manager, Artifactory, is open source; Xray is proprietary. Yeah. Okay, so Kay asked if I've looked at any of the stuff that's happening in AI for SPDX, AI and data. So I know about the working group that's collaborating on this stuff, but I haven't looked at any of the new stuff yet. But I'm very interested to see what you're doing. Okay, we'll do. Okay. Thanks everybody.
Make your software products trustable
Hello everyone. Thanks for coming. My name is Dejan. Unfortunately Marco couldn't be here today; he got a call, but yeah. What I want to talk about today: we saw a lot of sessions today about producing SBOMs and producing the data, and very little, I think only Philippe's session, was about actually managing the produced data, right? So the challenge we try to tackle with the Trustification project is how to take all this data that is currently being produced by more and more organizations, like SBOMs but also VEX files and more and more advisory data, and get it into some kind of manageable system. Because without that, the information is just a bunch of mostly JSON files spread all over the place, right? So what we try to do is provide a system that takes all this data, puts it into a system that is searchable and queryable, and actually gets us actionable information: making software development more proactive in managing security, but also making it much easier to respond to security issues. And yeah, as I said, this got us to start working on the Trustification project, which basically set these goals for itself. Being able to ingest and store all kinds of SBOMs and VEX documents, for open source but also proprietary company products. Discovering, for those ingested SBOMs and VEXes, all the new vulnerabilities and advisories related to the packages inside of the SBOMs. Being able to explore and search that information, but also creating an API that can be integrated into other systems, so we can share this information with the rest of the developer toolchain, like IDEs and CI/CD tools. So ideally we would want to mark all the vulnerable dependencies directly in the developer's IDE, and also, for example, fail builds that try to build software containing dependencies that are known to be vulnerable.
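The build-gating goal above reduces to a lookup once SBOMs and advisories live in one queryable store. This is an illustrative sketch of that core idea only, not the Trustification API; the product names, package versions, and the advisory ID are all invented:

```python
# Once SBOMs and advisories are indexed together, a CI gate is a set lookup.
# All names below are hypothetical example data.

SBOM_INDEX = {  # product -> set of (package, version) from ingested SBOMs
    "shop-frontend": {("left-pad", "1.3.0"), ("aiohttp", "3.7.0")},
    "billing-api": {("aiohttp", "3.9.1")},
}

ADVISORIES = {  # (package, version) -> advisory IDs (made-up identifier)
    ("aiohttp", "3.7.0"): ["EXAMPLE-ADVISORY-0001"],
}

def vulnerable_components(product):
    """Return the advisories affecting a product's ingested SBOM."""
    return {pkg: ADVISORIES[pkg]
            for pkg in SBOM_INDEX.get(product, set()) if pkg in ADVISORIES}

def gate_build(product):
    """Fail (False) if the product contains known-vulnerable dependencies."""
    return not vulnerable_components(product)
```

In the real system both sides of the join come from ingestion pipelines and the query runs over a database, but the shape of the question a CI tool asks is exactly this.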
When we started to do this, around this time last year, we also figured out that there is another open source initiative that revolves around similar ideas, called GUAC. It was mentioned in the previous session as well, and I will cover it a little bit more here. GUAC stands for Graph for Understanding Artifact Composition, and the idea is to be able to ingest all different kinds of artifact documents, like SBOMs and VEX files and advisory data from all kinds of sources, and basically create a graph ontology of that. At first we started just experimenting with a graph database, but today the ontology is based on the GraphQL API and can be implemented by multiple persistence backends. That's the left side; on the right side of the graph we also want to be able to query all this data. So GUAC should be able to provide us with all the answers about what the dependencies in my SBOM are, and how these dependencies correlate with each other, what depends on what, so it's easy to find the whole dependency tree of your project, but also to attach to a particular dependency all the vulnerability, advisory and VEX data that we can find in additional systems. This is the basic architecture. Let me just see how much time I have here. But I basically explained it with the previous graph: we can collect documents from different sources, we can certify them against different sources like OSV or deps.dev, and get it all through the GraphQL API ontology into a database. The two currently supported databases today are Postgres, the relational database that we use, which basically works just fine, and an ArangoDB back end, which is a pure graph back end. And then on the other side, we provide the GraphQL API to be able to query that, plus a bunch of CLIs that can extract the data from the system. So in the Trustification project we try to provide a little bit more functionality on top of that.
First of all, we want to be able to not just ingest all the data about different relations into the database, but also to provide a central place to store all your documents for the organization. So it provides S3-compatible storage for storing and ingesting all the company's data into a single place. That can be an S3 bucket in AWS, but for local deployments it can also be some kind of MinIO instance. It has what we call walkers for different kinds of CSAF repositories, so that we can automatically ingest SBOM and VEX files, and then it provides what we can see on top and on the bottom: what we call a single pane of glass, a nice UI to be able to search all this data that we have, but also the Exhort API, as I said, for integrating the system into the rest of the developer toolchain. So there's a nice VS Code plugin that can work with Trustification today and automatically get all the dependencies from the project and flag vulnerabilities if they're found in the system. So I thought to do a little demo; let's see how it's going to work. Here we can see the UI with some pre-loaded data, and we can see that we have what we call six products here, which are actually six SBOMs that are already ingested in the system, and a large number of CVEs that have been collected from multiple sources. We can see that we identified around 2000 packages for these SBOMs, and most importantly, from the VEX files ingested here, we identified 29 advisories for them. So if we go to a certain product, we can see a couple of pieces of information obtained from the SBOM: we can see the basic metadata that we have, and usually we can see all the packages and how they relate to each other.
I think this SBOM is pretty flat in structure, so there's not much dependency nesting going on there, but the most important thing is that we can see the different kinds of advisories raised against it, and also immediately see which actual packages are affected by these advisories. We can go back and forth through this system: we can go to the actual package and see that it's affected by this vulnerability, and we can also go from the package and find the SBOMs it belongs to, the SBOMs or the product. But what we also provide is that nice search capability: maybe at some point you don't remember the exact vulnerability you're looking for, so you can basically just do a full-text search and find that there are packages related to it, but also find the exact vulnerabilities that we talked about a little bit earlier. So this is just a basic demo, right? I have a little bit more time, so let me explain what the challenges were for us, and I think we heard about these challenges in a lot of sessions. It's mostly still early adopters everywhere; tools are immature, including the project I'm working on, so we definitely don't consider it mature. But there's also a lot of inconsistency in the data wherever you look. We heard today about all the multiple competing formats in the SBOM space, and all the work that people are doing to bring them closer together over time, which I think is awesome. We also heard a nice discussion about all the different kinds of identifiers, and you can see: if you work with only one source of data, it's easier, but if you try to correlate this SBOM with this VEX file, and this SBOM is using purls while those are CPEs, it becomes impossible to correlate the data and build the graph properly.
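The purl-versus-CPE correlation problem starts with being able to pull comparable fields out of each identifier. Here is a minimal purl parser covering only a subset of the spec (no qualifiers or subpath handling), just to show the fields you would try to match against a CPE:

```python
# Parse 'pkg:type/namespace/name@version' into comparable fields.
# Simplified: ignores purl qualifiers (?...) and subpaths (#...).
from urllib.parse import unquote

def parse_purl(purl):
    assert purl.startswith("pkg:"), "not a purl"
    rest = purl[len("pkg:"):]
    rest, _, version = rest.partition("@")  # %40 keeps '@' in names escaped
    parts = [unquote(p) for p in rest.split("/")]
    ptype, name = parts[0], parts[-1]
    namespace = "/".join(parts[1:-1]) or None
    return {"type": ptype, "namespace": namespace,
            "name": name, "version": version or None}

parsed = parse_purl("pkg:npm/%40azure/core-http@3.0.0")
```

Even with both identifiers parsed, matching still needs heuristics, because CPE vendor/product strings rarely line up one-to-one with purl namespace/name, which is exactly the difficulty described above.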
Also, what we found is that even though all these things are standards, there are a lot of unwritten rules in all the organizations about how they present their data. So the documents will parse, but what information you actually get from the document really depends. So I think it's good that you're all here, and there's a lot of things to do, because it's early days. For the project itself, we'll try to additionally simplify the architecture and the deployment model. We're all about microservices and Kubernetes for now, which is okay, but I think we could reach many more people by simplifying how many resources it needs and where a project like this can be deployed, and by supporting more standards. You saw here just basic searches and basic correlation; I think once we have much more data in the system, we can get much more insight from all this data and provide that, which in my opinion is the value of the project, and continue working on the future integrations. Because in my mind, if we keep doing this right, at some point in a couple of years all this infrastructure should be invisible to developers: it should just be part of your developer toolchain, automatically working in VS Code, in all the CI pipelines and everything. So we are just beginning. That's it. The Trustification site doesn't have too much data, as you'd expect of an immature project, but there's a dev sandbox that you can try, the code is there, and we're always on the Matrix channel, so if you're interested, please reach out, and yeah. I'm going to ask the question: are you using the SPDX libraries to help with the ingestion? So the question is, are we using existing SPDX libraries; yes, we are. There's a Go one used in GUAC, but there's also a Rust one used in Trustification itself, because they are good, yeah.
So, why is it that you decided to start a project from the ground up, instead of helping at least one of the four or five big open source projects that already do mostly, maybe 90 percent of, what you are doing today? Why not help one of those instead of creating a new one? So, why are we starting a new project instead of helping others: first of all, we joined the GUAC project, which is also another new project. But beyond that, I can't fully answer; a lot of people were involved in that kind of decision. It's all open source and we are contributing to other projects, so it's not a closed source product, basically, yeah. One of your early slides said this can be used to share SBOM data; can you talk a little about that feature, how this can be used to send SBOM data around to other projects? So, about sharing the SBOM data: it's not about sharing the data, but about providing the API so that external systems can query things. So basically, the VS Code plugin would get all the purls from the current project, query them, and get actionable items back. There's no distributed sharing of the data, just an integration API. Okay, please, thank you. Thank you.
Can SBOMs become first-class citizens in Open Source ecosystems?
Thank you to the organizers for allowing me to come here. I am almost new to the SBOM community out here. My name is Salve Nilsen. I am part of something called the CPAN Security Working Group. We work on supply chain stuff and security on the oldest open source software repository system out there; it started in 1995. Let's see if we can switch the slides here. There we are. And I am here with software and all that implies: developers publishing there, and downstream in Debian and Red Hats and all the systems out there being used all over the world. More than 14,000 developers, more than 14,000 packages. So it's a real system. It's out there, it's working, and people are earning a lot of money on this stuff, so they want to keep it going. And now we have a new reality coming with legislation. So today I am trying to bring the open source supply chain perspective. These are recently finished slides, so please excuse me if I finish either early or late; I will try to do my best to make you happy. So I talked to a bunch of the people who are involved in the middle parts of this chain of events. They often say: why should I care about this stuff? We already keep track of our dependencies. We have the new formats, and this is not my problem. If you pay me, maybe we can talk. This is paraphrasing, but that's the essence of the discussions; some of the blog posts out there are literally along the lines of "I am not your supplier". It's actually like that out there, and I can confirm that notion. Then reality arrives, and the end users of all the software are obliged, by the threat of fines, to keep track of all their dependencies and what is happening with them, so that we don't get all those horrible security situations going on. That means they need authoritative and up-to-date information from the utmost upstream sources.
And to do that, you actually have to have the supply chain's bits and pieces and steps play along in this game, so that we can get all the good stuff: figure out where things come from, check them against vulnerability databases, and all that good stuff. We like that. So I was researching, looking around the documentation, trying to learn this whole SBOM thing, reading documents from the US government and all kinds of interesting organizations, like many of you probably have done. Very interesting stuff. Then I found this thing. This is from NIST. They tried to describe where SBOMs show up, and there's something wrong there. There's no supply chain mentioned in there at all. It says "third party software enters here", and there's no open source or processes or communities or anything. And this seems to be a pattern: a lot of documentation, and even some of the standards, just assume there's something going on here. Well, it isn't just "something"; there's real stuff going on. And I would like you to get a little picture of what's going on there, so let me draw a simplified supply chain. We have an author at the top who does stuff and publishes something. There's a language ecosystem they publish on. They also collaborate with others on a collaboration platform. The language ecosystems would be the PyPIs or the npms or the CPANs; the collaboration ecosystems would be the GitHubs and GitLabs and all that stuff. And they are sources for downstream packaging. Oops, sorry. There we have it. So that one, the red one, that's where I come from. That's the CPAN and PAUSE. And we care about the infrastructure, how that happens, making sure that only the right people get to upload software, and that it's published and available, and all that good stuff. But downstream of us, we have packaging ecosystems.
These folks here, that's the Debians and the Red Hats, and all kinds of places that compile stuff for their own environment and make sure it's available in a consistent manner. But they also feed into each other: downstream of Debian you find Ubuntu. And sometimes the packages here are patched because of upstream availability, or you have to backport security fixes. And there's a packager there who sometimes has to talk with a curator about which of the software pipelines a package should be published in, because some of them are LTS pipelines; you don't want to do stuff there that you can do in another one. And then, of course, you have to make it all available so that the developers at some business can do their work and all that magic, so that it can be put into some production environment and make people happy. All these boxes here: I tried to make it so that each box represents a role that cares about something that is supposed to be in an SBOM file. I'll try to be quick. So these bits here, that's actually this one, except that the third-party software arrow here, the tiny little grey one, is doing some seriously heavy lifting. That needs to stop, seriously. And there's another term: second-party software. We are not third-party software doing this; we're second party. We're partners. When people say they can get third-party software from open source: no, you get second-party. When you accept a license, you're actually getting a partner, someone you are supposed to cooperate with. Most people don't, but the partners are still there and expected, and you need to know about that, and the people who make decisions, management, have to know about this. That means anyone who writes documentation and teaches this kind of stuff needs to stop calling open source a third-party source of software. That's just insane. Calling it third-party software means that people like these, actual people working on infrastructure out there, get ignored, basically.
And that's not a good way to get the inclusion and the support from the software supply chain people and the ecosystem that you actually depend on. So, okay, who are these people? They are, in fact, your open source colleagues. In fact, they are your unpaid open source colleagues, just so you know. So stop treating them as strangers, start treating them as colleagues: talk with them, interact with them, teach them and learn from them, as colleagues do in a healthy environment. Of course, if you don't have a healthy environment at work, maybe you should go do something else, or quit, or something. So, to make SBOMs become first-class citizens in open source ecosystems, make open source ecosystems first-class citizens in the SBOM community. Please do that. Don't just put them behind a minuscule, one-pixel-wide arrow that says "third-party software enters here". That's just so bad and wrong. It's horrible. So there we have that. And, yeah, they are your partners. Please, it's a good thing to have them on your team, even if they're living somewhere else and you don't pay them. They're competent people and they actually do want to help you. But if you've treated them badly, they'll just say: this is your problem, see if you can fix it yourself. No, you can't. And really, if you want something to happen with somebody you don't have a monetary relationship with, you have to treat them as friends and with respect, help them if they have a problem, and communicate and stuff. This is the good way to operate if you want a supply chain that's in on the SBOM game. So I hope this is a message that you can find useful and adopt in your work in years to come. Thank you. If you have any questions, maybe we have room for one or two. One question? I've been involved in some of the groups that produced the things that were shown.
I think there may be a miscommunication between them, because that third-party perspective wasn't meant to offend anyone. Anybody can be a third party when you're developing, right? So I think it's just not you. But in fact, some of the work they're doing has been approaching those language communities and helping them build their own; for example, being involved with the BIPA Foundation and their efforts to create their own SBOMs. So I think if it's a miscommunication, then we just need to sit down and talk a little bit more. There might be a miscommunication, of course. You'll have to repeat that. Yes, there's a long comment here: there might be miscommunications out there. And of course, my perspective comes from one community. Other communities that may be more resourceful, like the Python community, have it easier, and of course it's not meant as an insult, and of course it isn't one. But I think my point still stands: by treating open source communities as partners, you get all the benefits, whether it's a small community like mine or a big one. So thank you for your comment. I still mean what I said; you haven't changed my mind. Thank you very much. Okay, that was it. Thank you. Thank you.
12 months of SBOMs - an experience report
Right, I'm live. I'm green. Right, welcome to the post-lunchtime slot. Okay, some of you know me because I was here last year, and basically I want to talk about what we've done for 12 months with SBOMs, and the idea that came out of it is that it's about change. So I'm going to talk about change. Some of it is about my tooling that's changed, but a lot of it is observations. A bit about me: I'm from Manchester; that's where that B is. I get asked where that picture is from. It is a security hub for innovators and startups in Manchester, trying to grow the ecosystem in the north. There's more to tech than just London, please. Normally most weekends at this time of the year I'm running around muddy fields, so I've had a weekend off from muddy fields. My background is mission-critical systems: for 40 years I was delivering mission-critical systems; think big, complex systems. So what Nicole was saying was my bread and butter; those are what I used to worry about. Now I have a startup, and I'm known as Mr SBOM in Manchester. They didn't know what SBOMs were 12 months ago; they do now. It's all about a tool called CVE Binary Tool (cve-bin-tool). This has been presented a number of times; it's a binary scanner. It came from Intel, who wanted to understand what binaries were included in their deliverables and whether they were vulnerable. A common question. It's open source, and one of the things we've done is become a Google Summer of Code project, and each year we've added more features, and I've been pushing the SBOM world in there. So we added SBOMs, then we've added severity data like CISA KEV, and we've added EPSS this year. We do a very trivial triage; we might improve that with VEX, the world of VEX. And it got a thousand stars this week, which is very good. The OpenSSF best practices angle is interesting. I don't work for Intel, by the way.
It's a challenge sometimes in terms of do you have multiple maintainers, and it's a challenge in open source when it's run by a commercial organization. So generally the releases tend to be triggered by the GSoC calendar dates. We found a little problem this week, which is why we didn't release this week, but anyway, it's very close to having a new version with all the EPSS stuff formally released. And then there's a tool that I write, which hasn't got a thousand stars yet, which is a Python SBOM generator: I take the installed Python environment and work out all the dependencies and build the SBOM. Think about Python: there are lots of direct dependencies, but what are those transitive dependencies? I'm agnostic to which flavour of SBOM it is; I've always supported both CycloneDX and SPDX. Initially I wrote my own parsers and generators; I do want to migrate to the stable libraries, it's all about time. But what I'm pleased about is there's a benchmark, which we'll see in a minute, and this is the first tool to get a benchmark score of 10 out of 10, and they'll explain why you get 10 out of 10. It's quite hard, because the ecosystem needs to play together. And this is what I do: generally I just enrich it when I get the time, and I've got a bit busy the last six months, which is why it's slowed a bit. So generally the sort of things we've been doing is adding more stuff into the package information using SPDX, trying to get as many of the package attributes into the SBOM as possible, because you want enrichment: the more data you have, the more useful it can be for more use cases. That's hard, because that data is not readily available. So what have we done? This came out of a conversation at a monthly open source meeting that we have: it would be nice to work out how much change we have, and is the SBOM going to tell us what those changes are? So we put in a GitHub Action that runs at about two o'clock in the morning.
We use a clean virtual environment on Ubuntu, and that's quite an important thing because it's going to come back later. We then install all the dependencies, and then we generate the SBOM in the different forms: whichever version is the latest flavour, we'll generate it. And 3.12 will be added, I think, maybe this week or tomorrow. But generally we just run it for the supported versions of Python. So, a little bit of a digression about Python dependencies; if you were in Node or Java and things like that, everything's got its little quirks. The thing about Python is it tells you what the direct dependencies are, and it can tell you a bit about the environment. So if you're working in Windows, you may have different dependencies than if you're working in another environment. But it says nothing about the transitive dependencies. How much is hidden? So let's look at an example. This is a subset of our requirements file. At the top you've got aiohttp. It's got a constraint on the version saying the minimum version we require is 3.7.4, but it's also got optional extras as well. So straight away you've got two potential ways of installing aiohttp, with or without that additional component. Look at beautifulsoup: any version will do. And then look at these down here: importlib-metadata only installs if the Python version is less than 3.10, and it's got a constraint, and similarly importlib-resources, again only below 3.9. Because the Python standard library changes, the language ecosystem is part of your dependency picture. And you can see the number of dependencies gradually change over time as you add more features. But what you get, that's what you really have, that's the hidden part, the iceberg. It looks quite like an iceberg actually. That's one of my tools, and the green are all the transitive dependencies. Look how deep that is. That was fascinating.
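The hidden transitive dependencies he describes can be surfaced with the standard library alone. A minimal sketch, not the speaker's actual tool, which is more thorough about extras and environment markers:

```python
import importlib.metadata
import re

def transitive_requirements(package, seen=None):
    """Recursively collect the names of everything `package` pulls in,
    directly or transitively, from the installed environment."""
    if seen is None:
        seen = set()
    try:
        requires = importlib.metadata.requires(package) or []
    except importlib.metadata.PackageNotFoundError:
        return seen  # dependency not installed in this environment
    for spec in requires:
        if "extra ==" in spec:
            continue  # optional extras are only installed on request
        # The name is everything before whitespace, brackets or operators.
        name = re.split(r"[\s;\[\(<>=!~]", spec, maxsplit=1)[0]
        if name and name not in seen:
            seen.add(name)
            transitive_requirements(name, seen)
    return seen

# Direct requirements say very little; the full closure is the iceberg.
print(sorted(transitive_requirements("pip")))
```

Running this against a real project's top-level requirements is what produces the iceberg picture: a handful of direct names at the top, and the much larger transitive closure underneath.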
A picture is worth a thousand words, I think we all agree. If you really zoomed in, I've put the license values in as well; that's an interesting thing if you wanted to do some analysis. But visually that's quite an eye-opener, and we've only got 60 packages there. So what have I observed by looking at all the data we've collected? I want to look at the context, a bit about quality, a bit about velocity, which was the original thing, what's about change, and then other things I've discovered in the analysis. Generally this is all out of GitHub, so I wrote a little utility to download the file history so I could quickly analyse it locally. And I ended up writing a little tool called sbomtrend which turned it into a JSON file, so I could then play around with it to generate the pretty pictures you're going to see. So, first thing: there's nothing in any of the SBOMs that tells you it's Python, or which version of Python, or which Python environment. Now maybe sbom4python might, but that's actually quite important, because you're going to see in a minute the difference that makes. Because if you just get an SBOM and you don't understand its context, how do you know whether this is a real representation of the environment you're using? Picking up what we were saying in the previous talk, CycloneDX has user-defined properties which you could use. SPDX doesn't yet; you could use comments, but it's a bit harder. So I use CycloneDX properties just to say language: Python, language version, and so on. I think that's quite an interesting thing, and it's good that SPDX is looking at that, because I think we need it; it's quite an important thing. And this is what you get.
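Recording that missing context is straightforward with CycloneDX's name/value properties. A hedged sketch; the property names here are illustrative, not an official CycloneDX taxonomy:

```python
import json

# Attach language context to the SBOM metadata so a consumer knows
# which interpreter and environment the dependency set was resolved for.
sbom_metadata = {
    "metadata": {
        "properties": [
            {"name": "language", "value": "Python"},
            {"name": "language_version", "value": "3.12"},
            {"name": "environment", "value": "ubuntu-22.04"},
        ]
    }
}
print(json.dumps(sbom_metadata, indent=2))
```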
If you plot all the different versions of SBOMs across the year, for the different versions of Python, it stops at 3.7 in the middle of the year because we stopped supporting it, but you see a trend. So that's the requirements trend, and you see it sort of follows it, and then there's a few other bumps. We didn't change it; the outside world changed. And sometimes you see it drop, and that's because a package ceased using a dependency. It wasn't obvious until I did the digging, but that's what it was telling me. It's quite interesting. So the lower versions of Python have a few more dependencies. You can probably sort of see that from the requirements file, but the requirements file is lost in the SBOMs; it's not there. So there are differences. Transitive dependencies vary independently of your direct dependencies. You could probably guess that, but it's quite interesting to see the evidence. And the later versions of Python have the fewest dependencies. So that's a good way of saying: don't just update your packages, update your language versions as well. So let's look at quality of SBOMs, and we could probably have a whole conversation about this, a whole conference about it. I've chosen four tools because they demonstrate four different things: the SPDX one, which checks whether it conforms to the NTIA minimum standard; the scorecard, which comes from eBay, not the OpenSSF one; something called sbomqs, which is from Interlynk; and one from me, because it showed something else that I discovered on Friday which was really interesting. So first of all, NTIA. We are no different from day one to today. We're still the same, because we still fail to get all the suppliers. I would like to see how many people can get that on a real project. You can get it for small projects, but not for real-life projects. I think we all recognise that.
Then the eBay one. One of the things they look at is package identifiers, which goes back to the 10 o'clock talk about purls and things like that. I didn't have purls at the start of the year; now I do. So my score went up. Enrichment: the message is enrichment. Good, and licenses have probably got better as well. Then sbomqs. This is done by Interlynk. I don't know where the idea came from, but they have a whole load of different things they're looking for, like licenses: do you have licenses that are still supported, or deprecated licenses? Do you have checksums for your packages, etc.? That was a target: how can I get a better score? So we get to 9.6. If you go on their website, most of the published scores are in the sevens and eights; a lot of the containers are sevens and eights. So I'm quite pleased I can get to that level. The reason it's not 10 is because of the supplier failings, same as with NTIA. And then I have a tool called sbomaudit. The reason I wrote it was: could you use the SBOM to drive policy? So if I've generated an SBOM and I've got a license like GPL and I don't want GPL in things, can I have an allow list or a deny list of licenses, for example? That was the use case I came up with. But I also use it to check I'm using the latest version of the products, which was the other thing I wanted to try. So I was getting a reasonable number, and the number of checks increased because I had more packages. Well, this is the interesting thing I found. The scan here is from last weekend; I scanned it on Friday. I was expecting to get 100%, all the files at the latest versions. Four of them got updated last Tuesday, which is why the green ones are happy. But there were a couple that hadn't changed. That got me thinking: why don't packages change? Pinning. The convention in the Python world is probably not to pin. These are indirect dependencies; I've no control of those.
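The supplier failing he mentions is easy to see if you run a minimal NTIA-style check yourself. A sketch over the SPDX 2.x JSON field names; the real conformance checkers test more than this:

```python
NTIA_PACKAGE_FIELDS = ("name", "versionInfo", "supplier")

def ntia_gaps(spdx_doc):
    """Return (package, field) pairs for every missing minimum element."""
    gaps = []
    for pkg in spdx_doc.get("packages", []):
        for field in NTIA_PACKAGE_FIELDS:
            value = pkg.get(field, "")
            # SPDX writes the literal NOASSERTION when a value is unknown.
            if not value or value == "NOASSERTION":
                gaps.append((pkg.get("name", "?"), field))
    return gaps

doc = {"packages": [
    {"name": "rich", "versionInfo": "13.7.0", "supplier": "NOASSERTION"},
    {"name": "toml", "versionInfo": "0.10.2",
     "supplier": "Person: William Pearson"},
]}
print(ntia_gaps(doc))  # [('rich', 'supplier')]
```

The NOASSERTION supplier in the first package is exactly the kind of gap that keeps real-world SBOMs from ever passing the NTIA minimum.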
And I haven't quite got to the bottom of finding out where the pinning is happening, because they're not even at the first level of the direct dependencies. So that was the reason that slide was there: I did an SBOM scan, and I got a vulnerability on my rsa package. And the reason was I'm not using the latest version of rsa. So that was the sort of thing: could you detect that? That's something I only discovered this week, which I thought was really interesting to share, just because I happened to have that tool. So: NTIA is a good benchmark. It's hard. Accurate supplier information, I think we all know the challenges of that. But data enrichment is good. Can you enrich these things? Look for that threshold; look for that utopia moment where you get 10 out of 10 for your SBOMs. Because the more information you have, the more useful it's going to be for all the different use cases people are going to use your SBOMs for. And it is possible. So this was the original use case: what's changing, what's not changing, who's changing what? These plots are all driven by Matplotlib; they're in the sbomtrend tool, so if you want to play with them as examples you can. The top is the number of packages; the red line is the number of changes on a week-by-week basis. Every week at least one package changed. At least. Which is good: the ecosystem's alive. But what are the triggers for those changes? You can see some of the spikes relate to when we did an update of the requirements. But generally things are changing all the time. And I was trying to show what the rate of change is and things like that. So I came up with this sort of step diagram.
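The week-by-week change counting behind those plots reduces to a set comparison between consecutive SBOM snapshots. A sketch of the idea; sbomtrend itself does considerably more:

```python
def package_changes(old, new):
    """Diff two {name: version} snapshots taken a week apart."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(name for name in set(old) & set(new)
                     if old[name] != new[name])
    return added, removed, changed

# Two hypothetical weekly snapshots of the same project's SBOM.
week_1 = {"aiohttp": "3.9.0", "rich": "13.6.0", "toml": "0.10.2"}
week_2 = {"aiohttp": "3.9.1", "rich": "13.6.0", "urllib3": "2.1.0"}
print(package_changes(week_1, week_2))
# (['urllib3'], ['toml'], ['aiohttp'])
```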
A steady line going up like that means it's changing every week. Except for the holiday. What? Except for July and all that. Oh yes. Well, I think we can understand why. Actually, that's probably quite a thing: look at time. Time is actually a driver as well. Do lots of things happen at Christmas? Do lots of things happen in holiday periods? Interesting. Well, I think we've seen problems where people have released something on Christmas day. But yeah, that's a really good observation; I hadn't thought of that. So you can see these things. And these are just the ones that have changed more than five times in a year, which is what, 20-odd packages. And then if I look at the ones that frequently change, quite a few of them are direct dependencies. Why are they changing? Most of them are features, not vulnerabilities. But can you find out? And there's one, rich. Why did it change? They actually removed an unmaintained package, which then got me on another little track, which you're going to see in a minute. So yeah: security fixes aren't the drivers for many of these changes; features are. And then if I look at the direct dependencies, again, they're going up. Some of them are changing a little bit slower. So again, you're getting quite a rich picture of change, which says: if I'd pinned on the first or second of January, I'd have missed all these changes. A lot of changes. The features may be performance improvements, et cetera; you might want them for good reasons. And this is the ones that have only changed once, that haven't changed essentially.
And the red ones are the ones that haven't changed in two years; I just took two years as an arbitrary value. And you think, well, there's 10 of them that have not changed in two years. Does that not start ringing a bell? Is it maybe an unmaintained package now? I don't know what industries use in terms of looking at the health of an open source project. Is two years long enough? And it says maybe we need to look at alternatives. Right at the top there is tomli, which is now a standard library module in Python 3.11. Until I did this, I'd missed that. So I then raised a pull request to say: if it's 3.11, we want to use the standard library, not the open source version. On the probability that the language ecosystem libraries are going to be better maintained, or have a greater need for being maintained, than necessarily a community package. Right? So change happens. But we should be very careful of pinning, because direct dependencies change frequently as well. So there's a pinning debate. Right, let's look at data analysis. Let's look at the first thing: licenses. I've tried to look at the SPDX license IDs. When I get the metadata, I try to map it, and if it doesn't quite match, I have a few rules to try to alias them. So is "Apache 2", with a space, really Apache-2.0? That type of thing. And Apache 2 is a really good example: people don't know how to write the Apache 2 SPDX license ID. Yes? Are you pulling this from PyPI? Some of these come from PyPI, yes. Yes, PyPI is a disaster in terms of specifying licenses. Right, you're preaching to the converted here. As a community we should be looking at this and fixing it, because many of the packages that have got license failures have been updated in the last 12 months. Probably because they've got features, but metadata doesn't really matter, does it? Metadata matters now in the world of SBOMs.
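The tomli-to-standard-library switch he raised a pull request for amounts to a small import guard. A sketch, assuming the third-party tomli package is available on older interpreters:

```python
try:
    import tomllib           # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for 3.10 and earlier

config = tomllib.loads('[project]\nname = "example-project"\n')
print(config["project"]["name"])
```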
Let's look after the SBOM metadata as much as the code and the tests. Told you, I told you. Right. So I summarise all the licenses and things like that, and you can probably quite quickly get a summary: have you got a license problem? Okay, cve-bin-tool is GPL; everything underneath it is okay, but you may be able to see quite quickly whether you've got a license compliance problem. The other thing is you can look at all the suppliers. Do you have a supplier that you really need to be loving and looking after, because you're very dependent on their packages? In this case we've got 60 different providers, so it's not quite so obvious. But this could be a way of understanding who your dependent suppliers are that you need to be getting closer to, maybe supporting, maybe helping. And I'm thinking about the world of the enterprises as well, who might need to do this. But four of these packages have no suppliers. Three of them were updated; why didn't they update the metadata? And then just a summary: I've got a tool called sbomdiff that just diffs two SBOMs, format agnostic, so you can compare CycloneDX and SPDX. Just to see generally what's changed in the 12 months: well, 39 of the packages have had at least one version change, and we've lost two packages and gained 11. So that's about 15% growth in the number of packages we depend on. And then I did a scan. The last thing is I was expecting the last SBOM to be clean of vulnerabilities. The reason I've got one vulnerability is because of the rsa problem that we heard about earlier. Potentially. So, takeaways; I'm doing all right here for time. Right: generate your SBOM for each version of the supported environment you're using.
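The license aliasing and summarising he describes can be sketched in a few lines; the alias table here is a small illustrative sample, not his actual rule set:

```python
from collections import Counter

# Map common free-text license strings to SPDX identifiers.
# These aliases are illustrative guesses, not a complete mapping.
ALIASES = {
    "apache 2": "Apache-2.0",
    "apache software license": "Apache-2.0",
    "mit license": "MIT",
}

def normalise(license_text):
    """Return the SPDX ID for a known alias, else the text unchanged."""
    return ALIASES.get(license_text.strip().lower(), license_text)

findings = ["Apache 2", "MIT License", "Apache Software License",
            "GPL-3.0-or-later"]
summary = Counter(normalise(f) for f in findings)
print(summary)
```

Once findings are normalised, a simple Counter is enough to answer "have I got a license compliance problem?" at a glance.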
So if you're doing Python 3.8 to 3.12, generate five SBOMs, and also do it for CycloneDX and SPDX, because the generators may produce different data; there may be different enrichment between the two. Please, as a community, can we improve the metadata? We are all responsible for that. Once you've got an SBOM, that's the start of the fun: start analysing it, start using it, start reading it. I'm sure many of you are quite familiar with reading JSON; help the people who aren't. Look at the data; there are some documentation tools there you may find useful. This is the thing that we do when we install: we install with pip with this upgrade strategy, which tries to make sure we're using the latest version of everything. But obviously that doesn't stop pinning, so it's interesting; I need to think a little bit more about that with the Python teams. Keep your packages up to date. I have a problem in my own things because I just do pip install and it'll say, oh, I've already got beautifulsoup, that'll do, and it's not the latest version, I'm sure. So just be aware, and use the latest version of Python. I have another tool called sbom4files, which looks at the files, so you can look at the change of files as well. That's a bigger thing. Could you start to see the amount of change in, say, one of your source trees in your repos, or the test files changing, for example? And then obviously add vulnerability scanning as part of your generation. So this is what you all probably want: the list of the tools. The presentation will be on the cve-bin-tool repository; there's a pull request in there, it just needs to be approved. Those are all the tools; I haven't written all of them. And if you want to follow me, that's me on LinkedIn, that's me on GitHub, and that's me in Manchester. Okay, thank you very much.
So, on your list of increasing hidden dependencies, is each node a package or a package version? Okay, this is about the picture, the one that showed the hierarchy of all the packages. They are all packages, Python packages. So if you have two different versions of the same package, do they appear as one? No, they would appear twice if you had that; I've never seen it. Oh, okay. Well, I would say that you live on an isolated island; we are in the union. And we have a presentation coming up. And I've got a question: when you say that updates are driven by features, do you see that changing in the future? Okay, the question is: I mentioned that a lot of the updates appear to be driven by feature changes rather than security fixes; do you think that will change with things like the CRA? Probably. It depends. Yeah. One more. Okay, this is about improving the metadata upstream. Probably the two things I would say are: licenses, to support the license compliance teams; and secondly, the supplier, because that identifies where you've got your software from, which a large organization needs to know. What can we do to help do that upstream? Use SPDX tags in your Python modules. You can do a pull request. Yeah, I mean, recognize that there is a community out there. If you've got the effort, do it. Because we know the open source community is stretched, because of volunteers. If the enterprises are taking value from it, can they usefully contribute back? Because you're going to help many people. All right, that's time for me.
Sharing and reusing SBOMs with the OSSelot curation database
All right then. Thank you. Thanks for the handover. All right. So my name is Caren, I work for OSADL, and I'm going to take a step back from everything we've been discussing here today about what capabilities SBOMs have, what more capabilities they need, and the tools to create them, and go back to what I think they were originally for, what we originally thought we should use them for: license compliance. And there we have a lot of areas where we're still redoing the same work again and again and again, because creating SBOMs doesn't work automatically, at least for most of the software that we're dealing with in embedded Linux systems. There's still a lot of manual work required, and that's where sharing and reusing work makes sense, and this is where the OSSelot project comes in. I think this is fairly obvious; I don't need to go into why reusing makes sense. We don't want to redo work that has been done before and is being done again and again. I mean, we still get these questions every day: why do I have to extract copyrights from Linux kernel source code? Someone must have done that already. Why can't we reuse that? So why not do that, and why hasn't it been done before, and what exactly can we do? So, roughly, a compliance toolchain could look like this. We can't share work everywhere, but we can share work where most manual effort is required: with scanning and with curating data. Because as good as the scanners out there are, and we've heard about ScanCode, we've heard a lot about ORT, all of the scanners, all the tools that use the scanner results, they're really good, but there are still quite a lot of mistakes. So to actually do license compliance properly, we still need to do manual curation of the data.
And this is where the OSSelot project comes in; you can find more information on the OSSelot website. The data itself is available in the open source compliance repository, and in the package analysis repository you'll find stuff that you can already use today: license and copyright analysis results for various packages, mainly from embedded Linux systems. We have about 320 when I last checked; different versions, of course, of around 200 unique packages, more than 1.5 million files that have been manually curated. For each package we have some metadata: where the package comes from, a package URL to identify where the package comes from, download location, and so on. Then there are the SBOMs. The SPDX SBOMs are what we're focusing on for license compliance, in different formats as well, with the license conclusions in there, with the copyright notices. And, I think probably one of the most valuable parts of this, with comments on why a particular decision was made. Because sometimes it's not clear; you can find information in a file, and you know how licensing information is noted in some files. It doesn't really follow any standards, especially in older software that's still being used. And then you have to make a decision; you have to do some kind of interpretation. And this is explained as part of the SPDX files that are available there. Also, the SPDX files themselves are explained, because what we find is that even though there is a standard, a specification, people still understand it differently. So someone might expect to get an SBOM from their supplier and have a certain expectation of what a particular SPDX file looks like, but the supplier understands the different tags differently than the customer. So we have a clear explanation of how we understand the SPDX tags, and of course we try to be as close to the specification as possible.
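The per-package metadata she lists maps onto standard SPDX 2.x package fields. A sketch of one entry in the JSON serialisation; the concrete values are for illustration only:

```python
import json

# One curated package as it might appear in an SPDX-JSON document.
package = {
    "name": "zlib",
    "versionInfo": "1.3",
    "downloadLocation": "https://zlib.net/zlib-1.3.tar.gz",
    "licenseConcluded": "Zlib",
    "copyrightText": "Copyright (C) 1995-2023 Jean-loup Gailly and Mark Adler",
    # A package URL pins down exactly which artifact was analysed.
    "externalRefs": [{
        "referenceCategory": "PACKAGE-MANAGER",
        "referenceType": "purl",
        "referenceLocator": "pkg:generic/zlib@1.3",
    }],
}
print(json.dumps(package, indent=2))
```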
And then also for convenience there is a disclosure document: if you are reusing a particular package in exactly that way, unmodified and in exactly that version, you might for license compliance just use the finished disclosure document, with all the license texts and copyrights and acknowledgments and so on aggregated. Of course it's not yet big enough to immediately license an entire system, but it is definitely a start. So, as I said, the question why this hasn't been done before has been around for quite a few years. And I think two of the main reasons are liability and trust, which are more or less two sides of the same coin. On the one hand, who was willing to supply such information, which is legally relevant if we're talking about license compliance, where companies have gone to court over licenses? Who was willing to provide this and say: look, you can use this, and we don't give you a guarantee, but we did our best to make this documentation as sound as possible, so that hopefully you won't be taken to court if you use it. And on the other hand, you're a company putting out products, and you reuse legal information that you found somewhere on the internet. How can you trust this information? These are the thoughts we were thinking when starting this project. So, how can we limit liability, first of all for ourselves and for anyone who's contributing? Of course we asked some lawyers about that, and the idea was to license as liberally as possible. So we went with CC0-1.0, which gives you as many rights as possible, and it works well for documents as well. In this case the rules on gifts apply, and liability applies only for willful intent and gross negligence, which we try to avoid. Also, I think the times have changed.
Maybe ten years ago there was a lot of worry, especially from the US, that there were going to be lawsuits in the open source area. But there haven't been any, not over providing legal information or providing support with licensing; there haven't been any, or none that are known. And so I think people just got braver and said, okay, maybe now's the time that we can do this. And then on the other hand we have trust. How can you establish trust in the information? I think that's fairly straightforward: provide good quality. Do the curation conscientiously, diligently; only let people contribute who actually know what they're doing, so train anyone who wants to contribute. I mean, it's a bit of a bigger hurdle for contribution, but it's really important to keep up the quality. The same goes for review: the stuff is on GitHub, so we can use that for the review process. And we stand behind it with our name, to promise that we'll keep the quality as high as it started out. So what are the curation guidelines that we established to ensure this quality? Well, we're working with Fossology; I think that's just our preference, you can use any other tool as well. We're using ScanCode as well for scanning, integrated into Fossology. And we use source code from as far upstream as possible, ideally directly from the project page, to not go through any of the stages that we've seen on some slides before where stuff gets added by package managers. The diffs of what's added by package managers are something that could be included as well, but we're not there yet. So at the moment we're still trying to go with the origin. And then there's curating the licenses; as I said, there's manual work in there, and I think that's the valuable part of this project.
So license findings and copyright findings that the scanners have created are curated manually, of course with all the help that Fossology can give with that, following our curation guidelines. I'll just check the time; I don't think I'm going to go into too much detail on that. If you have looked at scanner findings, you know why there is still some manual work required. With copyrights it means mainly that stuff that was incorrectly identified as a copyright is removed; stuff that is attached to a copyright notice but is not really part of it, formatting characters, sometimes license notices, or just bits of code identified as part of the copyright notice, is removed from the copyright notice. And then there might be references to external files as well, like "copyright by the authors", "the project authors", "see file AUTHORS", and then this information has to be added as well. With the licenses, again, review is done on file level: every file of the source code tree is inspected if the scanner has found anything, or if it is mentioned in some kind of notice file or similar. And this is done in addition, even if a package contains some kind of metadata on licensing. Because our experience has been, and probably a lot of you have made it as well, that metadata just gets outdated or is incomplete, and so can't really be trusted entirely. And I think that might also be one of the reasons, I can imagine this question might come up: why do we keep all this information in a separate place? Why not upstream it into the upstream projects? I think there is some reluctance in upstream projects to provide legally relevant information along with the source code. And also because then we would again have the same problem that it just won't be updated; it's just how people are. And, okay, so we do curating on the file level. We confirm or correct scanner findings.
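The copyright cleanup step can be illustrated with a tiny, hypothetical filter; the real curation in Fossology is manual and far more careful than any regex:

```python
import re

def clean_copyright(raw):
    """Strip comment markers and trailing code from a scanner-extracted
    copyright notice; return None if no notice is actually present."""
    stripped = raw.strip().lstrip("/*# ").rstrip("*/ ")
    match = re.search(r"Copyright\s+(\(c\)\s*)?[\d,\-\s]*[^;{*]*",
                      stripped, re.IGNORECASE)
    return match.group(0).strip() if match else None

print(clean_copyright("/* Copyright (c) 2001-2024 The Project Authors */"))
print(clean_copyright("int x = 0;  /* no notice here */"))  # None
```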
We add individual license texts as found, especially with BSD licenses and so on; this is something that's not usually done by scanners. We only tag main licenses if there is a clear main license given in the root directory for a package, to not mislead anyone into thinking this might be the only license in there. And as I said before, the license comments tag of the SPDX file explains any license decisions, any curating decisions, that are not obvious or that need some level of interpretation. Yes, please? What's your correction rate on average? Do you mean how many scanner findings we have to step in on? Well, that differs. Sorry, the question was what our correction rate is. It differs heavily per package. There are some packages that are in really good order where, I don't know, I don't have a number, I would guess around 10%, and there are packages in horrible shape where it's closer to 80% that needs manual work. Yes. I've processed more than 3000 packages and I would agree, but maybe 20% in general. Yeah, so that was some agreement about the numbers from someone who clearly has more experience than me; I can't say I've done 3000. That's because of the grey hair. Maybe you were just talking in detail about the clearing process at Siemens? Well, you might guess that there might be some connection there. Okay, so what do these license comments look like? They also follow a kind of heuristic. It usually says: the information in the file is, quote, whatever the information in the file is. And then we give a reason for why we made whatever conclusion is concluded. Example: we don't have a version of a license given; we find "this file is GPL". Then the license comment would be: as no version of the GPL is given, GPL-1.0-or-later is concluded. But we interpret, and this clearly is an interpretation.
So this is a legal step. When we find "this file is GPL", one could also say that the most heavily used version of the GPL, I think, is still version two — so you could also go ahead and conclude they probably mean version two, because it's the most heavily used. But our interpretation here is: if they only say GPL, the author wanted to give us the option to choose whatever version of the GPL is available, so GPL-1.0-or-later is concluded. So this is something that is a step of interpretation, but it is explained in the data. Or, for example, a URL is given instead of a license text. And then of course the URL is checked, and a date is given. If anything was found — more often than not the URL is dead as well — then maybe additional research is required. And then of course the information, and the date when that was checked, is given as well. So what are — yes? But in that last case, do you report it to the package itself? Yes, yes. So in case we do find problematic things, we report them back. I mean, there are some licenses that have a URL in the license text that is dead. And then people usually say this license is outdated, but it's still valid for some files that are out there and that are being used. So sometimes it's helpful and projects react, but sometimes not. But yeah, whenever possible, we try to report it back. That was also the question, sorry for not repeating it: the question was if we push it back into the projects — yes, we do our best to do so. And then, going forward on what you need to comment: if the upstream doesn't take it, what's the hit rate in terms of them ignoring it? So the question is what the hit rate is in terms of them ignoring it — it's not large. Mostly they do take it, because most projects are interested in being license compliant as well, or in making it possible for users to be license compliant. Because that's what we're trying to do.
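As an editor's sketch, the license-comment heuristic described above — quote what the file actually says, then record the reasoned conclusion — could look something like this. The function name and structure are illustrative, not part of any OSSelot or FOSSology tooling:

```python
# Hypothetical sketch (not project code) of the curation heuristic:
# quote the finding, then explain the concluded license.
def conclude_license(finding: str) -> tuple[str, str]:
    """Return (SPDX conclusion, license comment) for a scanner finding."""
    if finding == "GPL":
        # Unversioned "GPL": the author left the version open,
        # so the widest interpretation is concluded.
        return ("GPL-1.0-or-later",
                'The file states "GPL". As no version of the GPL is '
                "given, GPL-1.0-or-later is concluded.")
    # Unambiguous findings pass through with a plain comment.
    return (finding, f'The file states "{finding}".')

print(conclude_license("GPL")[0])  # GPL-1.0-or-later
```

The point is that the conclusion and the human-readable reasoning always travel together, so a downstream reader can audit the interpretation.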
We're trying to do what the project or the authors wanted us to do. We're trying to make it possible for users to be license compliant. So projects are usually keen to take fixes, or to help. So, you asked about the rate before. I couldn't give you exact numbers, but I can give you some examples of what kind of scanner findings we do have to correct. Well, I think they're fairly typical, and if you've done any curation you'll know most of them, so I'll go over them fairly quickly as well. So we have "not a license": something has been found that just simply isn't a license, but a bit of code or just whatever. So that's removed, of course. It might be not the file's license: some information that's content of the file but isn't the file's license. We have that in documentation quite a lot. Then license texts. This is something that of course scanners get wrong, and I don't think there's any way to fix it either: if you have a LICENSE.txt file that contains the license text, then of course the license of that file isn't the license text. Most licenses don't have a license themselves — but the GNU licenses, for example, do have a license. So this is something that's corrected as well. With generic license texts — I said that before — individual texts are added if they differ from the generic license text. Of course we have imprecise findings, in particular those with respect to the version of a license. Then dual-licensing cases, especially if it's not a simple dual license where you have this or that license, but this or that and a third license, or this one license and then a second or a third. So these need some manual work as well. We have license exceptions that we handle a bit differently than FOSSology does, to bring them into one finding — but that's maybe particular to FOSSology as well. We have external references that need to be checked. As I mentioned before, it might be URLs.
It might also just be external references within the package, though. And there are a lot of problems there, because then you have files that are integrated from a different project, and in their file they say "look in the copyright file in the root directory" — but they mean the root directory of where they originally came from. So that information isn't true anymore, and we need to do some research, and then of course explain what research has been done to find out where the file originally came from and what license it is referencing. Yeah, so that usually takes a bit of effort. And then we have global license assignment, or partially global license assignment, that we don't usually use — again, for the same reasons I said before: that meta-information is usually wrong, or that stuff is included from different projects. So if there's a readme file that says all files in this directory are licensed under the following license, we usually don't go with that information. Unless a particular source code file says "for license information, see readme file". So this is something that I think just comes from experience. Yes? There's a package-manager field where there's a specific license field, that's filled out with the proper SPDX identifier — do you apply that to the…? Okay. The question was about package managers that have a license field, a tag for what the license is. So at the moment that hasn't come up much yet, because we're fairly at the bottom — we come from Linux-based embedded systems. So far we haven't gone into much that is managed by any package managers. But for the stuff that I have looked at, it depends: if that's the only information that's there, we'll go with it. But there might be different information again in the source code. And if possible, we'll always go back to the source code.
But if we do have third-party or meta-information, we also add that to the information in the package — the license comment, I think, is where we add that kind of information. Yeah. I think there was another question. Yeah, it's fine. Okay. Yes. So the project seems to be mostly organized for collaboration among humans, and not really for consumption by machines. For instance, there is no API. Yes, yes, there is a REST API. But, for instance, package naming seems to be quite vague — there are uppercase, lowercase… Well, yeah, you pointed out something. The question was about whether the project is made for human consumption or for machine consumption, automatic consumption. And I said that there is a REST API to call the files, which is not described in the repo, but on the OSSelot website — if you go to osselot.org. And then the question was about the naming schemes: we try to be as close to the upstream naming as possible. But then again, they're not consistent usually. So we didn't make up our own schemes; we try to stay with the upstream, and where there are inconsistencies, well, we mirror that. Yeah, that's right. Do you know where the API is described? Because even on the website, I cannot find it. On osselot.org — oh, it might be under tools actually, sorry. On the wiki: wiki.osselot.org. Try that. Yeah. So, lately I found in some license listing that someone was using libuuid, and that was listed as GPL only. And then the readme file tells me: various source code in this package has different licenses. And looking at the source code of the functions we are presumably using, it says, oh, it's not strictly GPL. So, not poison from a proprietary business point of view.
And how do you express that kind of — I mean, we had the discussion before regarding vulnerabilities, that you need to get back to function level. Do you foresee that necessity in your work also, or do you strictly handle packages? We handle files. So, the question was about — as I said before — the meta-information being imprecise, let's say. So we go back to source-code-file level, and what we find there, we believe. You're right, we might have to dig deeper, but then it's over to snippet matching. So we only assign something to a package if there is a clear main license, but we also warn: you can have this information and take it, but don't take it as the only information that's there. Okay, are there more questions? Yes. Thank you. What you're doing, I think it's great. I was wondering about the upstreaming of the information that you're gathering: first you said, well, upstream is often not interested in it, and then you made a statement that they really like to be license compliant. Do you have statistics on that? What is your gut feeling about this? Because my personal experience is that many people have interest, but they need help with it. I would say, to achieve license compliance we need to help them — and you have a great data set to actually help them. Yeah, yeah. So, what our experience is — oh sorry, the repetition: the question was about upstreaming, what our experience was with upstreaming, how the projects react to that, and you said your experience was that they're keen about license compliance. Well, yeah, most of them are — there are exceptions, always — and we have that as well, with concrete and particular cases, when we say: we found this problem with this file, can you fix it, can you clarify?
Then most of them — really, the vast majority of them — are fine. But, well, I have to admit we haven't tried with that many projects. If we say we did a complete license analysis of your entire package for this and this version release, here's the SPDX file — then they're not as keen to provide that via their website, because that is legally relevant information. I think we had one or two projects who were like, oh that's cool, we'll point to your site, but we're not going to provide it through our stuff — because there is interpretation in there, as I said before, which we explain in the comments. There's some interpretation in there, so there's some wiggle room. And I don't know, maybe we could reach more with more effort. Okay. There are a few more questions — do we have another minute, or is time up? Okay, so, well, contact me anyway. I'll skip to the last slide — yep, sorry. No, that's good though, I prefer discussions. So contact me at info@osselot.org and we can chat anyway. Okay. Thanks.
Welcome to the Devroom and Announcements
Welcome to this year's devroom on software-defined radio and amateur radio. It's been a bit of, you know, a harsh year for this place, actually — we're very happy that we made it here. Last year, the SDR community didn't have a devroom at FOSDEM, which was really sad. And I'm really happy that this year we don't just have our own devroom, but a devroom together with the amateur radio community, which obviously has a lot of overlap. And so this will be a slightly more diverse presentation than we might be used to — this is super nice. I'm Marcus Müller, I'm one of three devroom organizers. Do you want to introduce yourself? I think if I have to — my name is Paul Merr, I'm from Switzerland, obviously. I'm a software developer, and the stuff I do in amateur radio is mostly developing software. I'm also very happy we're together with the SDR guys here, because that's a field of activity for the amateur, so it's very interesting. Yeah, so obviously, I'm Marcus Müller, maybe known from the GNU Radio project. I'm very happy to work with the amateur radio folks here because, well, the application follows the tools, in a way. The third person — I haven't seen him yet, so I hope he comes in, but we'll start without him. So a couple of things that I'd like to ask the audience: of course, clean up after yourself. So if you leave, look whether you left some bottles or something, because otherwise things will get hairy. We are not overfilled today, which is a new thing for us — usually the SDR devroom was so packed that we had to arrange for people to stand not in the escape routes. I'd like to ask you that if you see someone who's blocking an escape route, talk to them. The other thing I'd like to ask is if we can find a volunteer to occasionally check the online stream for this room, and check whether there's something in chat that someone writes, like "we can't understand the speaker" or something — let us know. So that would be the organizational things.
So, coming to the content of things. Hi — you made it! This is our third organizer. Come over here. I'm a bit late. Yeah, come on, come over. So, for the speakers in the room: the camera is over there, and you can see yourself on the small screen there. If you can't see yourself on the laptop, then you're not on the screen. Content-wise, we've got a pretty diverse collection of things, and we tried, during selection and scheduling of the talks, to group them a bit, so that people who want to go to other devrooms can stay for more than one talk before leaving. So we start off with me, obviously, and I will give a really, really brief introduction to what happened in GNU Radio since last FOSDEM — which, honestly, is going to be a bit opinionated, because it's what I think is worth mentioning in this context. We go over to Sylvain, who's going to talk a bit about using GPUs to improve the throughput in SDR computing — this is very interesting to me. We go over to Mark, who will then talk about a more modern approach to controlling transceivers than most of the tools that we have today. Then we go to the radar and satellite group of topics. So we start with Jean-Michel talking about — sorry, lost it — synthetic aperture radar. We'll follow up on the QO-100 payload, and we'll close that part of the satellite things with nanosatellites. I'm not going to go through all of these, I just realized. The next topic is basically SDR architectures and SDR application software, and then we'll go into cellular and radio science. So this is our rough rundown. I'm pretty good on time; I could now start doing my next talk, but I guess we'll take the opportunity: if there are any questions, ask them now. Okay, so — yes, if it's loud, then close the door, please. Yes, of course, yes, he did. So that's true, we do have a schedule switch: the second talk, the TETRA talk, I think gets switched with Aang's talk on SatDump, right?
So that is another satellite-heavy talk, what? Yeah, should be fine. And I'm excited that it actually works out. Yeah. The sign at the room is the wrong one? Oh yeah, we should probably fix it — you're the perfect person. I happen to have some paper. Yeah, yeah, I have paper. And while we're at it — then I'll just... I mean, this will probably mess up the video stream afterwards, because they're going to cut it by the minute, but that's something that we'll arrange in post. Um.
Using GPUs for Real-Time SDR Signal Processing
Real-time processing — a very brief background. So, I'm Sylvain, Foxtrot Four Golf Kilo Romeo, F4GKR — this is my amateur radio call sign. Very briefly about myself, for those who don't know me yet: I'm the founder of a small company in France doing SDR; the name is SDR Technologies. The most important thing here, to introduce the story about GPUs, is the next line: I was working for ONERA, the French aerospace — and, well, military, I would say affiliated to the Ministry of Defense — research lab. And that explains how this started; I will come back to this in a slide. So, very briefly, the outline of the talk: I will explain the motivation, and then try to explain the approach I took when I tested this GPU, and why I had the idea to use it for DDC. And I will take a few minutes to explain the background — not of the code, because it's on GitHub and you can read it, and I'm quite sure you will improve it a lot, no doubt — but just to explain, for those who are not yet familiar with GPUs, why they can be useful, and what kind of things you can do with them, as long as you take the time to write code for them. So, very briefly, the story started a while ago when I was working at ONERA, where we have radars, and I just took some pictures that you may have seen already. One is Nostradamus, an HF over-the-horizon radar, and the other one is GRAVES, very famous. So these two radars were designed and operated not by my team — I was not leading the team — but by the team I was working in. One of the key problems here is that you have a lot of channels; one antenna is one channel, which means that you gather a huge amount of data. And one of the key problems is: how do you process this data in real time? And at that time — I don't remember exactly the year — NVIDIA released the Tegra K1, which was a very small thing, but looking promising, in particular for embedded systems.
So my boss said: can you have a look at this and tell us if it can bring anything to the game? And, to make the story very short, the answer was yes, it's useful — and that made my decision to leave the research team and found my company. So yes, the quick answer is: yes, it works. Okay, so now let's go back to more serious things. This is from the leaflet, I would say, for the Tegra K1 at that time. They were promising something like 326 gigaflops, for five watts, at 99 euros for the devboard. You say: wow, does this really work? And that was the idea — to test whether this can be used for software-defined radio. I'm assuming here that most of you have a very brief, very quick idea of what a GPU is, so I will just take a few seconds to explain. I'm just realizing that if I move to the screen, nobody will see from the remote, I guess. Yeah, okay, I'll try. Sorry. So, just to explain the model: this architecture has two things inside. You have the ARM processor — this CPU, these four cores here — and you have the CUDA cores, 192 of them, next to it. And the good thing is that they share the same memory. If you have a PC, you have your cores, whatever you want, and in one of the slots you have the GPU card, and they have to share data through the PCI bus. In this one, it's a bus, a kind of PCI bus, but you will see that the performance is much more interesting. The second thing is that one core does one simple operation at a time. So in this very simple example, I'm adding: C is equal to A plus B, and the code is just saying, for each CUDA core: take A, take B, make the sum, and store in C. That's pretty simple. So the key point here is that there are three things: one is push the data, the second is push the code, then run the code, and fetch the results. Keep in mind that you have to push the data, and this costs a lot, of course.
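The execution model just described — one core, one simple operation per element — can be sketched as a mental model on the CPU. This is plain Python standing in for a CUDA kernel: each loop iteration represents what one CUDA core does, all of them in parallel on the real hardware.

```python
import numpy as np

def vector_add_kernel(i, A, B, C):
    # What a single CUDA core does: one element, one simple operation.
    C[i] = A[i] + B[i]

A = np.arange(5.0)          # [0, 1, 2, 3, 4]
B = 10.0 * np.arange(5.0)   # [0, 10, 20, 30, 40]
C = np.empty(5)

# On the GPU these iterations run simultaneously, one per core;
# here we just emulate them sequentially.
for i in range(A.size):
    vector_add_kernel(i, A, B, C)

print(C)  # [ 0. 11. 22. 33. 44.]
```

The "push data / push code / run / fetch results" cost model from the talk is exactly what this hides: on a discrete GPU, getting `A` and `B` into device memory would dominate for an operation this cheap.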
So, coming back to our SDR and DSP: what are the things that may need power? Well, just examples. The one I will elaborate on this morning is the DDC, the digital down-converter, but you have many others that I have not yet investigated so much and will not describe this morning — feel free to take a seat, no worries — interpolation, decimation, clock recovery, synchronization, pattern detection, and so on and so on. One of the key issues here is that some algorithms are extremely difficult to run in parallel, while for others it's much simpler. And some of them just don't work in parallel easily. So in this example, let's focus on something simple, which is multiband DDC. We'll assume that we have a wideband signal coming from a wideband SDR, whatever it is. I took an HF example. So here, for example, we have a receiver that is transferring a 50 megasamples-per-second bandwidth to the device memory. And we want to extract small sub-bands from this. Okay, so I took examples of HF bands, one at 7 MHz, another one at 14 MHz, and so on — those are just examples. The core thing is: how do we extract the sub-bands from the single wideband signal? So for one channel, it's pretty easy, and that's the classical stuff. This is a DDC: you basically translate the frequency, then you do some low-pass filtering, and then you throw away all the samples you don't need. That's very classical; I have not invented anything here. And I guess you all know by heart what a low-pass filter is, but let's just take a few seconds to remember how it works. On one hand, you have the input, the samples coming from the SDR. On the other hand, you have the filter you want to apply for the low-pass filtering. And you make a convolution — basically some multiplications and additions — and you retrieve the output. Okay. Now let's look a bit more at my example: how many taps do we really need?
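The three DDC steps just described — translate the frequency, low-pass filter, throw away samples — can be sketched in NumPy. The specific numbers here (a tone at 7.001 MHz, a toy 257-tap filter, decimation by 1000) are illustrative choices, not the speaker's:

```python
import numpy as np

fs = 50e6                      # wideband input rate: 50 Msps
f_c = 7e6                      # sub-band we want (the 7 MHz band)
decim = 1000                   # decimate down to 50 kHz

rng = np.random.default_rng(1)
n = np.arange(200_000)
# Wideband input: a tone at 7.001 MHz plus broadband noise
x = np.exp(2j * np.pi * 7.001e6 / fs * n) + 0.1 * rng.standard_normal(n.size)

# 1) translate the band of interest down to 0 Hz
mixed = x * np.exp(-2j * np.pi * f_c / fs * n)

# 2) low-pass filter (toy-sized windowed-sinc FIR, 257 taps)
taps = 257
t = np.arange(taps) - taps // 2
h = np.sinc(2 * (0.5 / decim) * t) * np.hamming(taps)
h /= h.sum()
filtered = np.convolve(mixed, h, mode="same")

# 3) throw away the samples we no longer need
baseband = filtered[::decim]   # 200 samples left, at 50 kHz

# The tone shows up at 7.001 MHz - 7 MHz = 1 kHz, i.e. FFT bin 4
print(np.argmax(np.abs(np.fft.fft(baseband))))  # 4
```

A 257-tap filter is far too short to be a proper anti-alias filter for a 1000× decimation, which is precisely the point the next section makes about tap counts.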
So, for this example, let's assume that we have 50 megahertz — a 50 megasamples-per-second bandwidth — incoming. And we want three kilohertz, just to extract the audio. This is a fully digital system, so at the end we want audio — plain old voice, someone speaking — and we assume that three kilohertz is enough. There are a lot of different approaches to estimate, as accurately as possible, the number of taps we need. I tried to find an example — I saw plenty, on pages from you, Marcus; I was going to copy and paste some of yours to avoid questions. No, I'm joking, of course. Well, there are many ways to estimate the number of taps, and one of the approaches is this — the harris approximation, sorry. And if you do the calculation, you arrive at 50,500 taps. Okay. 50,500 — so what? Now let's go back to this stuff. To do the convolution with 50,500 taps, you need to do this 50,500 times for each sample. It means that to get one value out of the FIR filter — the low-pass filter — you need to take 50,500 inputs and 50,500 coefficients, do the multiplications, do the sum, and you have one sample. And you have to do this for every incoming sample. That begins to be a huge amount of processing. Of course, you have all experienced many low-cost SDR applications running on low-cost PCs, and they do this in real time. So how do they do it? Of course, there are tricks. The easiest one is to divide by two: instead of going straight from 50 megs to 3 kilohertz, you do this step by step, dividing by two each time. So you take the first band, apply a half-band filter, so you have half the samples, and you repeat this several times. That's very interesting, because each time you remove a lot of samples, and if you do it cleverly, you can have 50% of the coefficients be zero, if you compute the FIR in a good way.
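Returning to the tap count above: fred harris's rule of thumb estimates N ≈ attenuation / (22 · transition/fs). The stopband attenuation (60 dB) and transition width (2.7 kHz) below are assumed values, chosen because they reproduce the quoted figure — the talk doesn't state them:

```python
def harris_taps(fs: float, transition_hz: float, atten_db: float) -> int:
    """fred harris rule of thumb: N ~ atten_dB / (22 * transition/fs)."""
    return round(atten_db / (22 * transition_hz / fs))

# Assumed: 60 dB stopband, ~2.7 kHz transition band, 50 Msps input
print(harris_taps(50e6, 2.7e3, 60))  # 50505 -- the ~50,500 taps quoted
```

The formula makes the core problem obvious: the tap count scales with fs divided by the transition width, so a kHz-scale filter at a 50 Msps rate is inevitably tens of thousands of taps.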
So that saves you a lot of computation. Of course, yes — but this is not ideal, because you will hardly be able to reuse the computation you've made for the other channels. You will greatly reduce the number of calculations you need for one of the channels, but then for the next one you will want to reuse some of the calculations you've made, and that's not easy. So in the end, this doesn't work so well. So, can this stuff help? I just put two examples here. On the top you have the Jetson Xavier NX. I know that at an open-source conference, promoting a brand like NVIDIA is probably not the best idea, but just to make the story short: I have no sponsoring from NVIDIA. Okay, so, just figures and facts: the first one is the Xavier NX, so it's roughly 500 euros, roughly. And this one has 384 cores. And the next in line is the NVIDIA A800, which is not the same price — 20,000 roughly — and has 6,912 cores. Okay, the interesting thing is the two FFT benchmarks put below them. If you look at the Jetson Xavier NX, to perform an FFT of — sorry, I'll say it this way — 2 to the power 19, which is quite a lot, it's 310 microseconds. But if you look at the most expensive one, you have 170 microseconds for 2 to the power 23, which is a huge FFT. A huge FFT. You can do this with an FPGA, but at those sizes it becomes extremely tricky to do. Okay. And for the Xavier NX, you see that if you go up to 2 to the power 23, it's 7 milliseconds. It's quite fast. So how can we use this? If you look back at your DSP lessons, it's pretty simple, in fact. Applying an FIR to a signal is just making a convolution, and for the convolution you can use the FFT. That's well known: you take the input signal, you do the FFT; you take your filter, you do the FFT; and then you make a product of the two vectors. There is a bug on the slide — it should be FFT to the minus 1.
Inverse FFT, yes — and you get your output. So basically you do FFT, multiplication, inverse FFT, and you have your output. That is for one single block, okay? That's quite good, it works well — but this is for a steady signal, not a stream. So if you want to do this for a stream, there is an improved version of this algorithm, which is called overlap-save or overlap-add. I use overlap-save, which is basically sliding a window — sliding blocks, moving the input, doing the computation, and so on and so on, repeating this. The key point here is that you always use the same filter. So you can compute the FFT of the filter once and keep it. And the input, you will see, can be reused several times. So basically, if you do this in the GPU, the performance is quite interesting. And this is what I did, and this is what I'm going to show you here. So this is the architecture of the code I'm proposing. You receive the samples from the SDR. You push the samples into the GPU RAM, okay? Then your code does a first FFT of the incoming block. You assume that you've done, previously at the init, the FFTs for the several filters you want to apply — so here in this example, I have two. You do the complex product — the multiplications — for both, the inverse FFT, and the decimation. And you're done. There is one trick; I will come back to it in a few slides. So basically it means that — sorry, if I go back to this slide, excuse me — you do this FFT, in fact, only once. You reuse it for the different channels you want. You have done the FFT for the filters once. So in practice, for each new incoming block of samples, you have to do one FFT here, multiplications, an inverse FFT, and the decimation. And that can be quite fast. None of this needs data to move from the GPU memory to the main CPU memory. So it's quite fast, in fact. Then, one trick — and why I ended up using CUDA, the proprietary API, and the NVIDIA stuff.
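A CPU sketch of the pipeline just described: one forward FFT per incoming block, reused for every channel, with each channel doing only a spectrum multiply and an inverse FFT. This is a NumPy model of the idea, not the code from the speaker's GitHub; decimation and the final frequency shift are left out for brevity:

```python
import numpy as np

def overlap_save(x, filters, nfft=1024):
    """Run one input stream through several FIR filters, overlap-save
    style. The forward FFT of each block is computed once and shared
    by all channels; the filter FFTs are precomputed at init."""
    ntaps = max(len(h) for h in filters)
    hop = nfft - (ntaps - 1)                    # new samples per block
    H = [np.fft.fft(h, nfft) for h in filters]  # filter FFTs, done once
    outs = [[] for _ in filters]
    buf = np.zeros(nfft, dtype=complex)
    for start in range(0, len(x) - hop + 1, hop):
        buf = np.concatenate([buf[hop:], x[start:start + hop]])
        X = np.fft.fft(buf)                     # ONE FFT, shared
        for k, Hk in enumerate(H):
            y = np.fft.ifft(X * Hk)             # multiply + inverse FFT
            outs[k].append(y[ntaps - 1:])       # drop the wrapped part
    return [np.concatenate(o) for o in outs]

# Sanity check against direct convolution
rng = np.random.default_rng(0)
x = rng.standard_normal(5000) + 1j * rng.standard_normal(5000)
h = np.hamming(33); h /= h.sum()
out, = overlap_save(x, [h], nfft=256)
print(np.allclose(out, np.convolve(x, h)[:out.size]))  # True
```

Adding channels only adds the cheap inner loop (multiply, inverse FFT); the expensive forward FFT of the input stays a fixed cost, which is why this structure maps so well onto the GPU.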
I've heard from guys in this room that you can now do this in OpenCL; I have not tested it, to be honest. One of the tricks is that if you don't pay attention to the scheduling, the different channels — the different kernels — will run in series, in sequence: FFT and so on and so on. So you have to wait for the last block in the sequence of operations to finish before you can retrieve all the samples, and you may end up waiting quite a lot. But if you use this trick — just a compile option, a switch — then the scheduling inside the GPU is different, and everything runs in parallel. And the difference is quite large, quite big, to be honest — it's much faster this way. One last thing: if we only do what I proposed, then you miss the frequency shifting. There is a problem — the output frequency is not the right one. So you need to apply an NCO to shift the signal in frequency. And of course it's much more efficient to do this at the end, because you have fewer samples, so it's much faster. You do the shift at the very end, and you just use the fact that you have some aliasing: the code compensates for the aliasing, and that's the frequency shift at the very end. Just look at the code, it's easier this way. So, what am I proposing this morning? You have on GitHub a lib and an example. It's code that is quite old from me, but it works. And the key thing is that you have to allocate the maximum number of channels you will use at the beginning, basically because it will allocate the RAM in the GPU for the different operations. Then the code is thread-safe: that is to say, you can add, remove, shift, replace, change the number of channels you use, the size of the channels, and so on, in real time. This is CUDA-based. I know that maybe OpenCL could do something — I have not tested that — and I have only tested this with NVIDIA GPUs. So, just to give an example of what you can get with this.
I benchmarked this with two different architectures, the ones I had — but I'm sure I will receive tons of PRs to add new figures to the tables on GitHub, for sure. So, practically speaking, on my machine at home — a, well, average PC with an RTX 2060 — with one single channel: the throughput is just a bench-test code that is pushing data to the GPU, making the computation, and retrieving the samples. So with one channel, it's roughly 600 megasamples per second; with two channels, 530. Okay. Just as a baseline for comparison, with the Jetson Xavier NX, depending on the FFT size, that changes quite a lot, and you can reach up to 156 megasamples per second with one channel — sorry — and 117 with two channels. The filters were 7,200 taps — excuse me, that's the average; you can change this in the code. I'm checking the time, because I know Marcus will kick me out soon. So, one of the interesting things is that if you look at the figures here, you see that the GPU is roughly 80% used; the PCI is at 36%. So there's room for improvement. And if you look at the CPU, one core is at 100% and the others are relaxed. So it means that maybe there's room for much faster, in fact, because we are far from overloading the machine. And in fact, if you look in detail at where the bottleneck is, it appears that the bottleneck is the memory copy: the synchronization between copying the memory from host to device, waiting for the threads to start, waiting for the kernels to stop. All this synchronization takes a lot of time. And if you start to plot this over time — NVIDIA comes with a tool, I don't remember the name, where you can see the different threads in time, how they work — you clearly see that the bottlenecks come from the synchronization with the host, so there's room for improvement, for sure. So if you want to tune this, you will see that, of course, the size of the FFT used has a strong impact on the performance.
But that really depends on the performance of the GPU you're using. As I said, moving the data from host to GPU is extremely expensive. In the example, I was copying from host to device in complex float; I could use complex ints — raw data from the SDR — and there is in the code one example where you convert the int16 to float directly, so it's cheaper: the amount of data you copy from the host to the device is much smaller. And I was using libusb in real life — I mean, not in the example, but in real life — and it's also very expensive. libusb is far from optimal, I would say, rather than just far from optimized. And of course, one of the important things is that the CPU is mostly free — the different cores have room for other things. It means that you can do other tasks, like painting a spectrum on the screen, sending emails, listening to music, whatever you want. I think that's all. Thank you very much. I didn't want to spend too much time, and I'll be happy to reply to questions if you have any. Thank you very much. Yes. Yes, please. You said you did the frequency shift at the very end — is it possible to already do at least a significant part of the frequency shift by just offsetting the FFTs? That's what I do: I rotate the FFT. I rotate, yes. But then you have a remainder, because if you do this, the shift you perform is an integer number of bins. So you need a post fine-tune, and that's exactly this. Yeah, you're right, that's what I'm doing. Yes? You didn't use an IIR, FIR, or CIC filter? Just FIR — yeah, because it's just FFTs and complex products. That was the simplest approach. Thank you. Yeah? Was there any attempt to merge this into GNU Radio? Not yet, to be honest. I'm not good enough in GNU Radio. I had a side discussion with Jean-Michel, and there's a plan to do it. The point is, I mean, I was not able to do it for them.
I don't have enough practice with C++ blocks, so I said, OK, let's do this with the guys who know. So we will come with a proposal. Yes, that's the idea. Typically, the idea would be to have something, if we can do it, that would permit messages to add and remove channels, or tune the channels, in GNU Radio directly. Because one of the points is that you need to define up front how many channels you want to use, and depending on the application, you might need different numbers of channels. That's why I wasn't able to do it. Any other question? From the audience? Yes? Just a small question: you used single-precision floating point? Very good question, in fact. Single precision, except for one thing, the frequency shift. Because in CUDA, the sine and cosine functions are a nightmare: they produce a lot of noise. So in the code it's written: double precision, don't touch this. Because otherwise the noise goes up very quickly. Anything else? OK, thank you very much. So there are more folks pressing in, so if I can ask you to give a little bit more space. You didn't need to kick me out. That's quite fine. Bonjour. Thank you.
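The channelizer recipe from the Q&A above (rotate the FFT by an integer number of bins for the coarse shift, filter by multiplying spectra, then apply a fine mixer for the fractional remainder) can be sketched in NumPy. This is a minimal single-block illustration, not the speaker's CUDA code: the sample rate, tone and filter are invented for the demo, and a real implementation would process consecutive blocks with overlap-save to avoid circular-convolution edge effects.

```python
import numpy as np

def channelize(block, taps, f_shift, fs):
    """Extract one channel: coarse shift by rotating FFT bins, FIR filtering
    as a product in the frequency domain, fine shift by a residual mixer."""
    n = len(block)
    spec = np.fft.fft(block)
    resp = np.fft.fft(taps, n)                # zero-padded filter response
    bins = int(round(f_shift * n / fs))       # coarse shift, whole bins
    y = np.fft.ifft(np.roll(spec, -bins) * resp)
    rem = f_shift - bins * fs / n             # fractional-bin remainder
    return y * np.exp(-2j * np.pi * rem * np.arange(n) / fs)

fs = 1e6                                      # 1 Msps, invented for the demo
t = np.arange(4096) / fs
x = np.exp(2j * np.pi * 123456.0 * t)         # tone at 123.456 kHz
taps = np.hamming(129)
taps /= taps.sum()                            # crude unity-gain low-pass
y = channelize(x, taps, 123456.0, fs)         # tone should land near DC
```

The integer-bin rotation leaves a residual offset of at most half an FFT bin, which is exactly why the post fine-tune mixer is needed.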
Covert Ground Based Synthetic Aperture RADAR using a WiFi emitter and SDR receiver
I'd like to show you a little bit how I'm using software-defined radio, GNU Radio of course, for developing a covert ground-based synthetic aperture radar using Wi-Fi as the radio frequency source. Just so you can see what it looks like by the end of the presentation: this was done with some leftover project funding, so I put the affiliation of the lab. Actually, it's a hobby project, but I had some leftover contract money to develop this thing that I wanted to show you. So what is ground-based synthetic aperture radar? Let's start with the objective, what we want to look at. A radar is a remote sensing measurement technique where you do radio frequency detection and ranging: you would like to see targets. And in the case of GBSAR, it's mostly used for small, minute variations of distances. In this example here (I was lucky enough to visit Professor Sato's laboratory in Sendai), that's one of his setups where he's looking at landslides. And when you're looking at landslides with ground-based synthetic aperture radar, you're using the range information to detect the distance from the SAR measurements, and the lateral resolution is given by the spatial diversity of moving your antennas. It's an active measurement technique. So as opposed to passive remote sensing like optical measurements, photogrammetry, optical satellite imagery, you're not sensitive to lighting conditions, day or night, or cloud cover: you generate the signal that is returned. But unlike laser detection and ranging, you're also not sensitive to weather conditions. Radar works in all weather conditions. That's its beauty. Now, there are some commercial systems; I just took some of the European ones I'm familiar with, the Italian IDS, the Dutch MetaSensing. I don't claim to be competing with these guys. These are 100k-euro units.
I'm not going to show you a 5k-euro device that competes with those. I see this as an educational project, to get familiar with the concepts of SAR, and to do this, well, I wouldn't say legally, but at least without getting caught, by using Wi-Fi signals. So what are the requirements for radar? On the one hand, you want to detect a distance, and range resolution is the inverse of bandwidth. So you need a wide-bandwidth signal, and Wi-Fi is very good for this. Now, there is no fundamental reason why you would get more bandwidth at higher frequency, but it's a fact that technologically it's easier to get more bandwidth at higher frequency. And so I moved to 5.8 GHz Wi-Fi, because there you've got 200 MHz of bandwidth. That's kind of nice, because your range resolution, c/(2B), is going to be sub-meter: you can separate in range two pixels less than a meter apart. And then, because I want a mechanical setup (I showed you in the introduction that we want spatial diversity), we're going to have some moving stuff. And the higher the frequency, the smaller the wavelength; the smaller the wavelength, the smaller the antenna. So it's going to be easier mechanically to move a smaller antenna, hence the move to higher frequency. And the same goes for the rail along which you're moving to get the spatial diversity: the azimuth resolution is given by the wavelength over the rail length, so if you go higher in frequency, the rail can be shorter and a bit cheaper. These are the reasons for moving to higher frequency. So the SAR measurement means you're sampling in the spatial domain the way you would normally sample in the time domain: you're moving the antenna in steps. And I'll show you in the next slide that azimuth compression is actually a Fourier transform, so you're really adding phase each time you move the antennas.
And if you want to match Shannon's sampling theorem, you can show that you must sample every half wavelength, just as you sample at least twice per period in time. And when the transmitter and the receiver are collocated, because they're both moving, it's actually not lambda over 2 but lambda over 4, since you're moving both the transmitter and the receiver. So you need a system that allows you to move your setup in quarter-wavelength steps. And because I want as few sliding contacts as possible (all this electrical stuff that moves has poor contacts), I wanted to put everything on the moving part. So everything that is moving is the Wi-Fi dongle as transmitter and a B210 SDR as receiver. But an important point here is that you need a dual-channel coherent receiver, because you don't know what the Wi-Fi is streaming. It's streaming a broadband signal, but you don't know what it is; for me, it's noise. And if I'm sending noise, I need to record the reference signal, and on the receiving antenna I will look at time-delayed copies of this transmitted signal. That's your basic passive radar measurement. And this is all running on a Raspberry Pi, at the moment a Raspberry Pi 4. It's running Buildroot, running GNU Radio, and I'm streaming over ZeroMQ to the processing PC; that's what we showed a few years ago. So this is the final setup. I took some commercial antennas here. You want it to be a bit directional so that you can get some bigger range. And this is why I'm saying it's not completely legal: I'm allowed to send the 10 dBm of a Wi-Fi transmitter, but of course that's isotropic radiated power, and here I'm focusing with a 20 dBi gain antenna. Let's forget about this; no one's going to notice. And we do the same on the receiving side. So you see here the rail, everything that's moving, the transmitting and receiving antennas. The Raspberry Pi is over here, the B210 is over here. So everything that's moving is there, trailed by the cables.
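The numbers quoted so far (range resolution c/(2B) with 200 MHz of bandwidth, quarter-wavelength antenna steps because both antennas move) are easy to check; the rail length below is a made-up value just to illustrate the azimuth-resolution formula from the previous slide.

```python
c = 299_792_458.0      # speed of light, m/s

B = 200e6              # usable bandwidth across the 5.8 GHz Wi-Fi channels, Hz
f = 5.8e9              # carrier frequency, Hz
L = 1.5                # rail length in metres (made up for this example)

wavelength = c / f                 # about 5.2 cm
range_res = c / (2 * B)            # about 0.75 m, the figure quoted in the talk
step = wavelength / 4              # antenna step: lambda/4, not lambda/2,
                                   # because transmitter and receiver both move
azimuth_res_rad = wavelength / L   # azimuth resolution, in radians

print(f"step = {step * 1000:.1f} mm, range resolution = {range_res:.2f} m")
```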
And then I'm transmitting. Here I'm transmitting over Ethernet, but the stream could go over a 2.4 GHz Wi-Fi link. Now, about doing Wi-Fi measurements: if you walked in the garden just outside here yesterday, you would have seen this poster. And I was reading it; for those of you who read French, there's a PhD student from Brussels using Wi-Fi for what he calls crowd safety. I call it crowd control, but he's a PhD student, so he's still optimistic. And of course, on using Wi-Fi, MIT is very good at advertising what they're doing: MIT has been doing through-the-wall Wi-Fi measurements for a long time. So Wi-Fi sensing is not new; I'm just trying to show you here how to make an educational system. The principle is that we continuously broadcast Wi-Fi. You could be streaming a very big movie, or you can take packetspammer; this is what I'm doing. Packetspammer will just keep on sending packets over time, and you have this non-cooperative source sending a signal. And because it's non-cooperative, it might be that sometimes, because you cannot squeeze too many packets into a second, you'll have some gaps. So you just have to detect the gaps, throw those parts away, and collect enough data in between. Now, we've just seen the presentation by Sylvain about GPUs. And coming to the correlation: when you're doing correlation, you're looking at the time-delayed copy of your signal. And you might think, he's talking about correlation while Sylvain was talking about convolution. The relationship between convolution and correlation is just that you flip the time argument: convolution is tau minus t, correlation is t plus tau. And when you flip the time, you take the complex conjugate. So you see, it's exactly what Sylvain said.
You take the IFFT of the Fourier transform of the surveillance signal times the complex conjugate of the Fourier transform of the reference signal; the complex conjugate is what takes you from convolution to correlation. And the problem with this is that if your filter has some ripples on your reference measurement or on your surveillance measurement, you will multiply the ripples, because you're multiplying the amplitudes. What's really important in correlating is that you want the phases to subtract: if the signals come from the same direction, they have the same phase, and the difference is zero. So you want to subtract the phases. And actually, instead of the analytical formula of multiplying the Fourier transforms, you can take the ratio of the Fourier transforms, which gives the same thing for the phase (taking the inverse takes the negative phase) but cancels the amplitude fluctuations. So that's what I do at the end of the day: I take the inverse Fourier transform of a ratio of Fourier transforms. Now, each Wi-Fi channel bandwidth is 20 MHz. And 20 MHz is, on the one hand, more than I can stream from my B210 to the Raspberry Pi 4; and secondly, I told you there's 200 MHz available in Wi-Fi, and we don't want to be using just the 20 MHz of one channel. So, if you look at the allocation of frequencies, Wi-Fi is very broad. It starts at 5.4 GHz. Actually, you should avoid 5.4 GHz: that's the C-band radar band, also called the military G-band, so you would like to avoid that kind of frequency. And C-band is also Sentinel-1; we don't want to be jamming Sentinel-1. So we start working above the C-band radar. We have all these channels here, and what you do is what is called frequency stacking: you reprogram your Wi-Fi dongle to jump from one channel to the other, and you just keep on sweeping.
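The range-compression step just described, taking the inverse FFT of the ratio of the surveillance and reference spectra instead of the usual cross-correlation product, can be sketched with simulated data. The noise-like stand-in for the unknown Wi-Fi waveform, the echo delay, and the zero-division guard are all assumptions for illustration, not the talk's actual data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
# The Wi-Fi waveform is unknown to the radar, so model it as complex noise.
ref = rng.standard_normal(n) + 1j * rng.standard_normal(n)
delay = 37                            # echo delay in samples (invented)
surv = 0.5 * np.roll(ref, delay)      # surveillance: attenuated, delayed copy

R = np.fft.fft(ref)
S = np.fft.fft(surv)

# Classic cross-correlation: IFFT(S * conj(R)). Amplitude ripple multiplies.
xcorr = np.fft.ifft(S * np.conj(R))

# Spectral-ratio variant: IFFT(S / R). Same phase difference, but the
# reference amplitude (and its ripple) cancels. Guard near-zero bins,
# as the talk warns, so you never divide by zero.
ratio = np.where(np.abs(R) > 1e-12, S / R, 0.0)
rc = np.fft.ifft(ratio)

peak = int(np.argmax(np.abs(rc)))     # the echo shows up at its delay
```

Both routes put the echo at the right lag; the ratio route additionally flattens the reference amplitude, which is the point made above.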
So this was done with Spectralizer: you see here how each one of these channels is being broadcast, and I can check that this is indeed working. So for each channel, I reprogram the dongle, I stream the data over ZeroMQ, and I record the data; when I have collected the number of samples I wanted, I reprogram the next channel, and I keep on looping like this. At the moment, all the FFTs are done offline; everything I'm showing here is post-processing. I showed you a very fast movie, because a full measurement takes 15 minutes, and if I had run the movie in real time, my time would have been exhausted by the time I finished the introduction slide. So a full measurement takes 15 minutes, and processing the full data takes about an hour, because I'm not using a GPU; this is all CPU post-processing. But one thing that I would love to see: we've seen very fancy GPUs here, and I just got two Raspberry Pi 5s, and I'm told it will be documented how to use the GPU of the Raspberry Pi 5 to do some signal processing. That would be really beautiful to do; at the moment, it's beyond what I can do. So this is actual experimental data. Each one of the colors is a spectrum collected by the B210, and you see my frequency stacking, which allows me to span the 200 MHz of Wi-Fi. Be careful that there are some gaps, I think it's these guys here. So when you do the ratio, just make sure that you set those values to NaN so that you don't divide by 0; otherwise the calculation is going to be unhappy. Just a little side note: when I bought this rail... usually I try to do some hack where I find what's in the lab and assemble it, but this time I had a bit of money left, so I bought a real rail. And I learned, I discovered, that all these industrial controls, the programmable logic controllers, run on 24 volts. That is very standard.
And your Raspberry Pi, of course, has 3.3-volt GPIOs. So you will need some voltage conversion: that's your legacy ULN2803, the open-collector Darlington transistor array that will convert the 3.3-volt input into 24 volts. And the other thing that's kind of funny for us is that in industrial control, they don't want you to do anything you want with the rail, because if you misbehave, your rail might go off. So you're actually not allowed to program the position freely: you have to pre-program a set of positions where your rail can go, and then you say, I want you to go to position 1, 2, 3, 4, and so on. This, of course, is proprietary software from the rail manufacturer, but it does run on Wine. So it's not open source, but you can do it. So this is what it looks like on the moving part: you've got the Raspberry Pi with the 24-volt controller over here. OK, having said that, what you collect, for each antenna position, is all the spectra in the frequency domain. Once you've got all the antenna positions and all the frequencies on the reference channel and on the surveillance channel, so two 2D matrices, you cross-correlate them. You end up with one 2D matrix, because you've correlated these two guys: you've got the antenna position on the x-axis, and you've got the time domain on the y-axis, because you inverse Fourier transform along that axis. This is before azimuth compression. Then you do your azimuth compression by taking the FFT in the other direction, along the antenna positions. And then there's the part that I'm not completely used to: you get sine of theta, and you want a range-azimuth position. My colleague Weike Feng, from Air Force Engineering University in Xi'an, gave me the algorithm for reprojecting the sine-theta/range map to range-azimuth positions. And once you get these maps, well, the really beautiful thing is that there is no degree of freedom.
If you know how you moved the antenna and you know the frequency steps you used, you cannot cheat with the results: you've got an x and y position that is fully determined by your data acquisition conditions. So here is one example from our lab. This is the rail, this is the antenna. You've got this round circular building, which is over here; you've got the portal, which is over here; and you've got the university housing, which is over here. So there is no degree of freedom other than positioning the radar at the focal point, and this I know: I know where I'm located. The only degree of freedom is the azimuth; you can tune the picture so that it fits. In this case, I thresholded the backscatter to make it transparent where there's no return. So this is the other side. This is close range, this is further away; we're looking at the opposite side. You've got this building, which is over here, and you've got this container, which is over here at near range. Again, no degree of freedom. And then there is this reflection, and you might tell me: how can you get a reflection when there's just a field over here? Well, that image was taken this summer, when Google Maps had not yet updated their imagery, because this building was indeed built since then. So this is one example where we actually get reflections up to 500 meters; this building here is giving us something, and this range here is 500 meters. So it's working; at least you can see things with it. Then you might ask: is this reproducible? Last weekend I said, OK, like any open source project, you put it on GitHub, you say, trust me, it's working, and six months later it's all broken because all the libraries have changed and nothing works anymore. So last weekend I said, let's take everything out and check if it's still working. And it is working again. So here you've got the x-y map, which I project over Google Maps. And the nice thing is that Google Maps updated their database.
So now the hotel is over here. And here you've got the reflection far away, and you've got something here. So you might say, wow, I get something even further than 500 meters, and it's reproducible: I took a second image over here and you get twice the same image. Don't be fooled. If you change the orientation of the radar and look a bit to the right, you'd think the reflection is still over here, but this is your ambiguity function. The ambiguity function is the autocorrelation: you check whether there is some self-similarity in the signal. And obviously, OFDM Wi-Fi does have some self-similarity; there is a repeated pattern every 1.5 microseconds or something like this, I don't have the details. So be very careful when you're using a non-dedicated radar signal: check the ambiguity function, because the signal might create its own repetitions, which are not targets, just because the signal has some structure. Wi-Fi looks like noise, except when the OFDM structure repeats, or something like this. But still, you see that this guy, for example, is a real target: if I move the radar in azimuth, you do see the target at the same location. So I'm not completely lying here. And finally, I was wondering: why is this reflection so powerful? How come there is one building at 500 meters that is sending back this echo? So I went to see. I walked around and I took this picture. And what you see here: they've got windows, but as a sun shade they put something that looks very much like a corner reflector. If you remember what a corner reflector is, it's three planes at right angles. And actually, architects of modern buildings seem to love corner reflectors: look at modern architecture, you've got right-angled corners everywhere. That's very good for radar. So this is actually why this building in particular returns such a good signal.
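The ambiguity-function check described above, taking the autocorrelation of the transmitted waveform to look for self-similarity, can be sketched with a toy periodic waveform. The 256-sample repetition period is an arbitrary stand-in for the periodic OFDM structure mentioned in the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
block = rng.standard_normal(256) + 1j * rng.standard_normal(256)
sig = np.tile(block, 16)    # waveform repeating every 256 samples, a toy
                            # stand-in for the periodic structure of OFDM

F = np.fft.fft(sig)
acorr = np.abs(np.fft.ifft(F * np.conj(F)))   # circular autocorrelation
acorr /= acorr[0]                             # normalise to the zero lag

# Besides the true zero-lag peak, full-height peaks appear at every
# multiple of the period: in a passive radar map they would masquerade
# as phantom targets, exactly the trap described above.
phantom = acorr[256]
```

A phantom peak as tall as the zero lag is what the azimuth-rotation test in the talk is designed to unmask: a real target stays put, an ambiguity artifact does not.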
Finally, I told you that the range resolution is only of the order of a meter, 75 centimeters here with 200 MHz bandwidth, and we want to detect landslides with sub-centimeter displacements. The classical method is to do interferometric measurements, InSAR. In InSAR, you don't only look at the magnitude of the returned signal but also at the phase. And the phase is ambiguous, because you've got 2-pi phase rotations, so you don't know how far the landslide is; but that you don't care about, because you get it from the range resolution. By looking at the phase, you can get your distance variation, which is half the wavelength times the phase rotation over 2 pi. The only subtlety is that because it's a radar, it's half the wavelength, since you've got a two-way trip. And so, basically, what I did is I took all the strong reflections. The pink here is misleading: that is not a value, it's not-a-number. And I took the average and the standard deviation of all these guys, and you see that the mean value is within 1 millimeter. So you do get a millimeter on the mean value, with 1.5 millimeters standard deviation. So I claim this to be 0 plus or minus 1.5 millimeters, which is probably not state of the art, but this is just educational, so I'm still quite pleased that it works. And if you saw some of my previous presentations: I tried to show it with a corner reflector, and it didn't quite work here, but if you move a corner reflector in steps of 5 millimeters, you do see it. So the phase analysis is working as well. So, to conclude this presentation, I wanted to share with you how you can use affordable hardware for running a ground-based synthetic aperture radar, especially as an educational tool, using a commercial off-the-shelf Wi-Fi emitter, in this case as a cooperative source, because I'm broadcasting the signal myself. And I think it's a great opportunity to get started with this kind of digital signal processing.
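The interferometric formula quoted above, displacement equal to half the wavelength times the phase rotation over 2*pi (half, because of the two-way path), is a one-liner; the 5 mm corner-reflector step is the example from the talk.

```python
import numpy as np

c = 299_792_458.0
wavelength = c / 5.8e9          # about 5.2 cm at 5.8 GHz

def displacement(phase_rot):
    """Line-of-sight displacement for a given interferometric phase change.
    Half the wavelength, not the full one, because the radar path is
    two-way; only unambiguous modulo lambda/2, since phase wraps at 2*pi."""
    return (wavelength / 2) * phase_rot / (2 * np.pi)

# The 5 mm corner-reflector step from the talk maps to a clearly
# measurable phase rotation of roughly 1.2 radians:
phase = 2 * np.pi * 0.005 / (wavelength / 2)
recovered = displacement(phase)         # back to 5 mm
```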
Now, just to give you an idea of the budget, because I told you I had some leftover budget from a former contract that I had to spend by the end of the year, so I bought all this hardware. The antennas are 1,000 euros for transmitter and receiver; actually, no, not two times, stupid me, it's one transmitter and one receiver, of course; sorry, a pair of transmit and receive antennas and the accessories for mounting them. You've got the rail, which is by far the most expensive part. But you do need the accuracy of the rail: the repeatability of the rail is what gives you your ability to do InSAR. If you've got a shoddy setup with an uncertainty of 5 millimeters on the position, well, 5 millimeters with respect to a half wavelength of about 2.6 centimeters is significant, and this will blur your image. So that's where I wanted to spend a bit of money to have good quality. These guys claim 100-micron reproducibility, so sub-tenth-of-a-millimeter, which I think is really good. And it's kind of easy to use. You've got your Wi-Fi dongle, you've got the passive RF parts, and you've got the Raspberry Pi; these are all easy to find. And the B210: I had leftovers, I think I have a dozen B210s in the lab, so I just took one of the leftover B210s. And as I was preparing this talk, I wanted to share with you the fact that everyone could do it, and in the end it's a 7,000-euro project, and I'm not sure everyone wants to spend 7,000 euros. And you do see that the most expensive part here is the B210. So I checked, and I have quotations from the beginning of last year saying that the B210 was 1,400 euros. In January 2024, it's now 2,100 euros. So I'm sorry for NI, but I'm not going to advertise the B210, because this is really too much of a price hike. You do have the Pluto+, with two channels, which I can get on AliExpress for 300 euros, and it's the same AD936x or something, and it has an Ethernet output.
And when you've got all these moving parts: if you ever did USB on moving parts, USB is the worst connector you want on a moving part. Ethernet, at least, you plug it in and it stays there. So yeah, unfortunately, I wanted to demonstrate this for this presentation, but my Pluto+ is still in the mail. So I cannot demonstrate that the noise level is the same, that the communication capability is the same, that it runs flawlessly on the Raspberry Pi. That will be for next time. But you would save 800 euros on this budget, and then it's a 5,000-euro project that I'm showing you here. You can find the repository with all the processing on GitHub. Hopefully I documented everything; if you wish to reproduce it and you're missing information, feel free to reach out, I'll be happy to fill in any missing information. Be aware that if you want to use different hardware, running packetspammer requires what is called promiscuous mode, and not all chipsets support promiscuous mode. Furthermore, be aware that the chipset of this particular board is not in the mainline Linux kernel, so you will need to recompile the kernel; and if you're cross-compiling for Buildroot, you need to know how to cross-compile your kernel module. And finally, this was all done with your taxpayers' money, so: public money, public code. Thank you for supporting our research, and thanks to my colleagues from the mechanical workshop who did a very good job in assembling these antennas. And with this, I thank you for your attention. And I even have one and a half minutes for questions. [An audience question, partly inaudible, about tuning the gaps of radio silence between packetspammer packets.] The question is: how do I tune the silence in packetspammer? Actually, I did the exact opposite. I wanted to have the packets as close to each other as possible, so I have as few gaps as possible.
But when I put too small a value, if you ask packetspammer to send a new packet while the previous one is still being broadcast, it sends back an error message and the Wi-Fi dongle becomes very unhappy. So I was conservative and put in some excess delay. Not that I wanted genuine Wi-Fi users to still have their connection, that I didn't really care about, but I didn't want my Wi-Fi dongle to crash. So I put in some additional time delay, but not too much, so that I'm not wasting too much time. The reason this measurement takes 15 minutes is really the data collection: I'm taking something like 100,000 samples per position, per spectrum, and the collection of the data and getting rid of the silences is why it takes so much time. If you look at commercial GBSAR, they advertise a one-second measurement duration. And there's another reason I didn't mention: power consumption. GBSARs are usually installed in remote locations, and of course, the longer the measurement takes, the more power you draw. I made a power budget for this device: it's 25 watts. So whether you draw 25 watts for 15 minutes or 25 watts for one second is going to completely change the life expectancy of your battery. So if I had to work on something now, it would really be making it faster, so that it can run on a battery or a solar panel and the energy consumption of each measurement is much lower. So the initial question about the gaps in packetspammer: it's just to not crash the Wi-Fi dongle. Have you considered using rails from 3D printers? Because they are usually cheaper and still have very nice precision. Which part could I take from those? The rails, like in 3D printers, which can give very precise movement and speed. So the question is which parts could be made with 3D-printer hardware. The problem here, and I did not put the weight estimate, is that I think the two antennas plus the hardware setup weigh something like 1.2, 1.5 kilos.
And that's really the challenge in getting a nice mechanical setup: you do see that there is a bit of hardware there, and when you want to move it stably and reproducibly... so I went for a fancy rail. Also, I wanted it to be fast, because my previous setup was a screw-driven rail, and it would take like 10, 15 seconds to go from one position to another; just the time to move would add up to something like five, six minutes in the measurement. This one can move in a fraction of a second from one position to another. There are many solutions you could go for. There are also those rails for photographers who want to do time lapses, where they move a camera. Yeah, I didn't trust those, so I went for the more expensive option. But yes, there are many solutions you could go for to get a cheaper setup. So, thank you so much.
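The power-budget remark from the Q&A above (25 W, with a 15-minute sweep versus a one-second commercial acquisition) is easy to quantify; the 100 Wh battery capacity is a made-up figure for illustration.

```python
power_w = 25.0                        # measured power budget of the setup, W

wh_per_sweep = power_w * 15 * 60 / 3600    # one 15-minute measurement: 6.25 Wh
wh_per_second = power_w * 1 / 3600         # a one-second acquisition: ~7 mWh

# With a hypothetical 100 Wh battery you get only 16 slow sweeps, versus
# thousands of one-second acquisitions: hence the push to speed things up
# for battery or solar operation in remote locations.
sweeps_per_battery = 100.0 / wh_per_sweep
```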
Design of a follow-up QO-100 payload
Good morning, all. My name is Frank. I'm working in the satellite communications group of ESA, the European Space Agency, at the technical centre. And we have a presentation here that is not extremely technical; it just explains a few initiatives we are embarking on, in the area of possible future amateur satellite payloads hosted on satellites for experiments. It's maybe not so well known, but we work quite a lot in commercial satellite communications with companies in Europe, and we finance and co-finance various projects. But it should not be forgotten, we think, that many of the innovations that came to the world of satellite communications actually come, very often, from the amateur satellite communication world. A lot of work has been done there that has now spun off into commercial applications. And we give here a few examples of things where the amateur satellite community was first. They flew CMOS chips, for example, for the first time in the world on their satellites. They are the ones who made maybe the first inter-satellite links. And there have also been companies, for example SSTL, that started by building a few CubeSats at Surrey University and slowly became, basically, a larger satellite company. So that is all heritage from the amateur satellite world. Also quite interesting is that the amateur satellite world flew the first GPS receiver at very high altitudes, even up to highly elliptical orbit. So these are all, I think, quite nice achievements of the amateur world. On the ESA side, we would like to support initiatives that at least think about future amateur satellite payloads on future satellites, and we will explain a few things on that. And that is specifically related to the payload which is currently on a geostationary satellite, called QO-100.
Maybe not everybody is familiar with that, but it's a very nice payload which, for the first time, is hosted on a geostationary satellite. So it's always above you. You can find excellent videos explaining how to work with that satellite and how to build up low-cost communication over it. Let me explain, just to give a quick idea. This is the footprint of that payload, which is on a geostationary satellite, so it travels at the same speed as the Earth rotates; let's say it's virtually always above you. And this is a very large commercial satellite, hundreds of millions, but on it there is a small payload that can handle amateur communication in S-band and the lower Ku-band. And the beauty of this, I think, is that it is enabled by the reuse of existing COTS hardware: existing 2.4 GHz amplifiers, and the modification of the low-noise blocks that you use for normal satellite television, which cost, let's say, maybe 10 euros or even less; you modify them and then you can use this satellite. So with relatively low cost you can communicate through the satellite. And for this satellite payload, it was in particular German amateur radio and also UK amateur enthusiasts who were instrumental in getting a community working over this payload on Es'hail-2, which is the name of the commercial satellite. This was the first time that radio amateurs were able to have, let's say, more continuous communication with each other over satellite. And you can see here the footprint. Take the green and the red lines there; those are linked to what elevation you need for the antenna. But this goes from Brazil to Indonesia. So, in a single hop, basically, a user in Brazil could communicate with a user in Indonesia. So there is enormous potential there for all kinds of experiments.
And this has also led to more broadcasting using standards like DVB-S2, which is very active at the moment on QO-100, where new technologies went into the amateur domain, and the amateur domain is now making very nice open source implementations of the MiniTiouner and all kinds of DVB-S2 equipment. That is something we would like to support. But how do we support that? Now, a longer time ago (these processes do not go that fast, unfortunately), a letter was written to us with the IARU, the International Amateur Radio Union, where, I think, Sylvain is one of the bosses; it is divided into various regions of the world. We, let's say, stimulated the letter, and the IARU basically asked: ESA, could you not help? We need to think at least of a possible follow-up to QO-100. As you probably know, we are publicly funded, so the various countries want to have their say. Everybody had their say and said, that's okay, here you have some funding. So we have funding to start that process, and that funding is meant to collect requirements and also, maybe, to make a few prototypes. It will basically not be enough to host a payload on a satellite; that will not be possible. We will have to look for other funding mechanisms for that later, but we'll come to that. So what we will be doing is to identify requirements from all the people in Europe and Canada (Canada is one of our member states): what would be good requirements to fulfil for a next geostationary payload? And regarding the orbit, geostationary or other orbits: we have heard that some people would be quite interested in exploring maybe a payload in medium Earth orbit. You can imagine that there you have a longer contact time, and it still has a bit of a global attractiveness. We are considering that, because there might be various institutional initiatives starting soon where there could be hosting opportunities for small payloads in medium Earth orbit.
So we will be consulting further with the amateur community; that process will start very soon. We have already requested a few inputs, and we still have to process all that, and we will then talk to the various satellite operators to see how we could accommodate a payload, and how to get the funding for that. The first idea we have already heard is that a few people would be very interested in, let's say, keeping it simple. Actually, the payload which is currently on Es'hail 2 — it is fantastic that it's there, but functionally you could say it's rather simple. It's an analog transponder: what goes up, in whatever modulation you use, comes down. And many people like that, because it means a lot of experimenting, maybe at the modulation level, more down at the deep RF level. There is also a whole community that comes more from, yeah, that has been raised with SDRs, let's say, starting more at that level. And there is a whole community in the amateur world that is working more at, let's say, the IP level and even higher. And we have to find a bit of a balance, from maybe simple payload designs to maybe something which is really more complicated. And you can imagine also that in the amateur world there are communities going up and up and up in the frequency range. In the amateur community we have 24 gigahertz, we have up to 77 gigahertz, which we could all use for satellite. But you can imagine that going higher up in frequency also means possibly narrower beams. And on the other hand you would also like the satellite community to be served, let's say, on a larger scale. If we have one very nice spot beam in E-band, let's say at 76, 77 gigahertz, yeah, that will serve probably one country, and that is not so, I think, inclusive, I would say. So there are a few balances to be made there. But it would be quite nice in some way to come to a combination, as some people have already suggested, where we have an analog transponder.
And actually what you would maybe like is to have in geo orbit basically the ultimate Linux brick with everything around it, and then everybody can do what they want. The only disadvantage here, again, is that if we put something like this on board, then we get a certain degree of centralization. You basically need a sysadmin for this satellite. And that is not always to the liking of the amateur community. One sometimes likes a certain degree of chaos, let's say, and anarchism and so on — it should not be too regulated. So that is also a bit of a balance to be made. But we have various ideas to put it also a bit more in the 5G area, where maybe the CU/DU, certain splits in the whole 5G architecture, could be partly put on board. Because many, many people, even in the amateur world, are starting to look also at various communications based on 5G NTN, non-terrestrial networks. So there are various trade-offs to be made. We listed a few of those here, and we will now start a larger consultation on all those topics. Back again — let's pick a few, let's say the attractiveness of future user terminals. Like the example here previously of the ground-based SAAG, 6,000 euros, in this case with the opportunity of some taxpayers' money. What is acceptable later on for a radio amateur? If we go to 77 gigahertz, maybe we can use automotive radar hacks and so on. That all needs to come together. So we would like to request input later on from the amateur community in a more structured way, but also taking into account all these factors, because there is no sense in proposing something that not a lot of amateurs can benefit from. So that we are currently starting, and we will show you a little bit of what we will be doing in the next months. I will not go into detail on the planning, but what we are now going to do — we are already talking in a very small group to get a bit of a sense of what we should do, and also, in particular, what ESA should not do.
Because some things are far better left to the amateur community. We are preparing a bit of a consultation: we talk to the amateur community, and we also talk to a number of, let's say, people who would likely build such a payload. However much fun that would be, it is not so likely that the amateur community would build such a payload themselves. A geostationary operator with a 300 million satellite would like to know what he is hosting as the few kilos extra. And it is not so likely that he will accept that it is built by amateurs. However good they are — with all respect — he will not accept it, and his insurance company will not accept it. And what we would organize in May is a day at ESTEC, also with support from our technical people, to discuss a few options. Then start prototyping — we have the funding to prototype a bit what some people call a flatsat. So it's the model of the satellite, but you basically put it on a table. And we would actually like to have a few ideas ready in September. In September there is always a very large satellite conference where all the satellite operators are, and we are making appointments there. And we will also pitch this to satellite operators as, let's say, a good thing. Many people in the satellite world complain about the lack of people who understand RF. That is a real lack. There are a lot of people on the programmatic side, but not so many people who can really understand RF. And I also think satellite operators, and maybe the industry too, could take a bit of responsibility to stimulate young people to start to understand satellite communications. That is at least also one of our objectives: to get more people enthusiastic about satellite communications. So we hope to advertise that in May. May or June — we will see a little bit, depending on availability — to organize a day to go through a number of payload designs.
And we're also trying to get some travel reimbursements and so on in place so that people can come to us. And hopefully in September we discuss a few things with satellite operators, and even better, we hope that maybe the outcome of such a discussion could be discussed at the next FOSDEM, hopefully next year. I think that's it from our side. You'll hear more from us. And as said, it's not such a technical talk — it's a process we're starting. All your technical inputs, from the AMSATs, the amateur satellite organizations in the various countries, which we are already approaching, but also individually, would be highly appreciated. That's it. Thank you very much. Thank you, Frank. I have one or two questions. Please. Is there any phased array or beam steering, or is it way too expensive for such channels? Yeah, that would indeed be very nice. Of course, if we look at — let's take first a phased array on board. That will then be highly dependent on the frequency range. Let us assume it would be an E-band, you know, 77 gigahertz amateur phased array. Yeah, that would be a fairly expensive thing. But I think we are also there to see where certain developments could maybe spin off further into industrial developments. So this would be good — but also, a phased array needs a type of management of that beam. That comes with it. And of course, it's still a challenge in development on the ground. Maybe if we take now the scenario of a medium Earth orbit, where you would need pointing — then I can see, also from the amateur community, the various YouTube videos that appear with educational Pluto beam steering and things like that. There I see a lot of opportunities to do beam forming, to educate people on the essentials of beam forming, with maybe an existing beacon that comes from the MEO. I think that would be excellent to do. Yeah. Oh, please. Supporting Canada. Sorry, support what?
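To make the beam-steering discussion concrete: for a uniform linear phased array, steering the main beam off boresight only requires applying a progressive phase shift between adjacent elements. The sketch below (not from the talk; the array size and frequency are illustrative assumptions) computes those per-element phases:

```python
import math

def steering_phases(n_elements, spacing_m, freq_hz, steer_deg):
    """Per-element phase shifts (radians) for a uniform linear array.

    Steering the main beam to steer_deg off boresight requires a
    progressive phase of 2*pi*d*sin(theta)/lambda between elements.
    """
    c = 299_792_458.0                 # speed of light, m/s
    lam = c / freq_hz                 # wavelength
    dphi = 2 * math.pi * spacing_m * math.sin(math.radians(steer_deg)) / lam
    return [(n * dphi) % (2 * math.pi) for n in range(n_elements)]

# Hypothetical 8-element array at 2.4 GHz with half-wavelength spacing,
# steered 20 degrees off boresight
lam = 299_792_458.0 / 2.4e9
phases = steering_phases(8, lam / 2, 2.4e9, steer_deg=20.0)
```

With half-wavelength spacing the element-to-element phase step simplifies to pi*sin(theta), which is why half-lambda spacing is such a common design point.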
Support Canada and obviously those elements in there. Yeah. Okay. The question was about the geographical thing. Canada is in — it's also part of our, let's say, ESA member states. Let's say Canada funds ESA. So we are interested to include the Canadian footprint. We have already received a bit of input on that, but we have to see, because you can imagine that the orbital location of such a geostationary satellite is not always — yeah, we are not the ones picking the orbital location. Therefore, in that respect, a medium Earth orbit hosting would of course be preferential, but again, a trade-off. We are not — yeah, we can't decide ourselves. Sorry. Exactly. Yeah. Within ESA there are a number of geostationary projects ongoing. We are, let's say, trying to see whether we could, with some leverage, maybe host something in the future there. Please. So one question would be: if you're talking about payload providers that are not amateur radio, are you talking about universities or only commercial partners? And then, is there a project that is not an M-class ESA mission — also CubeSats that you would consider, not from the amateur radio side, but let's say more from a university — where you could also see this situation? Yeah, on the payload provider: the question is whether the payload provider could maybe also be a university. Now, I must say we have seen universities providing payloads to various missions, though not always to commercial missions. And that is, I think, where the satellite operator will always have, let's say, the last word, the last say, because that links to the insurance and things like that. So unclear at the moment, I would say. Then secondly, whether a payload could also be hosted on more educational missions and so on — I think that is an option.
The only thing is that there are already quite a lot of amateur payloads hosted on various OSCARs, so there is nothing new on low Earth orbit satellites and CubeSats and so on. The essence would be either to do something new in medium Earth orbit, or also to advance the payload technology a bit and what you can do with it. That would be the idea. Perfect. Perfect. Would the PROBA missions also be interesting for this kind of application? The PROBA platform itself, indeed. Yeah, I do not know whether the current PROBA — one of the, let's say, satellites that is used in various scientific missions — whether some of the orbits are always, let's say, appropriate. That is, of course, to be seen. From a platform point of view, I see no problem. Why not use that? Yeah. If there are no more questions — thanks again. Thank you.
An open source digital radio protocol for amateur radio
Hi everyone. Maybe before we get started: how many of you know about ham radio? This is kind of the topic of the room, I know, but still. Okay. And how many of you are ham radio licensed? Nice. Okay, good. I still have included a small introduction. Please put up your hands again, licensed operators, please. All right. So I still have a brief presentation and introduction to the topic for you. Your experience with ham radio might not be my experience, so I think this introduction is interesting. And then we will have a brief overview of what ham radio and open source mean. Not everybody understands open source the same way, especially, I think, in the ham radio community. You will see that open source in ham radio did face, and does face, a few obstacles. We will pinpoint a few of those. We will see the workarounds. And then finally, we will talk about M17, which is the project that I want to talk about today. So first, who am I? I'm a research engineer at the University of Liège in Belgium. I do mainly embedded systems and RF. I've been a licensed ham radio operator for two years now, callsign ON4MOD. I joined the M17 project one year ago, right after FOSDEM. Wow. And yeah, I mostly do hardware design, some of which you can see on the table in front of you — we will come back to that later — and work on firmware. Okay. Amateur radio. I think almost everybody knows this logo for ham radio. This is a technical hobby. The goal is to experiment, to play around, to get your hands dirty. It allows you to legally transmit on certain frequencies which are allocated to amateur radio operations, and which you cannot use if you don't have your license, of course. The hobby is extremely vast. So you have operators who do what we call DX, which is reaching the furthest away on the globe using the lowest power, or specific modes, frequencies, whatever — which is called working DX.
You have people who are dedicated to antennas, transceivers, reception, transmission, whatever. It's very, very, very vast. And I think most of you also know that the mainstream products come from just a few brands — ICOM, Yaesu, Kenwood — and then that's pretty much it. And you have the Chinese brands, and usually your typical average-Joe ham radio operator doesn't know about those Chinese brands. So, open source in amateur radio. Well, this is a bit controversial, a bit difficult to describe, but the ham spirit, which lives in every one of us, has always been about sharing designs, ideas, discoveries, the problems we encountered, and how we solved them. You could call that open source knowledge, maybe. This is not to be confused with the fact that most digital voice protocols we use have published specifications, which means that if you dig deep enough in whatever search engine you use, you will probably find specifications for those protocols. That does not mean they are free and open source, which is very important — and this is kind of the point of this presentation. So, yeah. Some protocols are freely available. Some of you know a few of those. AX.25, which is an amateur adaptation of the X.25 protocol, mainly works on VHF and UHF, so above 30 megahertz. It is not designed for voice, of course; it's digital, but mainly data bits, let's say. And D-STAR, which is what most of us could consider an open source protocol for amateur radio. It is the first protocol really created for amateur radio usage, so it is designed from the ground up with amateur radio in mind. It has open specifications: from the start, they decided to publish the specifications — mostly in Japanese. So this can be an obstacle. Maybe if you speak Japanese, it's easier for you. I don't know. YSF is Yaesu's proprietary mode. Specifications can be found online, but that's pretty much it. You have to have a Yaesu radio to do YSF.
FT4, FT8 — just an example of a few modes that are used. Very slow speed, very long range, very low power on HF, so very low frequencies. And this kind of illustrates a point I'm going to get to. Then you have, of course, DMR, TETRA, P25 — all those commercial protocols which have been adopted by amateur radio but were not designed for it. The main thing is — and especially when you talk about FT4, FT8 and such protocols, there are many of them — those have only one closed source implementation. It's not easy to play around with. You can't just say: okay, I'm downloading this, trying to modify this — is it better, is it worse? The way you play around is: okay, which power do I need to reach this country in this weather, or whatever — which is not what is suitable for each and every one of us. So we will briefly take D-STAR as an example. Released by the Japanese amateur radio league, JARL, in Japan in 2001. It uses the AMBE vocoder from DVSI. So very briefly, a vocoder: you know that voice is a very complex signal. You need a lot of bits to transmit voice, but amateur radio protocols are slow speed, low bit rate. So you need to encode the voice into something which is manageable by those digital voice protocols. And the way it is done in the case of D-STAR is using the AMBE codec from DVSI. The specifications are publicly available, but there is no license tied to those specifications, which means you do not have to publish whenever you deviate from them. And so it kind of de facto became ICOM's proprietary mode. It is called D-STAR, but it is not made to be interoperable with other D-STAR implementations — and by the way, there are not really many other D-STAR implementations. So yeah, main obstacles. First, manufacturers exploit the fact that specifications are not really licensed, and so they can find tricks to lock down their environments. The second main obstacle is technical capabilities.
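To put numbers on why a vocoder is needed at all: raw telephone-quality speech takes far more bits per second than these narrowband digital voice channels can carry. The arithmetic below is illustrative (not from the talk; the vocoder rates are rough, commonly cited figures):

```python
# Illustrative arithmetic: why digital voice modes need a vocoder.
sample_rate_hz = 8_000          # telephone-quality speech sampling
bits_per_sample = 16            # linear PCM
raw_bps = sample_rate_hz * bits_per_sample   # raw audio bit rate

# Rough, commonly cited vocoder bit rates, for scale only (assumed values):
vocoder_bps = {"AMBE-style": 3600, "Codec 2 (3200 mode)": 3200}
compression = {name: raw_bps / bps for name, bps in vocoder_bps.items()}
for name, ratio in compression.items():
    print(f"{name}: {ratio:.0f}x smaller than raw PCM")
```

A 40x reduction is what makes voice fit into a channel whose whole over-the-air bit rate is only a few kilobits per second.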
Back in 2001, encoding voice on a microcontroller was not really feasible. That's why you needed an ASIC — a dedicated chip on the board, made by DVSI with their AMBE codec — to be able to encode voice into bits manageable by whatever digital voice protocol you wanted to use. So I'm not spitting on DVSI and AMBE all the way. There is a whole lot of technical reasons why it is that way, but I think it's good to understand where we come from, to see where we can go from there. Another thing to note: D-STAR, YSF, DMR, P25, NXDN, whatever — almost all the digital voice protocols, be they amateur radio or even commercial protocols, use AMBE, AMBE+2 — basically AMBE and variants of it from DVSI. Basically one vocoder to rule them all. So how does one tinker with a closed source vocoder? You don't. At least not really. But hey, the vocoder is an integral part of the protocol, so what do you do? Is it possible to have what we could consider a fully FOSS protocol if the vocoder is not open source, and if so, how do you do it? Well, 2001 was — sorry to break it to you — quite a few years ago. So a solution came in 2010. Its name is Codec 2, released in 2010 by David Rowe. I want to underline that this was not an easy task. It was the topic of a full PhD thesis, itself relying on older work and algorithms and so on. So nobody woke up in the morning and said: hey, let's do an open source vocoder, it's going to be easy. That's not how it goes. It's fully open source — no patents, no industrial secrets. That's the point. And since 2001, computing power on microcontrollers has increased by quite a lot. I mean, 8-bit PICs and 32-bit ARM microcontrollers are not the same thing. So this last brick, which was kind of the missing brick for fully open source protocols, allowed the emergence of two main protocols. The first one is FreeDV, which is designed by the same David Rowe.
He's not alone — he is one of the contributors to FreeDV — licensed under LGPL 2.1, using Codec 2 at lower bit rates because it's on HF: low frequencies, narrow bandwidth, you can't transmit a lot of bits. So you just slow it down. You degrade the voice a bit more, but then you're able to do long range digital voice communications. And it's also used as the reference Codec 2 implementation. So again, just like I said about FT4 and FT8 a few slides ago, you do something and then you provide your own implementation — except that this one is open source. And then M17, GPLv2, uses Codec 2 at the highest bit rate available. It fits in a standard FM bandwidth for VHF and up, so you can't really use it on HF — it's a bit too wide, you are going to annoy a few people. It was published in 2019. So, the M17 protocol has all the features you could expect from a digital protocol in amateur radio. You have packet mode, so you can use it to control a remote site, for example, just by sending commands. You have stream mode, which is the mode used for digital voice. It supports AES encryption, which, depending on where you live, you may or may not be allowed to use. I know I can't. It also has specifications for traffic over IP, which I think is a good thing. If you look back, the main digital voice protocols do not have that. So the community will kind of go its own way — each one has its own way of doing it, different implementations — and then somebody comes along: hey, I have an idea, let's try to interconnect this, and it's just one more brick in a very tall and fragile wall. So here we provide specifications for this, which I think eases implementation and interoperation. You probably know about DMR: to use DMR, you need a DMR ID, which is centralized. We don't. In this protocol, you only need a callsign — and if you can use the protocol, you most probably already have a callsign, so problem solved.
And the specifications are open source and licensed under GPLv2, which means that if some big manufacturer says, hey, I like this new protocol, let's try to benefit from it — yeah, great. But if you modify it to make it incompatible with our specifications, we will force you to publish your specifications completely, and we will find a way to make sure that whatever we do next is compatible with you. If you don't want to be compatible with us, we will be compatible with you. Okay, so this was the M17 protocol, but we should go beyond that. There is the whole M17 project. More than a protocol, getting rid of the proprietary vocoder allows a load of things. First, you can have it running on your computer. You don't have to pay any license fee. You don't need any USB dongle that you have to plug into your computer so that the software goes through the proprietary chip, blah, blah, blah. You can have software on your computer for D-STAR, for example, but you need that USB key on your computer to have the license to use the codec. You can have it on your phone — same thing. DroidStar, maybe some of you know about it, maybe not yet: this is a very small app that allows you to use digital voice protocols and connect through reflectors, so servers. M17 allows it to run straight out of the box. For the AMBE codec, there are online implementations — illegal implementations — that allow you to use AMBE, but you have to find them, download them, and then it becomes very shady, and it's a cat and mouse game between DVSI and amateur radio operators. You can have it on your radio — a lot about that in just 10 minutes, apparently. You can have it on reflectors. D-STAR reflectors need the same USB key that you would need on your computer to translate the voice between AMBE and whatever else you would use.
A small note: if you have a D-STAR reflector which goes from D-STAR to D-STAR, you do not need this key, because you do not need to decode the voice and re-encode it. You can just pass the encoded bits around. But if you want to switch from something to something else, then you're stuck. So yeah, it's a whole ecosystem which was able to grow from the ground up because of the open source nature of Codec 2. Included in this ecosystem is Module 17 — this board, which is open source hardware, open source software, open source protocol, open source almost everything you can wish for. Let me get into the frame of the camera, maybe. So you have the board here, which is the newest revision, 0.99, because you never do the 1.0 in one go. And then the enclosure that goes around it, because having this bare on your desk is screaming "I want a short circuit as soon as possible", so let's try to avoid that and put it in an enclosure. The difficult thing is, yeah, when you have open source hardware, making money out of it is difficult, but this exercise is intentionally left to the reader. Yeah — fully open source, affordable: about 50 euros. Try to find a digital voice modem, TNC, whatever, for that price. I think you will come back to us. OpenHT, another baby which is on its way, not as advanced as Module 17 yet, is aimed at being a fully open source portable radio. Basically, if you can modulate it, we can send it. For now it only works on the 70 centimeter band, 430 megahertz, and on 2.4 gigahertz. So for those of you who can see QO-100 in the sky, it does the uplink to QO-100 — at 25 milliwatts. Hey, it's a prototype. Step by step, please. It comes with its 3D printed enclosure, also open source. Very quickly: it relies on a dev board from STMicroelectronics, and the backside we did ourselves, with the power supply, the FPGA, and the transceiver. The FPGA — so you can maybe see a few asterisks on the screen: the FPGA toolchain is sadly not open source.
You know, maybe, if you play with that, that having open source FPGA toolchains is difficult. It is one of our goals, but usually vendors will provide IPs in their software and then say: yeah, if you want to exploit this commercially, please talk to us first. So it's always a bit difficult to deal with that. We have plans for the future, though. We are starting the work to port it over to OpenRTX, which you will hear about in six minutes and a half. We want an open source FPGA toolchain. A quick note: maybe the FPGA toolchain is not open source, but that does not prevent you from building it yourself — downloading the software, which is free as in beer, rebuilding the bitstream and uploading it to the radio. So you can still tinker with it however you want, but it's not, strictly speaking, open source. We want five watts output. We want USB-C charging. Oh my god, come here, please. So yeah, we have plans. We are not only pushing our protocol with it; we just want to make products that are better and open source for the community. I think there is a big hole in the ham radio community here, and we are here to fill it. A very quick shout-out to some very interesting projects close to M17: the open source firmware OpenRTX; WPSD, the hotspot software you can use, which has supported M17 for, I don't know, a few months — contrary to Pi-Star, which has supported M17 for, I don't know, 10 days; and MMDVM, the hotspot hardware. So we rely on those to have hotspots which do M17, and there are much, much, much more things that revolve around this. Okay, so thank you for your attention. I hope you liked it, and I hope it gave you some ideas and the desire to join us and help us. We need devs, please. I know everybody needs devs. Check out our ham radio infobooth in building AW — I think most of you already came to say hello, but if you did not, we are still there today. Okay, thank you very much, guys. Thank you. We have some time for questions. Yeah.
It's 4FSK with root-raised cosine filtering. The main chip that you would use is the CC1200 from Texas Instruments. Yep. Using the packet radio port, just like you would do for AX.25, for example, with your old TNC. So this module basically takes the sound from the microphone, processes it, encodes it using the Codec 2 vocoder, and does the protocol framing, baseband creation, processing and filtering. You have the baseband output here, you feed it to your radio, and then the output is 4FSK modulation, for the M17 protocol. But if you want more, come to our infobooth. Yeah. What FPGA is on the board? For now it's the latest — I forgot which one. It's not an iCE40, which has an open source toolchain available. We had some technical issues with the FPGA, so the one we use, yeah, is a Lattice Certus, something, I guess — because for the transceiver we use, we needed LVDS pairs for the data transmission, 64 megahertz for the LVDS pair speed. Yep. You have been addressing some of the shortcomings of all the other modulation schemes and protocols. So leaving aside that they are not all open source, they have other shortcomings: on UHF there are reflections and fading and many other things that you experience outside the lab. Are you also addressing these things with M17? I mean, is there better voice quality? Does it cope better with fading and reflections and things like that? Okay, yeah, there are indeed shortcomings with many digital voice protocols. We are between a rock and a hard place. We use a four-level FSK modulation scheme, which basically does not allow you to overcome the multipath problems, reflections and so on, so we are aware of this. We have to go step by step, maybe. The specifications are open; you are free to fork them and, I don't know, implement it in OFDM to avoid some of the problems linked to the physical layer.
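The "4FSK with root-raised cosine filtering" answer can be sketched end to end: map bit pairs (dibits) to one of four symbol levels, then shape the symbol impulses with a root-raised cosine filter before they drive the frequency modulator. This is an illustrative sketch, not the M17 reference implementation; the samples-per-symbol, roll-off, and dibit mapping are assumptions to be checked against the spec:

```python
import math

SPS, BETA = 10, 0.5   # samples per symbol and RRC roll-off (assumed values)

def rrc_taps(beta=BETA, sps=SPS, span_syms=8):
    """Root-raised-cosine impulse response, handling the singular points."""
    taps = []
    for k in range(-span_syms * sps, span_syms * sps + 1):
        t = k / sps   # time in symbol periods
        if abs(t) < 1e-12:
            h = 1 - beta + 4 * beta / math.pi
        elif abs(abs(t) - 1 / (4 * beta)) < 1e-12:
            h = (beta / math.sqrt(2)) * (
                (1 + 2 / math.pi) * math.sin(math.pi / (4 * beta))
                + (1 - 2 / math.pi) * math.cos(math.pi / (4 * beta)))
        else:
            h = (math.sin(math.pi * t * (1 - beta))
                 + 4 * beta * t * math.cos(math.pi * t * (1 + beta))) / (
                 math.pi * t * (1 - (4 * beta * t) ** 2))
        taps.append(h)
    norm = math.sqrt(sum(h * h for h in taps))
    return [h / norm for h in taps]   # unit-energy filter

# One plausible dibit-to-symbol mapping (check the spec for the real one)
DIBIT_TO_SYMBOL = {0b01: +3, 0b00: +1, 0b10: -1, 0b11: -3}

def shape(dibits):
    """Turn dibits into an RRC-shaped 4-level baseband signal."""
    taps = rrc_taps()
    impulses = []
    for d in dibits:                       # impulse train, one per symbol
        impulses.append(float(DIBIT_TO_SYMBOL[d]))
        impulses.extend([0.0] * (SPS - 1))
    out = [0.0] * (len(impulses) + len(taps) - 1)
    for i, x in enumerate(impulses):       # direct-form convolution
        if x:
            for j, h in enumerate(taps):
                out[i + j] += x * h
    return out

baseband = shape([0b00, 0b01, 0b11, 0b10])
```

The receiver applies the same root-raised cosine filter, so the cascade forms a raised-cosine response with no inter-symbol interference at the sampling instants.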
And for the voice quality, some will say that Codec 2 sounds better than AMBE, some will say that it's worse. It depends, really, but I think it's at least on par with the other protocols. Yeah? So that's a nice thing. For the intended use case on UHF, do you have curves or measurements that say what is needed as SNR — what analog voice can still do versus M17, and how much further you can get, with, like, bursts? Yeah, so this basically boils down to: do we have any graphs or lines explaining the difference between analog and digital? Well, this is a wider topic than just M17, but I agree that maybe having those comparisons might be a good idea to push those modes forward. But no, we don't have those curves at the moment, I believe. Many people could have, I think, but yeah. One more question, if I can. Yeah. Have you thought about interfacing with that — and I know I'm ahead of your description, so I'm asking you — because one of the things I like, as an experimenter, is that you can put other data on your channel, but on the current hardware you cannot interface with it. Okay, so: can we send arbitrary data using Module 17? Not yet. Everything is there for you to be able to do it. The firmware is OpenRTX. There is a USB-C port with data lines connected to the chip; we use it for firmware updates, and we use it for STDIO output, for debugging basically. And yeah, you should be able to — in the future it is planned, and Silvano will talk about that in a few minutes — to have a communication channel between the computer and the board, and from there you can send basically whatever you want. So yeah, it is feasible. Yeah. Have any large manufacturers shown interest in M17? Yes. We talked with Kenwood, which would be interested in implementing M17 — and I point back to the fact that the specifications are licensed GPLv2, so they cannot lock it down for their own use. We also have Connect Systems, which showed an interest in our radios, and Baofeng.
With no more questions — and I see no hands rising — a big thank you again. Thank you very much.
Expanding IQEngine into a Hub for Previewing RF Signal Processing Software
Awesome. Thank you. So my name is Mark, and I'm here to show off the IQEngine open source project. I'll talk about where it's headed in the future as well. Also here we have Roman, who's involved in IQEngine as well as SigMF. This talk is aimed primarily at two groups. One is folks who are new-ish to SDR and RF signal processing — students, hobbyists, anyone who wants to learn more about all this software you're seeing. And second is folks who run or maintain an open source project that involves RF signal processing in some way. And hopefully, even if you're not in those groups, you'll still find something of interest here. So IQEngine, currently, is a web app that is all about RF recordings. It lets you preview recordings, manage them, analyze them, do some light processing, and — most importantly — share them, all in your browser. So it's entirely web based. And I'll show a quick little demo of what the current tool looks like. IQEngine is available at IQEngine.org — the project runs a public instance of the tool. But in this case I've got one running locally, because I wasn't sure about the Wi-Fi. So the main screen here is essentially a list of these RF recordings. They're all stored in the SigMF format, if you're familiar with SigMF. We have some good ones from Jean-Michel and Aang23 — a lot of folks who are here today. You can also open a recording that's local to your machine, and then all the processing is done client side. So I can open a local directory full of recordings — here, recordings — and it'll list them all and generate the thumbnails. It's actually the same directory that I had served from the server. You can also open just one local file pair. Anyway, back to the list here. If you click on one of them, you're brought to a spectrogram-style interface which loads the samples that you're looking at at any given time. That way you can have enormous files. And the minimap on the right represents the entire recording.
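As an aside, the spectrogram view described here boils down to a simple loop: cut the IQ stream into FFT-sized chunks, transform each chunk, and display the magnitudes in dB as one row of the waterfall. The sketch below is illustrative only (not IQEngine's code — it uses a naive DFT for clarity, where a real implementation would use an FFT and a window function):

```python
import cmath
import math

def spectrogram_rows(iq, fft_size=64):
    """Split complex IQ samples into FFT-sized rows of dB magnitudes,
    roughly how a waterfall/spectrogram view is built."""
    rows = []
    for start in range(0, len(iq) - fft_size + 1, fft_size):
        chunk = iq[start:start + fft_size]
        row = []
        for k in range(fft_size):           # naive DFT, one bin at a time
            bin_k = sum(x * cmath.exp(-2j * math.pi * k * n / fft_size)
                        for n, x in enumerate(chunk))
            row.append(20 * math.log10(abs(bin_k) + 1e-12))
        # rotate so 0 Hz sits in the middle, as spectrum displays usually do
        rows.append(row[fft_size // 2:] + row[:fft_size // 2])
    return rows

# A pure complex tone at 1/8 of the sample rate should peak in a single bin
tone = [cmath.exp(2j * math.pi * 0.125 * n) for n in range(256)]
rows = spectrogram_rows(tone)
```

Each row then just needs a color map to become one line of pixels in the spectrogram.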
So you can jump to any part of it, and the little gray area is the part you're looking at. We have time, frequency and IQ plots like you'd expect. That's FM. And then some other features here: there are time and frequency cursors if you want to measure stuff, adjustable dynamic range for the color, windowing, FFT size, you can add FIR filter taps, and all of that is run client side. The FFTs are done client side, and so is the FIR filtering. But the one part that's not client side is our plug-in system. So if you select a portion of the recording that you want to send to the plug-in server, you can select it there and then, let me zoom in here, choose a plug-in. So this was an FM recording, so I'm going to run an FM receiver that's implemented in GNU Radio. And it sends the samples to the server that runs GNU Radio. And then in this case, it's actually returning a WAV file with the audio of that signal. But there are other types of outputs, like you could run a block or a plug-in that gives you IQ as the output. So if I do a low-pass filter, it's just going to output IQ. Let me give it a proper cutoff frequency there. And then currently we're just displaying the IQ in a pop-up, but in the future we're trying to figure out the best way to replace the signal that's already on the screen with this new one, so that you can chain plug-ins together. So that's sort of the gist of the tool. Now back to the slides here. So IQ Engine is built on top of SIGMF in many ways. If you're not familiar, SIGMF is an open standard for saving your RF recordings to a file. It's as simple as it gets: you have a binary IQ file, which is sort of the native way to store a recording, and then a JSON file. And the SIGMF specifications mainly tell you how to write that JSON file. So there's stuff like how you specify sample rate, center frequency, data type. And then I'll show you annotations in a second here. And by using SIGMF, you have software interoperability.
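To make that JSON side concrete, here is a minimal sketch of writing a SIGMF metadata file from Python using only the standard library. The field names follow the SIGMF core namespace; the sample rate, frequencies, and file name are made-up example values, not taken from the talk's recordings.

```python
import json

# Minimal SIGMF metadata for a hypothetical recording "fm_capture.sigmf-data".
# Core fields per the SIGMF spec: data type, sample rate, per-capture tuning,
# plus one annotation (a bounding box in time and frequency).
meta = {
    "global": {
        "core:datatype": "cf32_le",      # complex float32, little-endian IQ
        "core:sample_rate": 2_000_000,   # 2 Msps (example value)
        "core:version": "1.0.0",
    },
    "captures": [
        {
            "core:sample_start": 0,
            "core:frequency": 98_500_000,  # center frequency in Hz (example)
        }
    ],
    "annotations": [
        {
            "core:sample_start": 100_000,
            "core:sample_count": 50_000,
            "core:freq_lower_edge": 98_400_000,
            "core:freq_upper_edge": 98_600_000,
            "core:label": "FM broadcast",
        }
    ],
}

with open("fm_capture.sigmf-meta", "w") as f:
    json.dump(meta, f, indent=2)
```

Any SIGMF-aware tool, IQ Engine included, can then pair this `.sigmf-meta` file with the binary `.sigmf-data` file and know how to interpret the samples.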
And then you can also avoid data bit rot, where in five years you forget what sample rate stuff was taken at. If you want to learn more about SIGMF, there's a link at the top of IQengine.org, and it also links out to the SIGMF GitHub page. So SIGMF, the standard, is managed by GNU Radio; it's kind of a sub-project, sort of. Now, as far as the IQ Engine code itself, it's web-based. The front end uses React and Tailwind; some big dependencies that we get a lot of use out of are CodeMirror for all of the code editing, Pyodide, which lets us run Python in the browser (I didn't demo that, but there are some videos online about how that gets used), Plotly for those time, frequency and IQ plots, and WebAssembly for FFTs. And then for our documentation, we use the MDX system, which lets us write it in Markdown and then have it rendered as part of this page here. So this was written in Markdown, and then it lets us render it as React components. Kind of nice. Now, so that was kind of the introduction, but I wanted to start off where I left off at GNU Radio Conference last year. So what have we done since then? Well, now it's possible to run a local instance of IQ Engine, like if you want to run it within an organization or whatever, to share things privately. You can run an instance and you can put the recordings on the same server, so easy enough, or on something that's mounted to the file system; as long as Python's open() can see it, it can serve the recording. And the other option is to use cloud storage, which is what we do for IQengine.org. And as far as how to do that, the general idea is you pick a directory on your server and then you run IQ Engine with the Docker images. So if you go to the install-with-Docker page, really all you have to do is change the directory that's mounted into the container, so pretty much this part here of the command. And then the rest of this command will pull the latest IQ Engine Docker image and it will run it.
And you should be able to see your recordings. They'll look like this because they'll be local to the back end, versus IQengine.org, which has a few different data sources that pop up here. So, and that's, yeah, fairly new. If you end up using this and notice some quality-of-life issues, definitely reach out on Discord or GitHub. So next up, I'm going to dive into the plug-in system that you saw me run with the FM receiver. So the idea is any RF signal processing that you want to run on a back-end server but triggered from the browser. So what we have within our project is this REST-based API, and it allows for someone to write the plug-in server in any language they want. We have an example in Python, and then Loic wrote one for Rust. The Python one can run GNU Radio flow graphs: it pretty much runs the Python flow graph and then uses ZMQ to get samples in and out of it. But in the future, there'll be more languages, and by using this REST API, it doesn't matter. Really, you can deploy it and implement it however you want to, as long as it supports this interface. I'm going to show a little demo later running SatDump, which is an example of a whole separate project, not a GNU Radio flow graph or anything, but a piece of open-source software that you can trigger from IQ Engine. And then Aang will be presenting more about SatDump in like an hour or so. So as far as how the plug-ins look, for the Python-based ones, we tried to make it as easy as possible to create a new one. This isn't the actual REST API; this is just how you would make a new Python plug-in, and then you would use the existing server code that we already have. So you can see you have to specify your custom parameters, and then there's a run function where you're given the samples and you return one of several different data types. As far as GNU Radio, you specify the flow graph, but the only catch is you have to substitute your file source and GUIs with the ZMQ source and ZMQ sink.
That's how we get samples in and out. Not the most performant thing, but it gets the job done. So you can see these first couple of blocks are the ZMQ ones, and then the rest represents the flow graph. So we have a Python flow graph that implements an FM receiver in this case, and that was the plug-in that I ran earlier. So the motivation here is: if you are an author of an out-of-tree module for GNU Radio, you probably already shared the code somewhere like GitHub and created some examples, some example flow graphs, but the next step would be making it more accessible and easy for folks to find and play with, and I think this could be an option there, by exposing it as a plug-in. Now, let me go back to the plug-ins. So I'll go ahead and run the SatDump one. So I've got a recording of NOAA APT right here, contributed by Aang. So I can click that and I can browse around the signal. You'll notice it's actually offset, but I believe this is the APT signal. You can jump to different parts of the file there, and then, as far as running it through SatDump, I want to run the entire file because it needs a decent amount of samples. So I'm going to select the whole file, and then under plug-ins we've got the fresh new SatDump plug-in, already preloaded with the pipeline for APT, but you can put whatever pipeline you want. So right now it ran SatDump under the hood. So here's one of the images that comes out. I think IQ Engine still has some work to do as far as, if you have a bunch of different outputs, how do you present them all to the user? There's a lot of web design that can go on there. So either it pops open something or it saves a file, and it supports all the different MIME types. If you're familiar with the web, it sort of just uses MIME types, and then we added some custom MIME types for IQ, like the different data types for SIGMF. As far as other plug-ins, we have a detector as well.
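To give a feel for the plug-in idea, here is a sketch of a Python plug-in along the lines described: custom parameters plus a run function that receives IQ samples and returns a result (in this case SIGMF-style annotations, like the detector shown later). The class shape, parameter names, and return format are illustrative assumptions, not IQ Engine's actual plug-in API; check the project's source for the real interface.

```python
class ThresholdDetectorPlugin:
    """Illustrative plug-in: flags runs of samples whose magnitude
    exceeds a threshold and reports them as SIGMF-style annotations.
    The structure (custom parameters + run()) mirrors the talk's
    description; the names here are hypothetical."""

    # A custom parameter the UI could expose to the user
    threshold = 0.5

    def run(self, samples, sample_rate):
        annotations = []
        start = None
        for i, s in enumerate(samples):
            if abs(s) > self.threshold:
                if start is None:
                    start = i            # a detection begins here
            elif start is not None:
                annotations.append({
                    "core:sample_start": start,
                    "core:sample_count": i - start,
                })
                start = None
        if start is not None:            # detection runs to the end of the buffer
            annotations.append({
                "core:sample_start": start,
                "core:sample_count": len(samples) - start,
            })
        return {"annotations": annotations}
```

The plug-in server would call `run()` with the samples the user selected in the browser and ship the returned annotations back as bounding boxes.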
So let me go to a recording that I give my students when we study signal detection and classification. This is kind of like a toy example meant for testing a detector, where you have a few different signals here. IQ Engine's not about implementing RFML; it's about sharing it and making it more accessible. So we made a very simple detector just to have an example. It's written in Python; you're welcome to check it out in the source. It's called Simple Detector. We also have Marco's detector; he was someone else who was working on it. Simple Detector was pretty quick for that number of samples, and it did a decent job. There's one extra little detected emission there. Now, the results are in the form of SIGMF annotations, which are bounding boxes in time and frequency, and that's how the results are shared from the plug-in. So if you wanted to download the raw metadata file, the SIGMF file, you can go to the bottom here, and here are the annotations that the plug-in created. So we sort of copied the SIGMF format for the return data type. And if you wanted to perform classification, you would simply fill out the label and they would show up. Within IQ Engine you can also edit the annotations and edit the labels. So if you wanted to manually tweak stuff, like you were making an RFML data set, sort of like a golden meta file, you could do that here. What I find most useful is simply to have a quick glance at how well something worked. If you had tons of files to run through, you wouldn't want to do all this clicking; you would just make a script, and you could certainly run the plug-ins from a Python script. It would just need to call the REST API. Back to the slides. All right, so I want to take just a really quick tangent to remind people about what GNU Radio provides and then how it relates to this plan that the project has. So GNU Radio, it's a way to implement your RF DSP in C++ or Python.
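As for scripting the plug-ins, the general shape would be something like the following standard-library sketch: assemble a JSON payload and POST it to the plug-in server's REST endpoint, then read back the JSON reply. The endpoint path and payload field names here are hypothetical placeholders, not the actual IQ Engine API, which is defined in the project's repository.

```python
import json
import urllib.request

def build_plugin_request(samples_b64, sample_rate, custom_params):
    """Assemble a JSON payload for a hypothetical plug-in endpoint.
    Field names are illustrative, not IQ Engine's real schema."""
    return {
        "samples": samples_b64,          # IQ samples, e.g. base64-encoded cf32
        "sample_rate": sample_rate,
        "custom_params": custom_params,
    }

def run_plugin(base_url, plugin_name, payload):
    """POST the payload to <base_url>/plugins/<plugin_name> (a made-up
    route) and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/plugins/{plugin_name}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    payload = build_plugin_request("QUJD", 2_000_000, {"cutoff_hz": 100_000})
    # run_plugin("http://localhost:8000", "lowpass_filter", payload)
```

With a loop over a directory of recordings, this is all a batch-processing script would need: no clicking in the browser, just repeated calls against the same server the web UI talks to.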
It gives us a standard framework for doing that implementation, and it's easy to get annoyed at the boilerplate and how to install everything. But in the end, if you use that framework, it means that other people who are familiar with GNU Radio can then install your out-of-tree module. They sort of know the standard usage of your blocks, where to look for the example flow graphs, how to connect your application with the SDR sitting on their desk. And that's an enormous value; that's, in my opinion, one of the main values of GNU Radio. And then the GUIs are nice as well; it's not always easy to program GUIs. So if you're curious about learning about different out-of-tree modules, CGRAN.org is where we point people. And I mention this because CGRAN represents a centralized location for GNU Radio applications and libraries, what we call out-of-tree modules. But kind of zooming out one more layer, going beyond just GNU Radio, is what I'm going to talk about here in a sec. So let's say that you're a developer of open source software that involves RF processing in some way, like you wrote SatDump and you're doing satellite signal processing. You build something, you want to share it, you want to keep it easy to demonstrate and show off to people, easy to use. Those are sort of the main steps you might take. Now, on the other side of things, you have users out there, whoever they are, individual students, organizations, who first need to discover that this software exists. That's like the very first step. And then: how do you install it, how do you run it properly, how can I evaluate how well it's working and use it with my SDR or my recordings? So there's kind of a duality here. On the developer side, you might post your code to GitHub, you might share it as part of a FOSDEM talk. That's kind of the current method that we use. On the user side of things, you might Google the topic you're interested in, like a specific satellite, Wi-Fi, whatever.
You'll probably come across what's out there. But it's not the best way to do it, right? Just by Googling. So installation can be an enormous barrier. When I teach CS students, it depends who you are, but some students and some folks are better at getting this software installed than others. Obviously, having a lot of Linux experience helps; folks who are new to Linux but want to dive into signal processing can struggle here and there. So it can definitely be a barrier. Now, how do you actually run it? If it's a GNU Radio flow graph, you probably know how, but not everything's easy to use. There are RF libraries out there where it's not clear how exactly you use them, but you know they're powerful. And then lastly, evaluating the software: maybe you're going to use it as a dependency or use it as part of a project. So the idea is to sort of evolve IQ Engine, so that instead of just being a way to share and evaluate RF recordings, it can also be used for RF open source software in general. Sort of like a central hub, community driven, for devs to share stuff and for users to find and discover software. And then, by exposing the software as a plugin, users can try it out on recordings that are already on the site, or their own. And one side benefit is that universities, and anyone else who wants to show off their expertise and creates open source software, can use this central hub as a way to do that. Now, this is all in the browser, primarily for accessibility's sake. It's not the most performant way to do something like this, but it's extremely convenient. Really, it removes a lot of barriers. So users would be able to play around with a certain function using a variety of recordings. And it's more than just using recordings. In the future, maybe there's a way to lower the SNR, like add noise, and see if it still works or whatnot. Add a frequency shift, see if the RF function still works.
And then on the author side of things, all you really would need to do is add this REST-based interface, or at least make it easy to call with a CLI and then retrieve the results. So like with SatDump, I'm not using a REST interface; I'm just running the CLI in a way that's easy. Anyway, now, one design decision that was made was to allow multiple plugin servers to connect to a single IQ Engine instance, like at IQEngine.org. That way, a university could run their own plugin server, have total control over it, but they could share their expertise, everything they want to show off. And this is really just a concept. So right now I showed you how IQ Engine lets you preview RF recordings and RF data sets. Well, I think in the future, with these building blocks that I showed through the plugin system and this REST interface that we're designing, you could have a tool that would be used for previewing what I'm calling functions and apps, really anything that involves RF signal processing. Now, there are limitations, so a lot of RF apps can't simply be run on a recording. srsRAN is an excellent LTE and 5G radio stack, but because of LTE and 5G's strict latency requirements, you can't easily just play it back. It's not straightforward, simply running it on a recording; you sort of want to simulate that closed-loop system. So not all RF functions and apps are going to be shareable this way, but I think a vast majority of them are, definitely GNU Radio apps and those kinds of processing applications. The other thing that you wouldn't show off is like an SDR interface, like a GUI; that wouldn't make any sense. Now, if you're interested in contributing, it's a community-led project, so we can always use more web devs. It turns out that the kind of folks in these RF circles tend to know C++ and Python, but less so on the web side. And I know I've had to learn a lot of web development to get this project moving more.
So even if you're not a web developer, there's plenty of other ways to contribute. We're always looking for more interesting RF recordings to share. If you have an entire data set, we can add a whole category here on the left. So we have Daniel Estevez's awesome satellite recordings as an example, where we can link off to your website. And so if you want to get involved in any way, there's a Discord link at the top of IQengine.org. We have a little community that's slowly building. And with that, I will take questions. Yep? So yeah, the question was related to geolocation data, like running it as a plugin, I assume. Yeah, while I explain that: there actually is already a maps-based interface. Anyway, when we designed the API I mentioned, we made sure to allow multiple channels of RF. So those channels could be time-synchronized recordings from different sensors. That way, at least you could run it from the back-end perspective. And then, yeah, I guess we would need to add a maps interface to the spectrogram page to make that fully happen. But good, great suggestion. Yep? Well, so GNU Radio has some Azure credit that they got, and that's what we've been using for a lot of these recordings. And we can use that for other folks' recordings if they want to share them publicly. Yeah, you can reach out and we can transfer it over. No, no, no, like I could upload it for you. So GNU Radio has a blob storage account, so I could give you a SAS token for you to upload it yourself, or I could upload it for you. Yep, I think there was one more. Yes, there is something that's a work in progress, but I guess I'll share it. So there's an upload page. Oh yeah, so IQengine.org/upload should allow you to upload a recording. The Wi-Fi's not great, but yeah, that would be the first place to go. I think we're out of time. Any last question? Yep?
So, the question was how well it actually handles really large files. So, I mean, it was designed to deal with terabyte files from the start, which is why we have that minimap, and when you open the spectrogram page, it's only loading what you're looking at at any given time. So it's sending the IQ samples to your client, to the browser, and the browser's doing the FFTs. So it's sending maybe a few million samples to get a spectrogram like this, but if it's a multi-terabyte recording, you'll just have a smaller gray window here, because it'll represent a smaller part of the whole recording. Yeah, I mean, you have to store the recording, but there's no part of the code that sends the entire recording to either the client or the back end, because we know it's not going to fly for huge stuff. All right. Yep? Yeah. Actually, SIGMF has a lot of that; there's even an extension for more details about the hardware involved. Definitely check out the SIGMF specs. So if you want a five-minute introduction to SIGMF, that's what we have here on IQ Engine, but I would, yeah, go ahead and go to the specs and dive in, and you'll find a lot of the parameters that you mentioned. All right, thank you very much.
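The "only load what you're looking at" idea is easy to sketch: seek to the samples currently in view and read just that window, so the total file size never matters. This sketch assumes interleaved little-endian float32 IQ (SIGMF's `cf32_le` data type); the function name is mine, not IQ Engine's.

```python
import struct

BYTES_PER_SAMPLE = 8  # cf32_le: one float32 for I, one float32 for Q

def read_iq_window(path, start_sample, num_samples):
    """Read num_samples complex samples starting at start_sample,
    without touching the rest of the (possibly huge) file."""
    with open(path, "rb") as f:
        f.seek(start_sample * BYTES_PER_SAMPLE)   # jump straight to the window
        raw = f.read(num_samples * BYTES_PER_SAMPLE)
    floats = struct.unpack(f"<{len(raw) // 4}f", raw)
    # De-interleave I and Q into complex samples
    return [complex(i, q) for i, q in zip(floats[::2], floats[1::2])]
```

A front end can then request a few million samples for the visible window of a terabyte recording, and the minimap just maps that window back onto the whole file.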
DAPNET: Bringing pagers back to the 21st Century
Thank you very much. Hello. Good afternoon. Hope you're all well, not cooking too much in this room. My name is Manuel, and I'm a radio amateur, a radio nerd if you prefer. I like experimenting with new or older equipment, seeing what we can do with it, or using existing software or hardware and deploying it as widespread as possible, if possible within the amateur radio community, and keeping things open source whenever I can as well. So today I'm about to talk about cutting-edge technology straight from the 1990s: pagers. So if you've seen these things, well, they might bring you back some memories, because they were heavily used in the 80s and the 90s. They were used mostly by doctors, drug dealers or businessmen, sometimes the three at the same time. So basically they were everywhere in the 90s and started to disappear later on, when GSM made its appearance. But this was something really common in those times; you can still see it in TV shows, medical TV shows: the doctor's getting paged because there's a code blue, whatever that means, in room 204. Now, I'd like to explore this thing, because behind this hallmark of a past era, these are extremely simple communication systems. And I think it's worth exploring them a bit more and seeing what you can do with them today in the open source community and the amateur radio community. So today we'll be looking at what paging is in itself, what that means, how it works, generally speaking. We'll go a bit into the technical part of it: how it works, the modulation types, how you can make a pager ring, and then we'll bring that into the amateur radio context. We'll talk about the DAPNET project, which has been around for a few years now, what you can do with it, how you can get started, and then I'll be open for questions if you have them. So coming back to the techniques, let's talk about paging in simpler terms.
Paging is basically sending a message, making a small device ring one way or another, very often to small, low-power, compact receivers. Most of them use a standard called POCSAG, which was developed in the 1980s; much older standards exist but are almost not used anymore. So POCSAG is the one that remains. The other one was developed by Motorola and is proprietary, but we don't talk about that here. The topology is always the same: you've got one big transmitter, high power, and then you've got your receivers around it that receive the messages whenever there's one. So the frequencies, you have them starting in HF: you've got pagers on 27 MHz and then all the way up. Here in Belgium, the national services use 160 MHz. In other countries, you will see them on 460 MHz and sometimes even higher; in the US they go all the way up to 900 MHz, if I'm not mistaken. So you see them on a lot of different frequencies, and you also see that, compared to a classic two-way radio, the antenna is built into the device, which is itself a challenge, because it means that your signal needs to be higher in intensity to be received by those antennas, because they perform a bit worse than a standard whip antenna. Use cases: in the commercial world you'll find them, for instance, in one single hospital to be able to call doctors, or in industrial-scale systems, or sometimes a bit bigger, national scale being one of them. Here in Belgium, we have one single frequency for a distributed system of transmitters operated by Astrid, which is used by firefighters, ambulance services and others, so it's still being used today. You will also see them in food trucks or takeaway food courts: if you went to the Wolf two days ago, you'll have received a little pager that would ring whenever your food was ready. So this is also a pager in itself. How does that work?
As I said, it's using one single frequency, a specific carrier that we modulate in FSK, simple frequency-shift keying. You send a one by shifting one way, the other way is a zero; so just by shifting from one to the other, you send ones and zeroes, and then you format them into very simple packets. (Please mute your radio, I just heard it.) If you want to send a packet, usually you send a preamble that wakes the receiver up, because those receivers usually sleep for long periods of time and wake up from time to time to see if there's not a preamble for them there. And once it wakes up, it will start decoding the signal, and then you send an address and the linked message. And if the address doesn't match the pager's address, it will just shut down and go back into sleep mode, so that makes for very power-efficient receivers. This thing can last up to one month on one single AA battery. So yeah, that's pretty much the idea. Again, if you want your pager to receive a message, you put the address into it, basically. So if you want, for instance, to program this message, which is aimed at the pager with the address 101, you put the address 101 in the pager. If it receives it, it displays the message and rings. Otherwise it will just stay asleep: if this is, for instance, a message for 102, it's not aimed at the pager itself, so it stays silent. Now, you can also make group alerts that way. It's quite simple: you just put the same address, called the RIC, across all pagers, and if they receive it, they will all ring together at the same time, displaying the message. So that means that for individual or group calls, you basically assign one individual ID to a single pager, and then you put one common group ID across all pagers. So you can select whether you want to address one person or a specific group, and you can organize your system this way.
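The address-matching logic described above fits in a few lines of Python: a pager rings only if the incoming RIC matches its individual address or one of its group addresses, otherwise it goes back to sleep. The class and method names are illustrative.

```python
class Pager:
    def __init__(self, individual_ric, group_rics=()):
        self.individual_ric = individual_ric
        self.group_rics = set(group_rics)

    def receive(self, ric, message):
        """Return the message if this pager should ring, else None
        (the real device simply stays asleep)."""
        if ric == self.individual_ric or ric in self.group_rics:
            return message
        return None

# One individual RIC per pager, one common group RIC across all of them,
# mirroring the 101 / 102 / 1040 example from the talk.
pager_a = Pager(101, group_rics=[1040])
pager_b = Pager(102, group_rics=[1040])
```

Calling `pager_a.receive(101, ...)` rings only pager A, while `receive(1040, ...)` on both pagers rings them all at once, exactly the individual-versus-group behavior demonstrated later.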
So it makes for a very simple type of receiver, and then you can decide yourself, when you're building the network, how you're addressing each pager or each group of pagers. POCSAG on amateur radio is not new. It's been done since the 1980s; I think it appeared at the same time as paging itself, so we started fiddling with that a long time ago. We used TNCs connected to old VHF systems, and the thing is you very often had to modify the pagers themselves, by changing quartz crystals and retuning the receiver loops, to make sure that it fell within the amateur radio frequency allocations. But very often those were individual stations used for bulletin board systems at the time, for weather alerts or that kind of message. So it kind of disappeared when packet radio really folded after the 90s. So right now the only thing left of packet radio is mostly APRS. BBSes, you don't see them anymore, and the technology got lost in the ages. But now we have easier ways to interconnect stations together, using HAMNET for instance. So you now have IP links that can be made on amateur radio frequencies quite easily with modified Wi-Fi equipment or others. And there's a team from Aachen University of German radio amateurs that developed a network of internet-connected POCSAG transmitters using free and open-source software, and that is the DAPNET project. So DAPNET stands for Decentralized Amateur Paging Network. The idea is to have various core servers that are geographically separated, interconnected via HAMNET, that exchange the messages through multiple nodes. So if one fails, the others will take over. Now of course, if you're outside of that HAMNET link, you can always get a bridge through the internet, and this is what I'm doing here, because I don't have a HAMNET link here. We still haven't brought the HAMNET links from Germany up to Brussels. But you can go either way. The frequency is almost universal.
Depending on your regulations, we try to stay on the same frequency everywhere, which is 439.9875 MHz. That's a mouthful, but that's the one we try to use everywhere. The only exception right now I see is the Netherlands, because they don't have access to this frequency, so they're using a frequency on 432 MHz, if I'm not mistaken. But I mean, with this pager I can use it basically in Belgium, in Germany, in Switzerland. There are some transmitters in France as well, so it's growing little by little. Now, transmitters have to be synchronized one way or another, otherwise you'd have several transmitters that start keying up at the same time and interfere with one another. So they're split up in time slots: if you have two overlapping transmitters, you'll put one that transmits on one time slot and the other one that transmits on another time slot, just to make sure they don't transmit at the same time. So what happens is you send a message on the DAPNET infrastructure, and as I said, these RICs are only basic numbers, so there's no call sign you can encode in there. So there's a database on the DAPNET infrastructure that links your call sign to an identifier. Very often we put the DMR ID, because this is a way to identify hams with numbers, and then it matches to this specific RIC, this specific address, and sends it to the transmitters that are linked to the area we selected. So you can key up all transmitters or regionalize your calls. So you can say that if you know that the person that you're trying to reach is in Belgium, you put Oscar November dash all. If you want to reach an area in a specific province, well, you can narrow it down and try not to use the network as extensively, and just try to reduce the load if you know where your person is. Same for Germany, the Netherlands, Luxembourg, France: there's the same kind of geographical way of splitting up the transmitters.
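The time-slot idea can be sketched like this: divide time into a repeating cycle of numbered slots and let each transmitter key up only during the slots assigned to it, so overlapping transmitters with disjoint slot sets never collide. The 16 slots of 6.4 seconds are an assumption for illustration; DAPNET does use a slotted scheme, but check its documentation for the real slot count and duration.

```python
SLOT_SECONDS = 6.4    # assumed slot length, for illustration only
SLOTS_PER_CYCLE = 16  # assumed number of slots per cycle

def current_slot(epoch_seconds):
    """Which slot of the repeating cycle a given timestamp falls in."""
    return int(epoch_seconds / SLOT_SECONDS) % SLOTS_PER_CYCLE

def may_transmit(epoch_seconds, assigned_slots):
    """A transmitter keys up only during its own slots, so two
    overlapping transmitters with disjoint slot sets never interfere."""
    return current_slot(epoch_seconds) in assigned_slots
```

Since every transmitter derives the slot from a shared clock, no coordination traffic is needed at transmit time; assigning disjoint slot sets to neighboring transmitters is enough.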
You can also make group calls. So there are, we call them rubrics, and you have some for weather alerts, DX clusters, etc. I'll come back to this in a moment. So what can we do with that? Well, pretty much whatever you want. You can send messages manually to a specific pager via the hampager.de website. There's an Android app; I think there's even an iOS app, but I don't know what the status is on that. Via the DMR infrastructure from Brandmeister, from APRS, from Tetra: basically, sending a text message from your radio will make it land on the DAPNET infrastructure, and then it will relay it to the person you want to call. Then there's an API you can use to send weather alerts. There are automated messages for urgent alerts, which will make all the pagers ring, for example. DX clusters, as I said, or status on solar flux conditions, etc.; that is something that is also sent every four hours on the platform. You could also build something for repeater telemetry or any IoT device that you want, but again, keep in mind this is a network aimed at amateurs, for amateurs, non-commercial, and please keep in mind that it is maintained by volunteers who do it in their free time with the servers they have access to. So don't start bombarding the network with telemetry that sends the status of your fridge every second, because that would be kind of a problem. So stay reasonable, but this is the kind of thing you can do, as long as it's non-commercial. Now, how can you get started? As long as you're a radio amateur with a call sign, you can register: right now there's a website to submit a ticket and we'll create your account. Once you do that, you have access to the platform and you can send messages. If you want to receive them, you'll have to buy, modify or build your own pager for the 439 MHz frequency. That's one thing, but then you need a transmitter somewhere. If you're lucky enough to have one within your living area, you're good to go, enjoy.
Otherwise, well, you can go your own way and install a hotspot at home, or you can make it a nice project for your local radio club and build a wide-range transmitter for everyone to enjoy. So there are two ways you can go. Speaking of specifics, acquiring a usable pager is relatively easy today. As I said, before you had to buy second-hand pagers, replace a quartz crystal and retune the receiver chain, but today we have more frequency-agile receivers that have PLLs instead of the quartz crystals, and that can be directly retuned, or directly bought to work on those frequencies. So one of them is the AlphaPoc 602R, which I thought I had, yeah, I have it on loan, but here it is. So this one costs about, I think it was 90 euros when we checked, on the AlphaPoc website directly from Germany. You can buy it on AliExpress, but your mileage may vary. So that's a way to get into it quickly. You could go higher range and buy those commercial ones, which are a bit more expensive but work as well, or you can go the DIY and free open source route and build your own using open source software, like a project I've been working on, which is the ESP32 pager, which Bastia also improved a bit on the UI side, because I suck at UI. Basically, using an ESP32 LoRa dev board, you can make it a POCSAG pager and have a receiver for quite cheap; I think those dev boards are about 15 euros on AliExpress as well right now. It's built on RadioLib, so it's also freely modifiable, so have a look if you're interested.
As for transmitters, you have two options right now. For hotspots: if you already have an MMDVM hotspot, well, you're all set; you just need to register it on the DAPNET and activate the transmitter. That's one way. If you want to build a wide-range transmitter, things can be extremely simple, because you just need a small single-board computer such as a Raspberry Pi and an FM transceiver: you feed the signal directly into the unfiltered audio path of your transmitter, and then, well, you're good to go. Basically it requires four components: the transmitter, the Pi, a transistor and one capacitor, so you can get on the air quite quickly. Other transmitters are being worked on; again, Bastia is working on an ESP32 transmitter to make a small hotspot even cheaper, if possible. So again, quite easily reachable. So where does that leave us? For me it's quite an elegant solution to receive text messages on our own independent networks, having fun along the way, learning how to use basic systems, implement them, and deploy networks that everyone can enjoy. And it has its uses, for telemetry or other things: you can do weather reports, emergency messages, text your friends via pager, send silly jokes, with the challenge of having them fit within 80 characters. There are ways to make snappy jokes, and intelligent ones at that. I think that thanks to the DAPNET network, and the arrival of sound cards that can act as TNCs instead of using an external module, the whole thing has become much more accessible. So, if I'm able to SSH into my hotspot, I can give you a quick demo of how that works. Give me a quick second. Who's got a pager here? One? Nice! Nice! Nice!
Very nice, that's already one; depending on what you registered in it, I don't know if I'll be able to make it ring. So basically, here you have my personal pager; this is one from a friend that I just borrowed; and this one, which just died on me, which is not a problem in itself, I'll just make this presentation shorter... oh no, it's alive. There you go. Those all have their own individual addresses. This one is 2069009... 206500, sorry; this one I don't remember; and this one is address 100. So I can make this one ring specifically: I just key the transceiver up and say I want to make pager number 100 ring. Please work, don't make me look silly... there you go. Right now only this one is ringing, so I just made an individual call to it. Now let's imagine I want to send a group alert, for, I don't know, some storm weather coming up, or a rare DX spot happening right now on 18 meters... 18 megahertz, sorry. Then I can make everything ring at the same time: 1040, and then everything rings, and it's just a nightmare, and I need to confirm it, otherwise it will ring again. There you go: quite simply, using basic addressing and basic open source software. This is just the hotspot, just an MMDVM here running in the background, and I can directly key the transmitter up. If you have access to the DAPNET system right now, and I think at least two or three of you have access, you can make an individual call to myself... There you go, he just sent me a message on my pager. And what did he just say? "How does a SQL expert get a date?" Okay, nice, very nice. So there you have it. If you have any questions... oh yeah, there's another open source project that is just coming up. Where is Alexander? Hello, didn't see you yet. If I'm not mistaken, you worked on a POCSAG decoder, which is getting finished up as we speak, for SDR++. I think it's important to report that as well; sorry I didn't get the time to fit it into the slides. But again, if you
have any questions, I'm... Jesus Christ... thank you for your attention, and yeah, I'm all yours. All right, I hear a question. Do we have a microphone, or shall I repeat the question? "So I live quite close to an old-school pager site." Yep. "They transmit very high power on VHF." They do. "Which causes interference with a lot of other stuff." Whoo. "Okay, so do you know in practice how much power the transmitters in this network need to be useful? And what happens when the pager misses a message, is there a retransmit, or do you just get one shot?" Um, well, very often in professional networks... I'll repeat the questions first. So, you have a problem: there's an interfering pager transmitter next to you because it's using high power; so, how much power are we using? And the second question, sorry, short-term memory, is: what happens when the pager misses a message? The first one: yeah, for commercial systems, very often they use 200 or 300 watts on the transmitters, because the signal needs to reach inside parking garages, and the antennas are lousy at best, so you need high power to get through. For amateur radio systems it's less of a... (now everyone is trolling me)... yeah, for amateur radio systems, very often we don't have that imperative of being able to reach everyone through parking lots, so very often the transmitters are 25 to 50 watts. I mean, going higher would cause problems such as what you're talking about, but yeah, usually we keep it low and we just add more transmitters. Here in Belgium that is a problem, because every time you add a transmitter you need to pay for an extra license, so we're still a bit limited, legally speaking, but it's not a problem in Germany or other countries where they don't pay repeater licenses, or they're much cheaper. Speaking about missed messages, there are two mitigation measures... well, actually just one, which is repeating the message. If it's lost, it's lost; if you don't get it, that's it, because there is no way to send an
ack. So either you receive it or you don't, and that links to the first problem; that's why the commercial systems use high power. There is no store-and-forward system in paging, so yeah, that's a small limitation. Other questions? Yes... You don't need a call sign to receive signals, specifically on radio amateur bands, so you could perfectly well use an SDR, or, I don't know, buy a pager and pick up some public messages. But to be able to receive messages addressed individually to you, or to be able to transmit, or at least access the platform, you would need an amateur radio call sign. But I mean, amateur radio is much more than paging, and I think it's worth looking into if you don't have a license yet. I'm not going to launch into my big talk about that, because I've done it about 25 times today, but yeah, there's a lot to discover in that hobby, and it might be worth looking into if you have the time to access it. Other questions? Yes... Yes, it does, it does: you can change the ringtone, make it go beep, bloop, whatever; you can even compose your own ringtones on some of them. The ESP32 pager actually has a provision for that: there are different tones and you just compose the music you want, so if you wanted to make it play Tetris, go ahead. There was one question here, and then... no question here, okay. "What's the frequency range of the receiver?" The receiver itself could be tuned pretty much anywhere on the UHF band, so 430 to 440, but the problem is it's using a loop antenna, which has a very high Q, so you need to retune it. Yeah, it's 70 centimeters, yep. Yep, there is one, there is one... if you have internet, if you're connected to the network here, if you log in to hampager.de you should be able to at least get the address book. So yeah, my time is up, thank you.
SMB for Linux with SMB3 POSIX extensions
Yeah, thank you. Just to introduce myself, my name is Volker Lendecke. As you can all see, I have worked on Samba since the mid-90s, last century actually, so for quite a while. And I think I don't have to introduce what Samba and SMB really are: they are file serving protocols. And what I would like to do eventually is kill NFS. I know this doesn't go down well in some communities, but this is what I'm working on in my spare time, when I have spare time; in the last few months, unfortunately, it was a bit limited. Some of you have already seen this talk at SambaXP or other conferences. There's a little bit of new stuff, but I think it's still interesting to see that you can actually serve SMB clients, or Linux clients, with SMB. So what is it all about? You want to share file systems, directories and files across a network. You have one server where you have a directory, where you have a file system, and you want this to be shared across a network to possibly many, many clients. If you go Linux to Linux, you typically use NFS, and one of the reasons is that it's so simple. What you do is just add a line to your /etc/exports, maybe you have to kill or restart a daemon or whatever, then you just issue a mount command on your client, and you're done. That's about it. However, it comes with some downsides. First, there is essentially no real metadata or data caching in NFS. This means it can regularly happen that you create a file somewhere and it doesn't really show up until a bit later on other clients. If you just write to directories, if you just write to files, other clients don't really see the mtime or size updates precisely, and so on. So this is kind of problematic. Why does the maildir format actually exist? Because locking doesn't work over NFS. And yes, NFSv4 has locking, and NFSv3 has external protocols to do locking, but you can't really rely on those.
And it's really, really complex to set up locking properly and to get failover done and so on. Then there's the initial, very simple setup, and I love this acronym for NFS: it's just "no file security". Because essentially what you do is trust your clients to assign the UIDs and GIDs, and essentially the group permissions and whatever, correctly on the client, and there's nobody in between who actually checks. I know these days there are protocol extensions to do NFS over TLS, so at least the transport is protected in a standard way. You can of course go and enable Kerberos for NFS, but this is also pretty complicated, and we have done it in customer scenarios; the client at least is buggy as hell, and you get incompatibilities all over the place. You lose keys, you lose everything. So it's really, really difficult to set up. As I said, clients have a very bad day when you Kerberize them. SMB, however, really comes from the Windows world. There was a talk by the original SMB implementer, Barry Feigenbaum. Is it available online, Günther, do you know? So at one of the conferences that we regularly go to, there was actually a talk by the original inventor, or developer, of the SMB protocol. Essentially what they did is they took the MS-DOS interrupt 21h and put the arguments on the wire, and let the server take care of it. And this means they had to be compatible with a lot of applications on DOS. And DOS means that applications like Word 5.5 or whatever believe they are alone on the machine. So this means you have to get locking right. If Word opens a file and believes it's the only one editing that file, you'd better make sure that nobody else edits that file simultaneously. So they had to get locking right from day one. The other one is cache coherency. We have protocol for this, and between Windows and Linux this actually works.
So if you open a file over SMB, typically what you get is permission to cache stuff: to cache your updates, to cache reads and so on. This leads to much, much better performance. And if somebody else also wants to open the file, you get notified: oh no, you're not alone on the file anymore, please drop all your caches, please write back your caches. And you tell the server, hey, I'm done writing back, now please let the other one in. And then they all have to agree to write back their changes and read new data from the server. One of the other advantages is that SMB servers are everywhere. Every home router in Germany, the FRITZ!Box, has an SMB server in there. All NAS appliances have SMB, so it is everywhere, and you can access it from almost any place. Whether all the features we are talking about here are correctly implemented everywhere, that's a different story; for example, FRITZ!Boxes don't talk to my mobile phone properly, but that's a different story. But essentially, it's everywhere. The SMB protocol is very flexible. There were very, very early extensions of the SMB1 protocol. Like in every protocol, you have a lot of requests going back and forth, and there is unused protocol space: you have, whatever, a create request, a read request and so on; they are numbered, and there's number space that you can take. And this is what we did early in the 2000s or so for the SMB1 protocol: there are UNIX extensions that match all the UNIX semantics in the SMB1 protocol. This was never properly transferred to the newer, and now only, SMB3 protocol. And what we are working on is extending the SMB protocol with all the behavior that a POSIX client expects. How is that done? The first packet that is sent between client and server is called Negotiate Protocol, and it does exactly what it says: it negotiates different flavors of the protocol.
For example, it says: hey, I'm SMB1, I'm SMB2, I'm SMB3, and I have this and this subfeature; I can do these capabilities, and those capabilities I can't, and so on. And what Microsoft did with the SMB3 protocol, they did the smart thing and made this request extensible. Essentially you have this Negotiate Protocol request, and you can add what I would call extended attributes to it over the wire. I mean, it's not an actual file system, but you can just extend the request in a standard way with a new negotiate context. So you have a ton of negotiate contexts that say: okay, I can do encryption this way, I can do whatever. And we just have an additional negotiate context that says: I can do POSIX in this version. So the client tells the server "I can do POSIX", and the server tells the client "I can". The default behavior for unknown contexts is that the server just ignores them and doesn't send a reply. If the server does send a reply, I know I'm talking to a Samba server that is able to do all this stuff I'm talking about here. File name handling: this is really painful in our case, because Unix file systems are case sensitive, and Windows file systems, in particular NTFS, are not. What does that mean? Under Unix you can have two files, Makefile and makefile, one with a capital M, one with a lowercase m; under Windows, under NTFS, you can't. When a Windows client now comes in and says "I want to create Makefile", what you have to prove at creation time is that no other uppercase/lowercase combination of Makefile exists in the file system, to fulfill the promise that this is case insensitive. What do you do by default? You scan the whole directory. And this leads to O(N²) performance behavior.
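As a rough illustration of the wire format, an SMB2 negotiate context is just a small type/length/data triple appended to the Negotiate Protocol request. The sketch below packs one; the POSIX-extensions context type value and the idea of a 16-byte GUID payload follow Samba's experimental extensions and are assumptions here, not official MS-SMB2 values:

```python
import struct
import uuid

# Assumed values from Samba's experimental SMB3 POSIX extensions work;
# the context type and the GUID are not part of the official MS-SMB2 spec.
SMB2_POSIX_EXTENSIONS_AVAILABLE = 0x0100
POSIX_EXTENSIONS_GUID = uuid.uuid4()  # placeholder; the real one is a fixed GUID

def negotiate_context(ctx_type, data):
    """Pack one SMB2 negotiate context: ContextType, DataLength, Reserved, Data."""
    return struct.pack("<HHI", ctx_type, len(data), 0) + data

# The client appends this to its negotiate request; a server that does not
# understand the type simply ignores it, which is the fallback behavior
# described in the talk.
ctx = negotiate_context(SMB2_POSIX_EXTENSIONS_AVAILABLE,
                        POSIX_EXTENSIONS_GUID.bytes_le)
```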
If you just drop a million files into a directory, file number 900,000 takes a lot longer than file number 1, because I have to scan the whole directory to prove that no other uppercase/lowercase combination exists. What we can do is add a new create context: not only the negotiate request, but also the open/create file request has these extended attributes. I can say that I want to open a file POSIX-style by adding one of these create contexts. And we have defined a create context so that clients, on a per-request basis, can say: I want POSIX behavior, I want case-sensitive behavior, I don't want file name restrictions, I want double quotes in a file name, which Windows wouldn't allow. I want them; I know what I'm doing; I'm POSIX. What we also need is POSIX metadata. If you look at the properties of a file from a Windows client... sorry. So we are here, Windows Server, I say Properties. There's a lot of stuff. In particular, there are timestamps: created, and so on. We have four timestamps in Windows that are roughly similar to what we have in Linux. We have attributes and so on; there's a lot of metadata that Windows has. However, the semantics are a bit different. In particular, they don't have a good notion of UID and GID, and they don't really have a good match right now for POSIX permissions. Some of the fields we have in struct stat, like file size and so on, are the same in Windows, but in particular UID and GID are not. So what we did is extend the protocol. If you, for example, do a stat on a file, if you ask for file information, you can say "I want this info level", and there's a 16-bit field for info levels. We just added one. We talked to Microsoft: hey, give us this additional number that we use for the POSIX information level, and don't use it for anything else. They agreed, and so we have an additional info level that we can use to fill in all the information a client might want.
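The O(N²) behavior is easy to see in a toy model. Below, a hypothetical case-insensitive create scans the whole directory on every call, while the POSIX-style create is a single set lookup; neither function is real Samba code, just a sketch of the two semantics:

```python
def create_case_insensitive(directory, name):
    """Naive case-insensitive create: scan the whole directory to prove no
    other case combination of `name` exists (O(N) per create, O(N^2) total,
    as described in the talk)."""
    lowered = name.lower()
    for existing in directory:          # full scan on every create
        if existing.lower() == lowered:
            raise FileExistsError(existing)
    directory.add(name)

def create_posix(directory, name):
    """Case-sensitive POSIX create: one hash lookup, no scan."""
    if name in directory:
        raise FileExistsError(name)
    directory.add(name)
```

With a million files already present, the first function touches every entry on each new create; the second does constant work, which is what the per-request "case sensitive" create context buys a POSIX client.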
However, second-to-last line: none of this is really the topic of this talk. It's about file types. If you look at a Unix file system, you have seven types of files. You have a normal file. You have a directory. What else do we have? We have block and character devices. We have named pipes. We have symlinks, oh, shock and horror. And we have sockets, Unix domain sockets. Samba can handle regular files and directories extremely well; oh, there's a typo here, as you can see. So we can handle directories and files. I mean, that's what we are made for: we are a file server, so we'd better handle directories and files well. What do we do about the other ones? If you go and share /etc in Samba... sorry, share /dev in Samba, something you probably shouldn't do, but if you do, Samba will find a lot of stuff that it can't really properly present to Windows, to any client. It will find character and block devices; it will find all sorts of stuff in /dev. Or, if you just share a home directory, you will find sockets for gpg-agent, ssh-agent and so on; you will find all sorts of stuff that doesn't really fit into the file-and-directory schema. In particular, for example, you find FIFOs. In previous Samba versions this used to work: a client could come in and open a FIFO for writing, hoping that the server-side process on the Unix machine still existed, and it could write into it, and that server process would get the data written into it. This can't be very popular, because many versions ago we broke it and nobody noticed. Alexander is confirming. You're using it, Alexander? We have a lot of tests, but Alexander's comment was that we don't cover this, which means we didn't notice. Why didn't we know, or why did we break it? If you open a FIFO under Unix, all you can do is issue read and write syscalls.
We don't do that in Samba anymore, because whenever we get a read or write request from Windows, there's an offset attached to it, like in NFS. And we do the natural thing: we pread and pwrite, like what you normally do when you have an offset. This is all from times when you couldn't really expect pread to exist, but those times are long gone. We have some very special support for sockets. What's a socket? It's essentially... a FIFO on steroids. And what we do with sockets is implement the Microsoft notion of RPCs. What is that? A Microsoft Windows client, over SMB, can open a special file on the share IPC$, \pipe\winreg, and transfer data over this special file. What you do is run winreg: you open a file on the IPC$ share, winreg, Windows registry, and you talk to the server-side registry over RPC calls. And we implemented, these days, since 4.16, that our Windows registry server actually listens on a Unix domain socket, and the SMB server connects to that Unix domain socket and just passes requests back and forth. So this is what I mean: we have limited support for sockets, but this is not what somebody would expect if, say, an ssh-agent were listening server-side for clients to connect to, because all of that needs to be done on the client side then. Block and character devices: I mean, we find them server-side, but they don't make sense at all over the network. You don't want to read and write to /dev/sda over the network. You just don't want this. You could, but why? Enter NTFS reparse points. There's actually a Wikipedia article on NTFS reparse points: reparse points provide a way to extend the NTFS file system. A reparse point contains a reparse tag and data that are interpreted by a file system filter driver identified by the tag. What does this mean?
One use case is HSM systems, hierarchical storage management, where you have a huge file on NTFS that some software pushes to tape, leaving a stub inside the NTFS file system that is visible to the client as a normal file. Now when the client opens the file, the open code sees: okay, this stub is a reparse point, and the extended data the reparse point carries points at a place somewhere on tape; it's on this tape at that offset. What you can do then in Windows is install a driver, so that when a client opens this file, the Windows kernel goes to the tape library and says: get me that file back. So this is software you can install in the Windows kernel to extend NTFS semantics. And this is, by the way, what the NFS server uses, and we will see an example of this. So applications can use this for arbitrary blobs. It's a special marker on a normal file that says: oh, I am a reparse point, and you can store stuff in there; essentially it's an extended attribute. When opening a file, NTFS filters can interpret the contents. This is also what Microsoft actually uses for symlinks. Windows has symbolic links; they are stored as reparse points. If you double-click on such a reparse point, and I can demonstrate this here, I know demos never work... I have a file, and I will show you how I created it. I double-click on the file. Oh, okay. Wait... ah, the .txt file. Here it says "text document", which is just a description saying this is a .txt file. I double-click on it, and what it says is that the file cannot be accessed by the system, because this is a reparse point that happens to be named test.txt or something, and Windows believes: oh, we have to open Notepad; but Notepad can't access that file. The error status you get if you double-click on that file is STATUS_IO_REPARSE_TAG_NOT_HANDLED. You have to tell the server: oh, I want to open this special file in a special way.
You have to set a flag. So a reparse point, as I said above, has a so-called reparse tag, which is a 32-bit integer. If you look at the Microsoft documentation, Microsoft uses these reparse tags and documents their use to a certain extent, and there are a lot of them. If you go to that website, there's a ton of reparse tags: reserved 0, reserved 1... What you see here is, I hope you can read it... no, you should be able to read it: that's HSM, that's HSM 2, and so on and so forth. Filter manager, reparse tag, symlink. So this is what Microsoft defines in their spec, these sets of reparse tags, and you get the integer there. The symlink tag is 0xA and then a C at the end. And we are about to use this. So now we have two kinds of users of these reparse tags. Do you remember WSL1, version one of the Windows Subsystem for Linux? They tried to run Linux applications on Windows, and they faced the same problem: Linux applications expect sockets and FIFOs and symlinks to work. And in version one, they actually used NTFS for your home directory, for your local files. And what they did is, they have this reparse tag, "address family unix"; they use that. And what you will see here... it must be somewhere... if you dig a bit deeper, what they tell you is: the contents of these reparse tags are not meaningful over the wire. They were intended just for the WSL subsystem, the Windows Subsystem for Linux, server side. So they define, as part of the data stored in these reparse tags: hey, we have a block device, a character device, a FIFO and so on, with the obvious counterparts on Linux. So what they did in WSL1 is, when somebody did a mkfifo, they created a file with a reparse tag, and in the content of the reparse tag they said: hey, this is a FIFO. None of them are actually documented. And because that caused so much trouble, version two of WSL, which I actually... is anybody using WSL? Some are. It's actually usable.
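To make the tag discussion concrete, here is a small sketch of the reparse buffer header and a few of the tags mentioned. The numeric values are quoted from memory from Microsoft's MS-FSCC tag table, so treat them as assumptions and verify against the spec:

```python
import struct

# Tag values as listed in the MS-FSCC reparse-tag table (verify before use).
IO_REPARSE_TAG_SYMLINK = 0xA000000C   # the "0xA ... C at the end" tag from the slide
IO_REPARSE_TAG_NFS     = 0x80000014   # one tag for all NFS special files
IO_REPARSE_TAG_AF_UNIX = 0x80000023   # WSL1: unix domain socket
IO_REPARSE_TAG_LX_FIFO = 0x80000024   # WSL1: FIFO
IO_REPARSE_TAG_LX_CHR  = 0x80000025   # WSL1: character device
IO_REPARSE_TAG_LX_BLK  = 0x80000026   # WSL1: block device

def parse_reparse_header(buf):
    """Split a reparse data buffer into (tag, data): 32-bit tag,
    16-bit data length, 16-bit reserved, then the tag-specific data."""
    tag, data_len, _reserved = struct.unpack_from("<IHH", buf, 0)
    return tag, buf[8:8 + data_len]

def describe(tag):
    """Dispatch on the tag; Explorer's mistake, roughly, is skipping this
    step and treating every reparse point as a symlink."""
    return {IO_REPARSE_TAG_SYMLINK: "symlink",
            IO_REPARSE_TAG_NFS: "NFS special file",
            IO_REPARSE_TAG_LX_FIFO: "WSL FIFO"}.get(tag, "unknown")
```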
I would say it's actually usable; you can't really tell the difference from a real Linux, at least I can't. I mean, if you look at /proc, of course you will find differences, but for normal day-to-day use it actually works pretty well, because what they are using these days is a real ext4. Then there is a Windows NFS server. Pardon? Why? The question is why. I don't know. The Windows NFS server, which is what I'm going to present here, hopefully in a demo: they also have the same problem. A client does a mknod or a symlink or whatever, a mkfifo, and they have to store the data somewhere. And they define yet another set of reparse points. And if you look here, they actually have a definition of what goes into the data field: reparse tag, reparse length, and so on, in general. So they define symlink, character device, block device and so on, and they actually specify what goes into the data field. For a symlink, the target goes in there; for a character device, you have two UINT32s for major and minor; and so on. So they define what goes in there. I mean, you would have thought that these guys would talk to those guys to share an implementation, but no. Why? The interesting thing is, I created a FIFO server-side, and if you look at the properties of this FIFO, and you have to trust me, those in the first row can confirm, you have an L here. It says "archive" and "L", L for symlink, if you look up the documentation. No, it's not a symlink, it's a FIFO. So their GUI is not really prepared for this; the GUI believes all files that are reparse points are symlinks. Alexander? Is this client side? Client side? I can demonstrate what I see. Because this is a local file, right? That's a local file that I created over NFS. Okay, so the NFS server created, on a local file system, something with this reparse data associated. Yes. So this directory here is a local disk share.
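The NFS server's data layout described here (a type marker first, then type-specific fields such as the two UINT32s for major and minor) can be sketched as a round trip. The 8-byte type-marker constants below are made-up placeholders, not the real MS-FSCC NFS_SPECFILE_* values, and the exact field widths are assumptions:

```python
import struct

# Hypothetical 8-byte type markers; MS-FSCC defines the real NFS_SPECFILE_*
# constants, so look them up before relying on this layout.
NFS_SPECFILE_CHR = 0x1111111111111111
NFS_SPECFILE_LNK = 0x2222222222222222

def nfs_chr_data(major, minor):
    """Data field for an NFS character device: type marker + two UINT32s."""
    return struct.pack("<QII", NFS_SPECFILE_CHR, major, minor)

def nfs_lnk_data(target):
    """Data field for an NFS symlink: type marker + UTF-16LE target path."""
    return struct.pack("<Q", NFS_SPECFILE_LNK) + target.encode("utf-16-le")

def parse_nfs_data(data):
    """Dispatch on the type marker, the way a server must before it can
    present the entry as a device node or a symlink."""
    (kind,) = struct.unpack_from("<Q", data, 0)
    if kind == NFS_SPECFILE_CHR:
        return ("chr",) + struct.unpack_from("<II", data, 8)
    if kind == NFS_SPECFILE_LNK:
        return ("lnk", data[8:].decode("utf-16-le"))
    return ("unknown", data[8:])
```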
This is a local NTFS file, and what I did is export it via the NFS server and mount it from the client. Why don't I show it directly? I mounted it from the client, which is here; that's my client with a mount; if you look up at the top: NFS. And you can see in the left column here, I have symlinks, I have block devices and so on. And I created them with normal UNIX commands over NFS. And this is what ended up on the NFS file system server-side: reparse points. And this is not too popular with Windows applications. The Windows Explorer believes all files with reparse points must be symlinks, because that's the most popular use of reparse points in the Windows world. Okay. So they don't look at the reparse tag; they just see that this is a reparse point, and it must be just that one type. Question? Yes. Alexander's comment was, and I will show you that in the Wireshark trace in a minute: there is a special flag in the metadata of the file that says "I am a reparse point". You can of course get into the details of that reparse point if you want to, but if you're Explorer, you don't care; you say it's a symlink. Okay. Now this is a discussion: do we use these guys, or do we use those guys, to represent, to present to the client, when Samba finds a symlink on the Samba side? For symlinks, we even have three options. So, WSL version one has reserved reparse tags, and if you look at one of these lists that I've shown you, you have reparse tags for the individual subtypes, but they are not documented; they are not used anymore at all; you don't have any interoperability with anything else. We could of course use them. So in the case where Samba on disk finds a symlink or a block device, how do we present that? We have to make a choice. WSL defines reparse tags with undocumented content. NFS uses only one reparse tag. The pro of NFS would be: we have documentation available. And so on.
And what we can do is write protocol-level tests against the benchmark, which is the Windows NFS server. So we have ways to create these things on Windows and just write tests, which is very good. Also, say you now want to create a FIFO from your Windows client, which has mounted the home directory of a user on a Windows server. If you do that, the Windows client will create a reparse tag that an NFS client talking to that same file system on the Windows server will also see as a symlink, or as a FIFO, whatever; the same thing. And this is why I would say: okay, I would like to use the NFS reparse tags. I have to talk to the CIFS kernel developers; I think with Linux 6.8 they went a different route. Andreas, do you know? No. So I think they went a different route, but we need to talk. Coming to symlinks. Symbolic links in BSD Unix are, depending on how you look at it, the best idea since sliced bread, or the worst nightmare that everybody falls over security-wise. Even the Rust infrastructure, I mean, Rust being a very security-sensitive language, had its symlink race security bug. But we have to deal with them, we have to live with them; they are there. So what do we do when we see a symlink on the Samba server side? Yeah, we can do that with the two ways that I presented. But as I said, Windows even has its own notion of symlinks. So if you create a symlink, depending on where you come from, you get one of three versions, three ways to represent them on NTFS. And if you look at it, this Windows way of symlinks actually works pretty well over SMB in the pure Windows world. For example, you can have a symlink in a directory on NTFS that is shared over SMB, and the symlink target can be \\<IP address>\<share name>\<directory>. And if you want to cat that file, or if you want to cd into that directory from a Windows client, it will redirect to that server.
So you can have cross-server symlinks with the Windows NTFS notion of symlinks, the pure Windows notion of symlinks. Even better: if you try to open a symlink the Windows way, you double-click on that file, and under POSIX you would typically follow that symlink directly; if you mount that over NFS, the NFS client has to take care of those and follow them client-side. But Windows does it a bit differently. When you double-click on that, or when you open that symlink file, they tell you: hey, you hit a symlink. And in the error response, they will even tell you: by the way, the symlink points there. That saves at least one round trip, or several round trips: if I hit a symlink, I know, directly from the response, where to go on the client side. And Windows is typically completely path-based, so if I open a file \a\b\c\d and somewhere in the middle there's a symlink, they don't follow it server-side; they tell me: hey, go there, and by the way, I have parsed \a\b already, and c was the symlink. So if I have a long path with many components, they tell me: okay, the third component is a symlink. Okay, how do we create these special files? Protocol-wise, there's a special flag for the open call, and we just set the contents. And what we can do with Samba, what we don't want to do and what we will never do: if a Windows client comes in and creates a symlink the Windows way, we will not create a symlink server-side. What we will do, because Windows symlinks are also represented by normal files, 10 minutes left, they are represented by normal files with some special contents, with some special extended data, is the same. So if you do a mklink from Windows, we will create such a file, telling the client: hey, this is a symlink file, and the Windows client will just work as it does.
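The client-side retry this enables can be sketched like so. This is a simplification: the real SMB2 symlink error response carries byte lengths of the unparsed path rather than a component index, and the hypothetical function below works on slash-separated paths:

```python
def resolve_after_symlink_error(path, symlink_index, target):
    """Client-side path rewrite after the server reports 'component N of
    your path was a symlink pointing at `target`'.

    Everything up to and including the symlink component is replaced by the
    target; the remaining components are appended unchanged, and the client
    retries the open with the rewritten path."""
    parts = path.strip("/").split("/")
    remainder = parts[symlink_index + 1:]
    return "/" + "/".join([target.strip("/")] + remainder)

# Opening /a/b/c/d where component 2 ("c") is a symlink to /x/y:
# the client retries with /x/y/d, without extra round trips to discover
# where the symlink sits.
```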
OK, let me switch to a shell for a quick demo. And what you can see is `ln -s foo bar`. This means I do a symlink from bar to foo, I believe. I always get that wrong. Yeah. So what I did is, and this is a file that actually lives on NTFS shared via NFS. And what we should see here now is that this is file bar. I created that. Now what I'm doing is I have my little user-space tool, test start, and, you know my password now, that I always use for Windows boxes. OK, what does this do? It creates a connection to that Windows server over SMB. And I just get all the metadata over SMB. And let me just tcpdump that. Oh, this is wrong. tcpdump cannot overwrite its own files. That's very strange, I know. OK, let me Wireshark this. And what I see is a symlink. It's a symlink. What I could have done, actually, is extend this command output with "the symlink points there". Haven't done that yet. Maybe on the train back home. Let's look at that Wireshark trace. In the background, I have my connection for RDP running. SMB2. So here you go. It's a bit verbose. But what I want to point you at is: I try to open the file bar, which is the symlink. And it says, reparse tag not handled. Then I open the file again. And don't be confused by the create request. Create request is the catch-all open-file thing. And there I tell the server, hey, I want to open this file. And I don't want you to interpret it. I don't want the HSM engine to go running. I just want to open the file. I want to open the HSM stub or the symlink as such. I want to see the metadata. And it says, OK, here you are.
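The `ln -s` argument order the speaker second-guesses can be checked in any shell. A minimal sketch (the file names and the /tmp path are my own choice, not from the demo): the target comes first, the link name second.

```shell
# ln -s TARGET LINK_NAME: target first, link name second.
mkdir -p /tmp/symlink-demo && cd /tmp/symlink-demo
echo "hello" > foo     # the real file
ln -sf foo bar         # creates a symlink named "bar" pointing at "foo"
readlink bar           # prints the link target: foo
cat bar                # follows the link and prints: hello
```

So `ln -s foo bar` does indeed make bar point to foo, as the speaker suspected.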
And a bit further down, what I can say is, OK, I can get the reparse point data out of this file. And what I can see here is, oh, this is a reparse tag NFS. And by the way, this is a symlink with a target foo. So this is data that the NFS server gave me. This blob here, which is somewhere here, was created by the NFS server. And so we can just utilize it. And we will utilize it. Before I take questions, I have one more slide that I want to talk about: long-running compute jobs. Very quick overview. If you have your HPC job farm, the one thing that gets in your way is all this file security. You want NFS-style no file security on long-running compute jobs, because if a machine dies, it just, yeah. This is a trusted environment, and you just want your jobs to continue existing. SMB actually has provisions for this. What you can do with SMB is, you can create a machine account, you can give the machines a password, essentially something like a keytab for Kerberos. And you can, this is standard Windows protocol, extend the connection to a share with yet another tree-connect context saying, OK, dear server, I know what I'm doing. You trust me by my machine account. For this connection, please use this UID and GID on this share. This is a standard SMB protocol extension, and this is what needs doing before we actually can claim success and say, OK, we can also cover these long-running compute jobs properly, like you can with NFS or any other file sharing protocol. Not implemented yet, neither server-side nor client-side, but it's there. Yeah. Mark? So the machine account, the machine accounts are authorized to present any of the IDs? Correct. The comment was that SMB has a provision that you can trust a machine, identified by the machine account database, whatever, you know what I mean. You have, server-side: this machine is trusted for doing the no-file-security thing. It's an SMB protocol extension.
OK, this is not really... thanks for your attention. Any questions? No questions. This is not good. Fun? Just an observation: WSL version 1, do you really want to implement that, with more obscure data remaining on some obscure machines? I would suggest forgetting about it completely, just because of that. The comment was questioning whether we want to go the WSL1 way with these reparse tags. Talk to Steve. Talk to Steve French, the main author of the cifs client. I mean, it basically is him. Steve French and Paulo Alcantara, those are the ones who I believe for Linux 6.8 have implemented the WSL1 way, here, this one here. If you look at LWN, they can now create block and character devices, and I think they went the way of WSL1. But I mean, talk to Steve. The comment was that WSL1 is the only one under Windows Server 2019. There you go. Any other questions? The question was how the current cifs client deals with these reparse points. That's actually what is covered here. They start to properly implement that. So they already have support for symlinks, the Windows way, because, I mean, they are there. But they start working on all the other ones that we were talking about. Mark? It's work in progress. So I mean, parts already exist. Can you repeat the question? Mark was asking about the status, what's currently implemented. It's slow progress. Parts already exist. Other parts don't yet exist. So I don't know when we can actually claim that we do full SMB3 Unix extensions. I can't promise anything. One more? No, time's up. I think we are pretty strict here. Just come to me later.
MicroCeph: Get Ceph Up and Running in Minutes
Hello, welcome. My name is Peter Sabaini, from Canonical. I'm a software engineer there. I work on various Ceph stuff, and I'm very excited to present MicroCeph, with the tagline "Get Ceph up and running in minutes", unlike my slideshow. So, problem statement. MicroCeph packages Ceph. Ceph is a big, complex system with distributed configuration, distributed components, a complex bootstrapping procedure and complex operations. It also has non-trivial hardware requirements. It's not just like you can download a package, install it on your notebook and be ready to go. This has an impact on uptake and adoption among users. So if you're, for example, a famous physics research organization with thousands of nodes in your storage cluster, you probably have trained staff on hand 24/7. So you're good, you don't need MicroCeph. If not, if you don't have a team of trained experts on hand, maybe MicroCeph is something for you. So what is MicroCeph? MicroCeph is a single-package Ceph cluster. Everything is in one file. We designed it for simple setup, so you can get a running Ceph cluster in four command lines. And it runs on your notebook. You just need one node with, obviously, one hard disk. So the simplest possible Ceph cluster you can do is: install MicroCeph, bootstrap the cluster, add some simulated OSDs, disk drives; these are loop files in this case, no extra block devices required. And then wait a few minutes and your Ceph cluster should be ready to go. How do we do this? MicroCeph is a snap package, as you might have guessed. Snap packages have the benefit that you're completely isolated from the host system. All the userland is in a separate namespace. You just need the kernel, network devices, block devices, hardware, etc. to get up and running. This gives you good isolation from the host system and a consistent environment across different operating systems. Some other goodies: isolation from the host system also means its access is isolated.
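The four-command setup described here looks roughly like this. This is a sketch based on the MicroCeph quickstart, needs snapd and root, and the loop-file size and count are my own illustration:

```shell
# Single-node lab cluster; loop-file OSDs are for testing only.
sudo snap install microceph
sudo microceph cluster bootstrap
sudo microceph disk add loop,4G,3    # three simulated 4 GiB OSDs
sudo microceph.ceph status           # should reach HEALTH_OK in a few minutes
```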
The snap package just cannot do anything it wants on the host system, which is good for safety, security and robustness reasons. And you have standardized risk levels. So if you want to install release candidates, etc., there's a standardized way to do this. A little bit of an overview of the MicroCeph architecture. You have a service management daemon that manages the standard Ceph components and has a distributed database, dqlite, for storing configuration and node topology. Also included in the snap package is a CLI that talks to the service management daemon via an API. All the Ceph components are just the standard Ubuntu Debian packages, no special binaries involved here. I mentioned the service API: everything in MicroCeph happens via this service API. Things you can do with the API: listing block devices, adding or removing nodes, adding and removing disks. Everything works via the API, and the included CLI is just another client for this API. So this is obviously great for integrating it into other systems. Some more internals. MicroCeph is built on the MicroCluster library, which provides this distributed configuration database, which is using Raft for consensus. It also provides cluster membership and an API framework. I already talked a little bit about scaling down, so single-node systems work. One important component here is that we automatically manage the CRUSH rules for Ceph. So this means that as you start up with a single node, you get a failure domain of OSD, so in effect your single-node clusters work out of the box, but if you add more nodes, your resiliency and your failure domain get scaled up automatically. It's also possible to provide custom CRUSH rules. This is important, for instance, if you go for larger failure domains, for instance if you have a failure domain of rooms or racks, you can implement this. MicroCeph itself doesn't know about your rooms or racks, but it won't step on your toes if you provide a custom CRUSH rule set.
Ceph is famously scalable to thousands of nodes. MicroCeph's scalability upwards is primarily bound by the Raft algorithm used in the dqlite database. For performance, I would like to note that we're not sitting in the data path anywhere for Ceph operations. You get the standard Ceph performance behavior with MicroCeph as well. Some integrations. MicroCeph is the storage back end for a number of projects in Canonical, for instance Sunbeam, MicroK8s, MicroCloud and LXD. There's also, if you're running Juju models, a charm available, currently in beta, to integrate MicroCeph into your Juju clouds. Last but not least, there's a nice little GitHub Action that we provide to integrate MicroCeph into your GitHub CI workflow. So if you need, for instance, an S3 endpoint for your testing pipeline, this is an action that would help you with that. OK, on to demos. I prerecorded these demos for time reasons and also because I'm very bad at talking and typing at the same time. So let's see how this goes. So this is the single-node setup we talked about before. I'm going to install the single-node MicroCeph cluster, give it some simulated disks and enable a RADOS Gateway, which will give you an S3 endpoint. Yeah, installation. We have the standard "stable" risk level set here, so this is what you get by default. You see my slow DSL connection here. Yeah, so we bootstrap the cluster. This is done pretty quickly. We can see now that we have a few services running already, but no disks. Then we add some simulated disks. These are just loop files. This is useful for lab environments or for testing. Don't use it for production. For production, you would use separate block devices, obviously. But if you want to get going on your laptop, that's the way to do it. We enable the RADOS Gateway. You can see it is active here now. We create a RADOS Gateway user. This is just the standard Ceph way to do it. It's a little ugly command line here. And yeah, and we're done.
We can use our favorite S3 client to access our new RADOS Gateway endpoint. So just to prove that it works, we are creating a bucket here and putting some image up in this bucket. Yeah, so that's it for this demo. So this is the simplest possible case. Let's do something a little bit more complicated. Say we have got a few extra nodes now. We want to expand the cluster and provide it with real block devices. This is the way we do it. I'm now using the "candidate" risk level because I want to use some features from MicroCeph that didn't make it to the stable risk level yet. So to cluster MicroCeph, you need to get a token from the bootstrap node, the first node that we provided, like this: name the node you want to add and get the token for it, and provide that token to the node that you want to add, here. So, and yeah, small typo. These happen as well. And yeah, and now all our nodes are clustered. Let's check MicroCeph status. We can see all our new services here, but the new nodes don't have any disks yet. So let's add some disks. What I'm going to do here is use a feature that comes from the release candidate, which automatically probes for empty block devices. Anything that's not mounted and is clean we take as a block device, with this switch. Let the thing settle a little. You can see there are lots of virtual disks from QEMU. And we have a lot of disks in our cluster. So the Ceph cluster is still settling a little bit, but we suddenly have a lot more space available. So one thing we can do is provide a second RADOS Gateway endpoint. Now we can see that the data we put in before is still here. So that's reassuring. And what we'll try to do now is put in another OSD on the first node, but this time we want to make it encrypted. So full disk encryption is something we provide here. It relies on the dm-crypt kernel module. Not all kernels have that, so that's a little bit of a gotcha. You need to make sure yours does.
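The token dance for adding a node can be sketched from the MicroCeph CLI like this; hostnames and the device path are illustrative, and the `<token>` placeholder stands for the string the first command prints:

```shell
# On the first (bootstrap) node: register the new node; this prints a join token.
sudo microceph cluster add node2
# On node2: join the cluster with that token.
sudo microceph cluster join <token>
# Then add the node's empty block device (destructive; lab use only).
sudo microceph disk add /dev/vdb --wipe
sudo microceph status
```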
And also, this goes by so fast: this is also something that the snap is not allowed to do by default. You need to connect the dm-crypt interface explicitly to make this happen. But once you do, it will give you an encrypted OSD device. That's the one up there now. Just to prove that this is a dm-crypt device, there's a cryptsetup here. Well, we still have the loop-file OSDs from the first node that we originally installed. Let's remove those. We have plenty of block devices now, so our cluster has real disks. So as a last step for our production cluster: something that snaps do by default is auto-refresh. This is something you don't typically want for your Ceph cluster. You want to control updates for your Ceph cluster. And so a step you do is hold all the snaps and prevent auto-refresh, so that you can refresh or update your software at your own leisure. So, yeah, that's it for the demos. A short outlook on what comes next. We want to make the clustering experience a little smoother still: no passing around of tokens. So one thing we plan to do is, on the local network, use mDNS to discover new nodes. Another thing that we want to do in the near future is provide built-in HA and load balancing for RADOS Gateway endpoints, and also RBD mirroring support. So that was it for the demos and for the... Thank you. Any questions? I don't know if we have time for questions. One question maybe. Otherwise, I'll be outside. Just talk to me and I'll be happy to answer your questions. Oh, sorry. Here you go. Which CPU architectures do you support? Can you repeat the question, please? Which CPU architectures do you support? So snaps are pretty flexible. We develop on AMD64, but ARM is tested. I don't know off the top of my head... But ARM, AMD64, POWER and RISC-V, I believe...
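The two production steps mentioned here, holding auto-refresh and allowing dm-crypt for encrypted OSDs, look roughly like this; the device path is my own illustration and the commands need root and snapd:

```shell
# Pin the snap so Ceph upgrades happen on your schedule, not snapd's.
sudo snap refresh --hold microceph
# Allow the snap to use the dm-crypt kernel module, then add an encrypted OSD.
sudo snap connect microceph:dm-crypt
sudo snap restart microceph.daemon
sudo microceph disk add /dev/vdc --encrypt
```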
Welcome to Testing and Continuous Delivery devroom
All right, good morning everyone, and thank you for coming so early. I'm just going to take less than five minutes of your time to say the welcome, and then we continue with the awesome speakers that we've got. As you can see, there's a lot of us here today. If you aren't aware of the history: in the past we had the Testing and Automation devroom, which was separate from the CI/CD devroom, and this year the two devrooms have been merged; going into the future we will continue like that. So we have the two teams from the devrooms organizing together this year. So we start with Anders. Say hi. Yeah. Then we have Jan. Olivia, Fabrizio, Carlos. We've got Sirio, who is at home at the moment. He cannot join us because he has newborn small kids. And my name is Alex. Nice to meet you. So you all know the rules, I think. Don't be a jerk. Enjoy the presentations. If you see more people coming in late, go towards the middle so that they can squeeze in on the sides and not jump over you. And if you want to talk to the speaker, it's up to them whether or not they will be taking questions during the presentation or after, but we do not have a lot of time for switch-over. So if you want to talk to the speaker, please take it outside so the next speaker can set up and we can continue, and then you can just come back in. And this is it, I guess. Thank you again for coming, and let's start.
Streamlining Developer Experience: The Power of CI/CD Standardization and Interoperability
I have... I've just been informed that I have a two-hour talk. So we're going to use that time wisely. Hopefully. We also have like a minute, so I can't start the talk for another minute. So with that: who's... is this your first time at FOSDEM? I'm also raising my hand for first time; tried for years and finally got here. So cool, glad you're all here. So now it's like 25 seconds, we have to kind of just, whatever. Yeah, everybody awake? What was the latest you were out? Who went to bed at like 10? OK, good, nobody. And that doesn't mean 10 this morning, having been up since. Who went to bed after midnight? One, two, three... 3:15. That was 3:15. Four? Are you... you're still awake. You're still good. OK. All right, so we're going to start now. Hi, I'm Jeremy. We're going to talk about streamlining developer experience: the power of CI/CD standardization and interoperability. We're really going to touch on, when we think about developer experience, what's the role of CI/CD in that, and how it fits within all of the different tools and systems that we use. So I'm going to talk about that. A quick note: on a fair amount of these slides, because I evidently had time on my hands, I used ChatGPT and DALL-E for the images. So that is very interesting; don't go into it thinking you're going to get exactly what you want. As you'll see on some of the slides, it's a little weird, but why not? So we're going to jump into that; figured I'd try something new. OK, so as I said, my name is Jeremy Meiss. I'm the co-founder of a kind of stealth DevEx startup right now. Hopefully we'll have some news in the next couple of months. But yes, so, Jeremy Meiss, I've been in tech for a couple of decades.
Previously, most recently, I was at CircleCI for about three and a half years, running the DevRel and community team, doing a lot of talking around CI/CD and stuff. So that's me. Now, I did have some early feedback on the title of this talk. So Gray had a lot to say: this is probably heretical, what I'm going to talk about. I don't know about that. Heresy? I felt that was kind of harsh. He hadn't even heard the talk and already he's giving feedback. But we're going to talk about this evolving landscape of software development, especially in the modern world. If you've ever seen the CNCF landscape: it could not even fit on one slide. I mean, it fits on one slide, but there's no way you're going to read it. That's how big it has grown. I really should have had a slide that showed some progression over time. But this was a couple of days ago; I'm sure it's grown in that time. Continuous integration, when you zoom in, has a good section of that. And CI/CD really stands as this kind of transformative pillar that has reshaped how we look at software, how we look at deployments, and how we look at delivering quality software, hopefully quality software, to the users, to the companies and such, and driving that very experience. I also put out: when we think about developer experience, what is the shortened version of that? And we're going to use DevX. The internet has spoken, so we're going to use DevX, not DX. So you all say DevX for short, instead of saying all of "developer experience" over the next three hours I think we have. OK. So developer experience: defining it, it really encompasses the journey of developers as they're learning and deploying technology, whether that's software or even hardware, which kind of fits into that.
And when you have a successful developer experience, it really focuses on eliminating the obstacles that hinder a developer or a practitioner from being successful in their endeavors, whatever they're trying to do. Now, CI/CD's transformative influence on the developer experience is really pretty profound, because we've had this dynamic shift in how developers over the years have collaborated, how they create, how they deliver software. And by automating the pipelines, the integration, testing and deployment processes, all those things, it really empowers developers to gather the feedback necessary, with faster feedback loops, so that they can improve code quality and continue to iterate swiftly. That is not a Taylor Swift drop, it's just "iterate swiftly". But streamlining workflows helps to reduce a lot of the friction that we see and provides a lot of intuitive tools. And so good DevX empowers developers to focus on creating that high-quality code we talked about, fostering innovation and ultimately contributing to faster, more reliable software delivery. So we're going to hone in on two of the critical pieces of what that looks like in CI/CD: standardization and interoperability. So from the CI/CD standardization side, that really brings the consistency necessary to your pipelines, so that you can reduce friction and enhance the collaboration between your different coworkers or different teams. So we're also going to look at a few open-source tools. We're going to look at Argo and Flux. I'm not going to bring up any demos or anything, but we're going to talk about some of the features that they have that work really well with this standardization idea: standardizing processes and how you deliver good software that way.
Then we're also going to talk about the interoperability side, which ensures seamless integration across multiple different tool sets, everything from observability to potentially different frameworks, all the different tools that integrate with that. So with that, we'll look at some of the features that Spinnaker has, and also tools like Backstage, and how they work with the developer experience on the interoperability side, bridging tool-chain gaps and such. At the end of the whole thing, we're going to really dive in, not really dive, but just summarize how both of those things play a pivotal role in optimizing developer experience and improving overall productivity, which is really the idea. All right, so the standardization side. That really means we're trying to minimize the variability, reducing errors, fostering an environment where developers can, again, collaborate; that's efficient collaboration. So when you're standardizing, you're defining clear, repeatable code integration, testing and deployment processes. All of those kinds of things, when you standardize them, ensure that your pipelines are streamlined and the development process becomes a lot smoother for everyone that's interacting with what you're trying to do, whether you're building something internally or for external users, or both. So when we think about the steps for what better practices look like there, we start with assessment and analysis. So here you're really looking at your current CI/CD pipelines. You want to understand existing workflows, the tools, all the processes that you're using, to identify the pain points, the bottlenecks, and areas where standardization really is needed.
And then the next thing is you're going to look at all the specific requirements that are in place and the constraints of your projects, building on that first step. Then, when you're defining this, you're really going to define the goals and objectives that you're trying to achieve with your pipeline standardization. And those goals should align with the overall dev strategy that you have and some of the organization's business objectives. You don't want to stray away from that. And that also helps you start to identify the KPIs that are going to measure what success looks like for you in your development process. Usually that means you're probably going to try to reduce deployment times or decrease error rates; we always want to decrease error rates, obviously. Then you want to look at what the tools and practices are going to be for your CI/CD standardization, so that things align with your organization's needs and goals. So that's things like Jenkins, GitLab CI/CD, other cloud-native solutions, AWS CodePipeline; there's TeamCity, I think, on the cloud side. There are a bunch of different options there. But you want to make sure you have the tools and practices that help you achieve those goals. There are standardized templates for pipelines, defining those essential stages of build, test, deploy, what that's going to look like for you, and then what a standard configuration would be for all of your pipelines. And then you're also going to enforce a lot of those coding standards for CI/CD configurations, ensuring that there is consistency and readability in everything that you're doing, so somebody can come in and understand exactly what you're trying to do and you don't have to spend a lot of time kind of...
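The "standardized template defining build, test, deploy" idea can be sketched in GitLab CI syntax; the file location, job names and `make` targets here are all my own assumptions, not from the talk:

```yaml
# ci-templates/standard-pipeline.yml -- a shared template every repo can include
stages: [build, test, deploy]

build:
  stage: build
  script:
    - make build

test:
  stage: test
  script:
    - make test

deploy:
  stage: deploy
  script:
    - make deploy
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'   # deploy only from the main branch
```

A project would pull this in with GitLab's `include:` keyword and override only the jobs that differ, which keeps every pipeline on the same baseline.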
I mean, there's going to be onboarding, but you want to make it as standardized and relatively simple as possible. And then on documentation and training, which I touched on quickly: you want to make sure the documentation is comprehensive, that you're outlining all of the standardized processes that you have in place. Make sure everybody is aware of how you work, including how you work with your workflows. What's your standard configuration? What are some of those better practices that you're using inside your organization? Make sure that's documented. And then you're also providing training sessions for your dev teams, and the support teams that work with the dev teams, ensuring that they understand and can be really effective as they use your CI/CD tooling and all those templates that get created. Then you move into the version control side. You want to make sure you're storing those pipeline configs in some kind of VCS, you know, Git, GitLab, GitHub, whatever. That practice is really going to ensure that your configurations are versioned, so you can go back to something, you know what the changes were, you can trace where potential errors are, and, like I said, you can easily revert back to something if you need to. And then implement your branching and pull-request strategies. They should mirror what you're already doing in the standard that you've hopefully already documented, as we just talked about: making sure that all of the standard templates and such follow that same path of branching, pull requests and so on. And then automated testing. Since this is the testing room, we want to make sure we talk about testing.
You want to make sure you're integrating your automated testing and validation into the pipeline and all those templates, to ensure that those standardized configs produce your expected results. Don't just create a standardized template and not test that it works; otherwise you're going to create problems downstream. It's another great opportunity to put code reviews in place. Build out your standardized templates and then start code reviews. Make sure that you're not missing something. Bring more eyes to it. Validate it; catch those errors before they become an issue downstream. OK. And then continuous monitoring, on the continuous integration side of this, or continuous improvement. Make sure you're monitoring and have alerting in your CI/CD pipeline, so you're detecting issues and bottlenecks in real time before they become a problem. Establish this culture of continuous improvement. That means you're regularly reviewing and updating those pipelines based on the feedback and the evolving framework that your projects and pipelines go through. Make sure those templates aren't being left behind. Also, governance and compliance is very much an important part of CI/CD standardization. So make sure your policies are enforcing, in the standardized pipelines, compliance with industry regulations, or internal or external standards that are in place. Make sure that you're accounting for those. Regularly audit and assess how you are adhering to them, to make sure that you continue to improve there as well. Scaling and adaptation: ensure that those standardized pipeline templates are something that can scale and adapt to the different project types that you have. Every team or organization probably has different types of projects that you're all working on.
So make sure those templates are easily applied to the different things that you're doing, different sizes, different technologies that might be in place inside your organization or that you're developing for external use. Maintaining flexibility helps there, to accommodate the unique requirements that each project is going to have, while also making sure you're still adhering to your standardized core practices. And then there's the feedback loop. Very much a part of DevOps is the feedback loop. Even more, that's part of why continuous integration and continuous deployment are there: they give you that feedback loop. So have an environment where developers can really collaborate, provide that feedback and contribute to continuing to improve those standard practices, and then continuously communicate the benefits of those outside your team. Making sure everybody knows what you're working on and knows the achievements that you've had really helps drive more collaboration, drives more awareness of what your organization is doing, but also brings a lot of praise to the teams internally. So by putting these steps, these best practices, in place on the standardization side, organizations really can implement more efficient, consistent workflows, so that on the continuous integration, I'm sorry, on the standardization side, you really start to see the results of that. So we're going to look right now at Argo and Flux, just some of the features that they have that help implement some of these better CI/CD practices for standardization.
So Argo has reusable workflows: orgs can define reusable workflow templates that set up the standard sequences for CI/CD, like build, test, deploy, so that devs can reuse those things across projects, not just within the project you're working on now. Argo also follows GitOps principles. Your configs and workflows are managed as code in Git repos, ensuring everything is versioned, traceable, and easy to collaborate on amongst dev teams; that's really a core piece of GitOps. And then there's the way Argo manages artifacts: it supports managing and storing artifacts like Docker images as part of the CI/CD process, so you can make sure the right artifacts are used in the right situation, deployed across environments, and used as inputs in subsequent steps of your template. So those are some things Argo has in place specifically. And then from Flux, we have the declarative config model: the desired state of how a system is going to exist is defined in code. This is where orgs can define and enforce those standardized practices in a version control system, ensuring you can track things consistently. On the continuous synchronization side, Flux continuously synchronizes the desired state in your Git repos with the actual state of, for instance, your Kubernetes clusters. As things change, everything is continuously reconciled, so you have standard configs and deployments that are consistent across your environments. And then there's the policy side. Does it say Flagger there? Yeah. So that's the feature flag piece.
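As a concrete illustration of such a reusable build-test-deploy template, here it is expressed as a Python dict mirroring the YAML an org would keep in Git. The field layout follows Argo Workflows' `argoproj.io/v1alpha1` WorkflowTemplate format as documented; the image names are placeholders:

```python
# Sketch of a reusable Argo WorkflowTemplate built as a Python dict that
# mirrors the YAML kept in a Git repo (GitOps). Images are placeholders.

def workflow_template(name: str, steps: list[tuple[str, str]]) -> dict:
    """steps is a list of (step_name, container_image) pairs."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "WorkflowTemplate",
        "metadata": {"name": name},
        "spec": {
            "entrypoint": "main",
            "templates": [
                # one sequential step list referencing the per-step templates
                {"name": "main",
                 "steps": [[{"name": n, "template": n}] for n, _ in steps]},
            ] + [
                # one container template per standardized stage
                {"name": n, "container": {"image": img}} for n, img in steps
            ],
        },
    }

tmpl = workflow_template("standard-ci", [
    ("build", "example.org/builder:latest"),   # placeholder images
    ("test", "example.org/tester:latest"),
    ("deploy", "example.org/deployer:latest"),
])
print(tmpl["spec"]["entrypoint"])
```

Serialized to YAML and committed, every project can then reference `standard-ci` instead of re-declaring its own pipeline.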
So Flux has feature flag capabilities through Flagger, which is part of the project, so you can define different rules for how things get deployed, either different sets of things or to different users. You can really do a lot of A/B testing; think progressive delivery, it's that kind of thing. Yeah. So who here uses Flux? Okay. What about Argo? Okay. So I think there was some overlap. Good. So when we want to achieve these standardized workflows, the summary here is: your templates with Argo and Flux allow for standardized templates and definitions, so all your orgs have an established baseline to work with for consistency. There are also integrations with VCS and CI/CD tooling so that your configs are maintained and accessible to all, which is really important for bringing visibility to what you're doing. On the documentation and training side, it's really essential to standardize the docs and training, and to have docs and training for the things you've standardized. Make sure you've done both, so orgs can be responsible for making sure that dev teams and even support teams understand how these standard processes work. Continuous improvement fosters the culture that's really necessary to achieve a good developer experience: everything's regularly reviewed, workflows are updated, you're getting feedback, and improvements are continuously happening, keeping developer experience high on that list. Alright, interoperability in CI/CD refers to the ability of different tools, technologies, and components within the CI/CD ecosystem to work effectively together.
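The progressive-delivery idea behind Flagger, mentioned above, can be sketched roughly as follows. This is only the control logic, not Flagger's actual API; the `healthy` callback stands in for the real metric analysis (error rate, latency) Flagger performs:

```python
# Hedged sketch of canary-style progressive delivery: shift a growing share
# of traffic to the canary, and roll back the moment a health check fails.

def progressive_rollout(healthy, step=10, max_weight=50):
    """Ramp canary weight in `step`% increments; return (promoted, final_weight)."""
    weight = 0
    while weight < max_weight:
        weight += step
        if not healthy(weight):        # metric check at this traffic share
            return (False, 0)          # roll back: all traffic to primary
    return (True, 100)                 # promote canary to primary

# a canary that degrades once it receives more than 30% of traffic
promoted, weight = progressive_rollout(lambda w: w <= 30)
print(promoted, weight)
```

The value of the automated version is exactly this rollback path: a bad release only ever receives a bounded slice of traffic before being withdrawn.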
So that means the various parts, like the pipeline, source code repositories, build systems, testing frameworks, deployment platforms, and monitoring tools, are all able to interact with each other in a way that ensures you can see what's happening: data is effectively exchanged, and there aren't compatibility issues or disruptions to your workflows. Here's what that looks like. There's a collaboration side that gives flexibility and choice: implementing interoperability in your environment really enables dev teams to use the best tool for the job, so you don't have to live with vendor lock-in. Every org and company has its various tool preferences, and you want to make sure, with that enhanced collaboration, that all those different tools are not a blocker to success and that the interaction stays smooth. Another really important side of this is that interoperability enforces better utilization of your resources. Your orgs can make efficient use of your existing infrastructure and tools; you should not always have to build something new. If your systems and tools are interacting together, you're not being wasteful: you reuse components and scripts, saving cost. The next side of that is scalability and growth. As organizations scale, they're adopting new tech, which happens constantly.
Interoperability really ensures that your CI/CD systems can adapt and expand as necessary to incorporate new tools, processes, ideas, and workflows into the way you all work as a team. And then there are cross-platform deployments. The interoperability advantage there is that, in the multi-cloud and hybrid environments that are out there now, it promotes a unified approach: having all these different systems doesn't have to be a blocker, you have it all together, ensuring the data gets transferred well. It also promotes unified deployment and infrastructure management. And then troubleshooting and debugging; I knew there was another one there. When issues arise, this interop enables seamless data sharing between all the different tools and processes. I've seen an astronomical growth in the average number of SaaS tools in place in an organization, into the hundreds on average at companies. So being able to look at all the tools, troubleshoot them, and have everything working together is a huge game changer for better issue identification, troubleshooting, resolution and such. All right, so in essence, when CI/CD systems are interacting together, this interoperability acts as a bridge. This slide is one of those ChatGPT-generated images that kind of works. But it's connecting all the different parts of your dev and delivery processes together, fostering the collaboration we talked about and ensuring teams can work cohesively and efficiently. All of that is tied into the importance of interop. So let's look at how Spinnaker and Backstage do this. On Spinnaker's side, there we go: integration with cloud providers.
So Spinnaker allows you to integrate with pretty much anything you want, so that you have a consistent interface for deploying and managing across platforms, ensuring seamless targeting of the different environments in place and allowing devs to choose what works best. Don't tie them into one specific tool; that's the whole analogy of hammering a square peg into a round hole. Then there's integration with VCS systems. Spinnaker can really work with whatever, so you can trigger your deployment pipelines directly from your repositories and automate that release process, reducing manual intervention. Then extensible integrations: having an extensible architecture supports a lot of different integrations, which allows teams to connect with various sets of tools, like monitoring and incident management, and really ensures that Spinnaker seamlessly fits into your org's existing tool sets, requirements, and workflows. And then artifact management. We talked about how Argo has that; Spinnaker also lets you integrate with different artifact repos. You've got Docker Hub, there's Artifactory; those are the two that come to mind. That assists in managing those artifacts, ensuring the right things are consistently used in your deployments. And then there is pipeline abstraction, which helps you abstract the deployments, making the process more flexible and adaptable to what you're trying to do. Developers really can start to reuse those templates you've created, making adaptation easier as the projects and their requirements evolve.
And so that bridge between abstraction and flexibility ensures Spinnaker can cater to various deployment scenarios. That's the Spinnaker side; now let's think about Backstage. Backstage integrates with a lot of CI/CD tools and other things; we're talking about CI/CD here. It integrates with a lot of them, like Jenkins, CircleCI, GitHub Actions, Flux, and Argo, and all of that brings visibility. Having that interoperability with pretty much anything allows developers to visualize and manage what's going on in their pipelines directly from Backstage without having to go to multiple systems; you can do it all in one. So there's that unified, single-pane-of-glass view of the entire dev workflow. Service catalog: Backstage acts as that service catalog, helping teams manage and discover the services and apps they can use. That interoperability with all the different systems ensures the information in your CI/CD is integrated into the service catalog itself, making it easier for teams to understand service status and history. The history is really important, to be able to go back, see what's happened over time, and spot some trends. Backstage also has a really good plug-in ecosystem. That extensible architecture, across all the different custom plugins you can create or that maybe your community has built, can help bring better visibility to the things you do. And then there's customization and theming, allowing orgs to customize the UI and theme in place. That may seem like a small thing, but when you're trying to get your organization to buy off on using something like Backstage, having the ability to customize the look and feel satisfies a lot of the branding requirements that companies, and marketing departments, have.
So it's important to have that kind of flexibility; it ensures your org is going to be able to stay flexible and use what's there. All right, so Spinnaker and Backstage both prioritize flexibility and adaptability, allowing organizations to integrate with the diverse tool sets that are out there and accommodate the various needs developers have. Bridging those gaps between the different tech and systems, they act as that central hub connecting the parts, enhance the flexibility of your CI/CD pipelines and developer workflows, and ultimately promote a more efficient and collaborative development environment. All right. Organizations often use a mix of tool sets, and some challenges come in when you're trying to implement this with that mix. Each tool has its own ecosystem and APIs, and there are data format and schema differences; tools use a lot of different data formats, they're not unified, because having something different from everybody else is kind of their niche. So that presents a challenge. Authentication and authorization also present a lot of challenges: how do you not only manage access to all these different tools, but also deal with the different APIs going back and forth? Versioning and compatibility is another one. Tools change, new versions come out, and they can introduce breaking changes that either could have been avoided or not; it doesn't matter, you're trying to use them and now you have something that doesn't work. So that is a real challenge. And then lack of documentation. We've all seen it: an API that's on version two while the docs are on version 1.1, or they haven't documented one change and it breaks things.
That often is a challenge when trying to work with all these different systems, and in some cases building your own integrations between those systems can really get hit by lack of documentation. But there are ways to overcome those. Use unified config formats for how you define your deployment pipelines, documented and enforced for all the associated tools and libraries, which can then automatically convert between formats, ensuring data consistency and compatibility. There are API gateways that translate the data between systems for consistency, simplifying authentication and authorization access across all the different tool sets. It helps to maintain version compatibility, and it's important to use a version compatibility matrix, so you can track it all and see what works with what, to help you make better decisions. And make sure you've documented everything. Oh, time's up. Okay, so that's good there on that piece. The last little bit: when we think about developer experience, it's really important to remove all the barriers. So with that, thank you.
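The version compatibility matrix idea mentioned above can be sketched like this; the tool names and version strings are purely illustrative:

```python
# Sketch of a version-compatibility matrix: record which tool versions are
# known to work together, and check a proposed toolchain against it before
# upgrading anything. All tools and versions here are made up.

COMPAT = {
    ("ci-runner", "2.x"): {("artifact-store", "1.x"), ("deployer", "3.x")},
    ("ci-runner", "3.x"): {("artifact-store", "2.x"), ("deployer", "3.x")},
}

def compatible(toolchain: dict) -> list[str]:
    """Return conflicts for a proposed {tool: version} toolchain."""
    conflicts = []
    runner = ("ci-runner", toolchain.get("ci-runner", ""))
    allowed = COMPAT.get(runner, set())
    for tool, version in toolchain.items():
        if tool == "ci-runner":
            continue
        if (tool, version) not in allowed:
            conflicts.append(f"{tool} {version} untested with ci-runner {runner[1]}")
    return conflicts

print(compatible({"ci-runner": "3.x", "artifact-store": "1.x", "deployer": "3.x"}))
```

Keeping a matrix like this in version control, and checking it in CI before an upgrade lands, turns "new version broke everything" surprises into a reviewable diff.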
Ghosting the hardware
Hello everyone and welcome to this session about ghosting the hardware. Maybe the title is a bit obscure to you; I will explain what it means a bit later on. My name is Rémi Duraffort. I'm a principal tech lead at Linaro and I've been working on different open source projects for many years now. I'm currently working on LAVA, which is a test automation system that I will present. LAVA stands for Linaro Automated Validation Architecture. It's a test execution system, which means it allows for testing your software on real hardware, on real physical devices like a Raspberry Pi or a DragonBoard. It lets you deploy your software, boot the board, and test on real devices. It's used by multiple projects, like KernelCI for example, which mainly uses multiple LAVA instances. We use it a lot in Linaro for the LKFT project, the Linux Kernel Functional Testing project that we are driving. We also use it for bootloader testing; for example, you can test your U-Boot version directly on your board, and LAVA will interact with U-Boot and test it. We also do firmware testing with it. It currently supports 364 different device types, which is a lot of different device types. So say you want to test your software without LAVA. You have a kernel, DTB, RAM disk, rootfs, and modules that you want to test, and a Raspberry Pi; this is a pretty old Raspberry Pi 3, but it doesn't really matter. You need a way to access the serial console to interact with the board, so an FTDI cable, usually over USB. You need a way to power the board on and off, so you need some device that accepts a TCP request to a specific port with some commands: one request will power on the board and another will power it off, so it can be made automatic.
And usually we use TFTP and NFS for sharing the kernel, DTB, and root filesystem with the board, so you don't have to actually flash the board, because after some time you will destroy the SD card if you do that a bit too often. So when you have all of this, if you want to test the board, you have to power it on, so you send the right command to your power manager. You then connect to the serial, you interrupt U-Boot, you send some commands like dhcp so the board gets an IP address, you load the kernel over TFTP, you load the RAM disk over HTTP, you set the console arguments for the kernel, and you send the right boot arguments, which are board specific. You watch the kernel booting, looking for crashes or warnings, then you get the prompt, you log in, you run your tests, you collect the results, and you shut down the board. That's tedious, not really fun, and you will have to do it for every release of your software. So that's where LAVA comes into play. Instead of doing all that manually, we keep the board, the power control, the serial relay, and the TFTP and NFS servers, and replace you with a program, which is the LAVA worker. Instead of typing commands manually one by one, you explain in a YAML document to the LAVA worker what you expect it to do. You explain that you have a kernel, a DTB, and a rootfs that you want to deploy using TFTP, and that you want your rootfs to be available over NFS. LAVA will then know how to automatically interact with your board and send all the right commands that I explained in the previous slide, in a reproducible fashion, and it can do that day and night, including weekends, for you. This document that you write is what we call a job definition or job configuration.
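A rough sketch of such a job definition, written here as the Python dict equivalent of the YAML. The key names follow LAVA's deploy/boot/test action layout; the device type, URLs, and test names are placeholders:

```python
# Sketch of a LAVA job definition (Python dict mirroring the YAML).
# All URLs and names are placeholders, not real artifacts.

job = {
    "device_type": "raspberry-pi-3",           # which board class to schedule on
    "job_name": "kernel-boot-test",
    "actions": [
        {"deploy": {                            # serve artifacts over TFTP/NFS
            "to": "tftp",
            "kernel": {"url": "https://example.org/artifacts/zImage"},
            "dtb": {"url": "https://example.org/artifacts/board.dtb"},
            "nfsrootfs": {"url": "https://example.org/artifacts/rootfs.tar.gz"},
        }},
        {"boot": {                              # interrupt U-Boot, boot over NFS
            "method": "u-boot",
            "commands": "nfs",
            "prompts": ["login:"],
        }},
        {"test": {                              # run the POSIX-shell test suite
            "definitions": [{"name": "smoke", "path": "smoke.yaml",
                             "repository": "https://example.org/tests.git",
                             "from": "git"}],
        }},
    ],
}

action_names = [next(iter(a)) for a in job["actions"]]
print(action_names)  # the deploy / boot / test sequence described above
```

The point is that every manual step from the previous slide (power, U-Boot interrupt, dhcp, tftpboot, login, tests) is derived by the worker from this one declarative document.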
So obviously you can have multiple DUTs, devices under test, per worker, and you can have multiple workers attached to your LAVA instance; they all connect to the LAVA server, in the classical server-worker model. For example at Linaro in Cambridge, we have a lab with hundreds of boards, and I know Collabora also has a board farm like that. It has been designed for large board farms if you want. Regarding the roles: the server is the web UI and API, it's what is visible to the user, and it usually does not have access to the boards. For example at Linaro, all our LAVA servers are in the cloud somewhere, while the boards are connected to the workers physically in a closed lab. The workers have direct control of the DUTs, the boards, the devices under test, and they are not accessible to the users. Users will not have access to the board or the worker directly, only to the server. The server is responsible for storing the logs, the jobs, and the results, doing the scheduling, sending notifications, things like that. On the other side, the workers are responsible for the hardware: they deploy resources, power the boards on and off, interact with the serials, and look for crashes in the kernel and for the board health, things like that. This is the list of supported devices. Obviously you cannot read it, it's way too small, because there are way too many devices. But it shows that we support everything from really tiny IoT devices, up to the Raspberry Pi form factor, and up to even large servers that you can test with LAVA if you want. And as we support many different kinds of device types, we have to support different deploy methods and different boot methods. For example you can deploy with TFTP, NBD, fastboot for all the Android boards, VExpress, etc. For booting you can use DFU, U-Boot, PyOCD, fastboot, etc.
There's a whole set of different ones. And for the tests, you can have a POSIX shell interaction if it's available on the system you have. You can have interactive tests, for example when you want to interact with a bootloader: it's not a POSIX shell, so you have to send commands and expect results. And we can also do multi-node tests, which are tests in which more than one device boots at the same time and the devices can interact. For example you can test your server on physical hardware streaming to multiple different clients; that's something you can do in LAVA. So today I will also speak a bit about why we want to test LAVA itself. Why do we want to test the CI system? The obvious reason is that it's just a piece of software, so it's buggy; you have to test it to know what is working and what is not. Even more important, when you're building a CI system you have to ensure the CI system is rock solid, for two main reasons. The first is false positives: if you have bugs in your CI and you report something as a bug in the software under test while it's not buggy, your developers will just say, okay, I'm done with it, it's not working, I will not look at your CI system anymore. The second is false negatives: not reporting an error that happens in your CI. You're running a test, it's failing, but the CI system says everything is okay, which means you will tell the developer, I tested it, it's working, while in fact it's buggy. So you will release software that has been tested but is still buggy. You have to prove that your CI is reliable, otherwise it's just useless. So how are we going to test LAVA itself? We have a classical hierarchy of tests. We obviously have static analysis, and we have unit tests that run on every GitLab CI merge request.
We also do integration tests, which is what I will present today, a project called meta-lava, and we also do federated testing and testing on staging instances. We have some instances that we upgrade every day, where we run actual workloads and check that everything still works the same way as before. But the main problem when you want to test LAVA is that it's a combinatorial issue. As I said before, we support 364 different device types, roughly 16 deploy methods, roughly 26 boot methods, and five test methods. If you do the combination, the number of combinations you would have to test is insane. Yes, I know a lot of these combinations are just not going to work, because not all devices support DFU or fastboot and things like that, but still, it's really a lot. So maybe you want to give me boards and money; I would be up for it, but obviously I don't think that's the case. So maybe we should consider faking the DUTs: faking the hardware. That's the goal of the meta-lava project. The goal is to test the full system from the user's point of view, back to the user. The user should be able to send jobs; they have to be scheduled and run on a fake DUT, results are sent back, and the user pulls the results through the user interface. And I don't want to have any boards, because I want this to run somewhere in a CI/CD system. It has to be cheap, obviously, and fast. There are two ways you can fake devices. You can do board emulation; you can use FVP or QEMU, for example, to emulate devices. The main problem is that it's CPU intensive, so it will be slow and expensive. The other way is to ghost the hardware. If you look back at the lab architecture, I don't want to touch the user; that will be my testing system. And I don't want to touch anything in the server or the worker, because I want to keep the system under test intact.
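To make the combinatorial explosion concrete, the naive cross-product of the numbers quoted above works out as follows:

```python
# The combinatorial issue, made concrete: even before discarding impossible
# pairings, the naive cross-product of everything LAVA supports is far too
# large to exercise on real hardware.

device_types, deploy, boot, test = 364, 16, 26, 5
combinations = device_types * deploy * boot * test
print(combinations)  # 757120
```

Three quarters of a million combinations, even if most are nonsensical, is exactly why faking the DUT becomes attractive.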
So the only thing I can change is what is on the left part: the board, the power control server, and the TFTP/NFS servers. So what I have to do is build a fake DUT that will feel like a DUT, look like a DUT, smell like a DUT, sound like a DUT, and taste like a DUT, because LAVA should not see the difference between a real DUT and a fake DUT. But that's not enough, because I also have to check that what LAVA sends, the interaction LAVA has with the fake DUT, is still valid. If I have a fake DUT that accepts anything, then LAVA could do any stupid thing and it would still work, while it's just wrong. So the fake DUT also has to check that what LAVA sends is legit, that LAVA is still acting correctly. So we have to look at the interactions between LAVA and the DUT. As I said, there are three interactions. First, power control. By the way LAVA is designed, it's just a command that LAVA will run, so it can be any shell command that returns zero if passing or non-zero if failing. But from the fake DUT's point of view, the DUT should be able to check that the command has been called at the right time, so before booting, to verify that LAVA is still doing what it's supposed to do. Second, the serial relay. Again, it's just a shell command that LAVA runs and interacts with over standard input and output. So I need to build something that feels like a DUT when you interact with it over serial. Third, the TFTP and NFS servers. I will just use normal TFTP and NFS servers, and I will check, from the fake DUT's point of view, that LAVA has deployed the right binaries for me. So the question is: where do I want to mock things? Let's take an example. Suppose I don't want to give this presentation; I want to be in my bed, with something standing in for myself so you don't see the difference.
So I can build a robot that will take my place, speak like me, explain the same things, and interact with you the same way I would. That's one way to do it. Or I can force you all to wear glasses that inject an image of myself into your vision. Those are two different ways to fake me, but from your point of view it's the same: you won't be able to notice the difference. For mocking, it's the same. I have different ways I can mock. I can create hardware that interacts with LAVA the same way real hardware would, but without actually booting a kernel; that's possible, you just have to fake the serial and it will work. But as I said before, I don't want any hardware, I just want software. So what I will do is have only software that fakes all the interactions with LAVA, the serial relay for example. So we're going for a full software implementation. It's a project called dummysys. When you run it, it has the same output as a normal board; you can interact with it and it feels like interacting with a real board. I will show you right after. You can send it commands and it will react like a normal board would, and when you do TFTP and NFS transfers, it will actually load them and check that the binaries are present. Let's go for a really short demo. I have a run script, just a wrapper so I don't have to type everything, because it's painful to type. So dummysys, my program that fakes a DUT, is a Python script, and I give it a set of commands from a YAML file that I will explain right after. If I start it, those of you who are used to seeing U-Boot boot will recognize it: it prints what U-Boot usually prints, asks you to hit enter, and actually waits for you to type enter; then you have a shell in which you can type some commands, for example dhcp, and the board gets an address over DHCP. This is all fake.
I don't have any board attached to this; it's just a program faking a U-Boot interaction, a board interaction. And then I can just ask it to boot. It's not actually booting anything, it's just faking it, but from LAVA's point of view, it is booting something. The screen is a bit too small, but you can see that it looks like a board booting, while it's just printing text. That's enough, though, because it fulfills all the requirements from LAVA's point of view, and you can see it's just a program running. I can also, for example, do a login interaction if I want to. Say I want to check that LAVA is able to log in automatically, sending the right login and password; I can create a program that does that. Again, it's just doing the basic thing, booting. You see there's a small delay when printing; that's on purpose, to fake what a real board does, because a real board does not send all the characters in one go, the serial takes some time to process and transfer. So we fake that too. Now I have to log in. You see that if I don't send the right parameters, it prints "login incorrect"; if I send the right ones, it logs in as normal. Again, this is not doing anything, it's just pretending to run a system. And then this is what LAVA usually expects when it runs tests: it expects some signals, and I can fake that also. If you look at what's inside, it's a bit too small, but the argument to my program is just a set of commands. I'm asking my program to print some lines; those are the lines you've seen. Then it prints the different lines and accepts being interrupted, like U-Boot does. Then it has a shell; this is a prompt, the U-Boot prompt. And it will loop forever, waiting for exactly this command, "usb start", et cetera, et cetera. For the fake DUT to work and move to the next stage, LAVA has to send exactly the right commands.
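The expect-and-reply loop at the heart of such a fake DUT can be sketched like this. This mimics the idea described above, not the real tool's code; the banner text and commands are illustrative:

```python
# Hedged sketch of a fake-DUT core loop: print what a real bootloader would
# print, then only advance when the driver (LAVA) sends exactly the expected
# command. Any other input is an error, which is how the fake board also
# *validates* LAVA's behaviour.

SCRIPT = [
    {"banner": "U-Boot 2023.01 (fake)\n=> ", "expect": "dhcp",
     "reply": "DHCP client bound to address 192.0.2.10"},
    {"banner": "=> ", "expect": "tftpboot 0x80000 zImage",
     "reply": "Loading: ### done"},
]

def run(script, read, write):
    """Drive one scripted exchange; True only if every command matched exactly."""
    for step in script:
        write(step["banner"])            # what a real bootloader would print
        line = read()                    # the command LAVA sends over serial
        if line.strip() != step["expect"]:
            write(f"Unknown command '{line.strip()}'")
            return False                 # the driver misbehaved
        write(step["reply"])
    return True

# simulate LAVA sending exactly the right commands
commands = iter(["dhcp", "tftpboot 0x80000 zImage"])
out = []
ok = run(SCRIPT, read=lambda: next(commands), write=out.append)
print(ok)  # True
```

In the real setup, `read` and `write` would be wired to stdin/stdout, since LAVA treats the serial relay as just another shell command it talks to.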
And if it's not sending them, it will fail. So thanks to this list of expected commands, I'm able to check that LAVA sends exactly the same commands from one version to another, because that's what a real board would expect. And at the same time, from LAVA's point of view, it gets the output it expects. For example, here LAVA is expected to send a TFTP instruction to load the kernel image over TFTP. So I'm waiting for this exact command, and when I get it, I actually download it: I have a small script that downloads the file over TFTP and checks that it's present. As I said, what LAVA sends should be meaningful, so all the artifacts should be available where they should be. That's it for the short demo. So that's what the meta-lava project is doing. We have a server and workers working together, and instead of a real board, I just have the dummysys system. It's actually running 28 different device types, including boards I've never seen, because I just need the logs and the commands they expect; I don't need the real board itself. It also allows you to test bootloader failures, for example, which is something difficult to reproduce in real life: you would have to damage your board to get some specific errors. The system, meta-lava and dummysys, can reproduce the same error every time, because it's just a specific output that LAVA has to see. If you want to contribute to this, to have your boards tested by LAVA, so a fake version of your board tested by LAVA, please come to see me. I will be happy to add it to the system, and that will ensure that support for your board still works in future LAVA versions. It's a fun thing to do, system mocking; you just have to look at the interactions between the different systems. That's all. Do you have some questions before we go to the next presentation?
Pushing test lab to its limits: performance tracking techniques
Hello everyone, my name is Paweł Wietzorek. I work with Collabora and I've been involved in the maintenance of the server-side components of Collabora's automated testing laboratory. Today I would like to share with you a few lessons learned from that experience, particularly related to tracking the laboratory's performance and pushing beyond the limits of the software it runs. We'll start with some background information. Next I will move to interactive approaches for tracking its performance, I mean the lab's performance. After that, I'll describe a few solutions for automating that, and finally I will also share some thoughts on data generation. So let's start with the reason why, I mean what brought us here today. Thanks to Rémi's talk, we now have an idea of what LAVA is, what it provides for testing automation, and how it supports all these efforts. Some of you might also recall a talk given by my colleague Laura at last year's FOSDEM. Laura described in her talk how the lab at Collabora is set up, what its day-to-day maintenance tasks look like, and what the main challenges are while running this kind of laboratory, and she also shared some best practices. The key piece of information for us today is that Collabora's lab is a fairly large LAVA instance that is continuously growing, and together with a high number of devices also comes a high number of test job submissions to process, which, unsurprisingly, can result in higher load on the server side of things. And that in fact was our case. There was no need to panic though, at least not right away. High load means that the resources allocated for lab purposes are in use, and that's what they are meant to do after all. Interestingly, especially high load was observed on the nodes running database processes. And all of that is mostly fine, until the system becomes unresponsive.
This might make the lab unreliable, or even unusable, for higher-level test systems like Mesa CI or KernelCI (on the screenshot), which other Collaborans are involved in developing, maintaining, and of course using as well. My first thought was to simply estimate what resources are required for day-to-day operations and throw them at this workload. This could work short term, but it wouldn't really solve the problem. To do it the right way, a deeper understanding of the root cause of all these issues was needed. And by the way, this photo is from the Polish IT Olympics, where a hardware-component throwing contest is held. While this is a hard-drive throwing contest, which might not be the type of resource we needed, that was the initial idea. Thanks to Rémi's talk we also have a rough idea of what the main components of LAVA are, but let's recap them real quick. At a very high level, LAVA on the server side has two main components: a scheduler and a connection to the database. If we take a closer look, those are, respectively, a Django application and, by default, a PostgreSQL database. These are widely known, used, and mostly loved software components, so we can make use of several already available performance-tracking tools for them. So let's go through a few interactive or semi-interactive ones. As trivial as it might sound, it is equally important to start by simply enabling verbose logging on affected instances. This way we get first insights from replaying user stories, based either on direct reports from users, on Matomo statistics collected by recent LAVA releases, or on logs from the load balancer, which show us which API endpoints are used most and which views are most commonly requested. In the case of Django we get a few other perks. It's as easy as literally flipping a switch.
Django in debug mode also allows logging every statement executed on the database, and this can easily be extended with additional profiling information. But all these perks give you after-the-fact information. To collect it in a truly interactive manner, Django fortunately already has us covered and provides just the right tool for this purpose: the Django Debug Toolbar. It isn't much harder to enable than verbose logging. It just requires adding an additional Python package to your deployment, setting the internal IPs from which the toolbar should be available, confirming that it is enabled, and you're good to go. The Debug Toolbar not only provides great and immediate feedback, but also includes traces and additional profiling information, and it gives you all of that in an easy-to-use, graphical, user-friendly way. As you can see on the right-hand side of the screenshot, you even get all the requests sent to the instance and all the SQL statements run. But even though these tools are easy to enable, they come with some drawbacks as well. They should not be used on any user-facing instance, which brings us to setting up a personal, local LAVA instance just for debugging and performance-tracking purposes. Such a local instance would often come in a clean-slate state, with an empty database and no devices, and most local instances would not be able to connect to physical devices, at least not in the numbers that production instances run. And even though we could fake multiple devices, as Rémi mentioned in his talk, that wouldn't solve the problem of having a database pre-populated with realistic data. We could potentially prepare a database fixture for that purpose, but it might not be particularly easy to mock the entire database, as you can see on the model graph for lava-server. It's a non-trivial task, especially when it comes to keeping large numbers of processed jobs as archives.
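The Debug Toolbar setup described above boils down to a few settings changes. The fragment below is a generic sketch for a local, non-user-facing Django instance, not LAVA's actual configuration; check the Django Debug Toolbar documentation for the exact steps matching your Django version:

```python
# settings.py additions for a local debugging instance only.

DEBUG = True  # never on a production or user-facing instance

INSTALLED_APPS += ["debug_toolbar"]

# The middleware should come as early as possible, after any
# middleware that encodes the response body (e.g. GZipMiddleware).
MIDDLEWARE += ["debug_toolbar.middleware.DebugToolbarMiddleware"]

# The toolbar is only shown to requests coming from these addresses.
INTERNAL_IPS = ["127.0.0.1"]
```

The remaining step is routing the toolbar's views, typically by including `debug_toolbar.urls` under a `__debug__/` path in the project's `urls.py`.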
But the question is: do we really have to mock the database? It is all done locally, in our private debugging and performance-tracking instance. Maybe we don't have to create a new database, but can instead reuse a backup from the staging or production instance that we also run. And as the old saying goes about the two groups of people and backups, I believe we all belong to the group that already makes them. There is also an important second part of this saying: make sure that restoring your backup works properly as well. By reusing your pg_dump output as the input for your performance tests, you can tick off this task from your administrator task list. Also, if you base your Postgres Docker images on the official one, there is a really simple data-initialization method, which requires just mounting a volume with the pg_dump output; everything else is taken care of by the init script itself. It also supports on-the-fly decompression of the most popular archive formats, as you can see on the snippet taken directly from the init script of the Postgres image. Since we already have this database in our local instance, it would be useful to incorporate even more statistics from the database itself. For this, we could simply use pgAdmin, or even the psql command-line tool, to check the actual runtimes and other statistics with EXPLAIN ANALYZE queries. This would highlight database-level bottlenecks for us. And this way, having a database-level tool, we would also be able to run various experiments on the database, like changes to indexes or maybe adding query-planner hints. It costs us almost nothing, just running another container in our local setup, or, if pgAdmin is too much, you could also opt for one of the graphical tools available online, which would highlight the bottlenecks for you with a heat map showing where the issue might lie.
Using this database-level utility completes our tool set of interactive solutions, and while it is really important to be able to perform all those actions, it's paramount to do them again sometime soon, and again, and again, and again, and that moves us to automation solutions. By now we know what to look for, or what to watch out for, in our LAVA instances, and from the user stories, bug reports, Matomo statistics or load-balancer logs I mentioned earlier, we have specific code components to track, or maybe even test cases ready to check for that. But the question is how to run those test cases to get statistically valid feedback. We would have to take into consideration cache warm-ups and test-case calibration, preferably also have a way to compare between benchmark runs, and it would be great if it fit well into the test suites currently used by the upstream project, which, by the way, are based on pytest. Fortunately, it turns out that there is a pytest fixture that provides all of that and even more. In the case of the LAVA bottlenecks found in the Collabora instance, the next step was simply to wrap the prepared test cases with this fixture; wrapping the key pieces of code gave us benchmarks ready to run. The next step, once the test suite was prepared, was to plug it into the pipeline. Both the upstream LAVA project and the downstream LAVA tree make heavy use of GitLab CI, and that shouldn't be surprising; many projects already do the same, for example DRM CI, merged in the kernel 6.6 release. Currently, the job definitions for those GitLab CI pipelines (above, the downstream one; below, the upstream one) don't share any reusable code. This might change in the future. For now, downstream changes are made with the ease of upstreaming them later in mind. Moving to external definitions could make the GitLab CI pipelines a bit more complex, but we'll see whether that brings any value in the future.
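The fixture alluded to here is presumably the `benchmark` fixture from the pytest-benchmark plugin. As a rough stdlib-only illustration of what such a fixture takes care of (cache warm-up, repeated timed rounds, summary statistics for comparing runs; the real plugin additionally calibrates the number of rounds automatically):

```python
import statistics
import time

def benchmark(func, *args, warmup=3, rounds=20):
    """Sketch of a benchmarking fixture: warm caches first, then
    time repeated rounds and report summary statistics."""
    for _ in range(warmup):          # cache warm-up, results discarded
        func(*args)
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "mean": statistics.mean(timings),
        "stdev": statistics.stdev(timings),
        "rounds": rounds,
    }

# e.g. benchmark sorting a reversed list of 1000 elements
stats = benchmark(sorted, list(range(1000, 0, -1)))
print(stats["rounds"])  # 20
```

With the real plugin, a test simply takes `benchmark` as an argument and calls `benchmark(function_under_test)`, and the plugin stores results so that successive CI runs can be compared.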
Of course, GitLab CI jobs need an environment to run in, and to get a baseline of what should be expected from benchmark runs, the easy way out is a dedicated runner that provides stable results, not affected by, for example, other test suites running in parallel on the same GitLab runner. A good choice would be a machine with resources similar to the node your LAVA instance runs on, and for proof-of-concept purposes I used a small desktop computer, which gave just that. GitLab runners are also really easy to plug into a GitLab server. And while we are already optimizing the pipeline, we should also take into consideration caching the CI data resources for benchmark runs. For that, we could easily use the already available upstream LAVA caching solution, which is based on specific CI images to run tests on. But that would also mean that the production data from the database we used earlier is no longer a valid option for us, and we need to revisit the lava-server model, which brings us to data generation, which we could no longer omit. That brought us to creating a dummy database generator, focused on just a few key tables and relations, according to the Postgres planner statistics. It was implemented with a very limited scope, to only cover the worst bottlenecks found in Collabora's instance, and for that we used standard Python tools: factory_boy and Faker. As a bonus, you might also want to ask a few questions. Should LAVA actually archive all the test jobs that are run, or can archiving those jobs be delegated to higher-level test systems? Fortunately, a retention mechanism is already available in upstream LAVA; it just required enabling it in the Helm chart used to deploy LAVA instances at Collabora. To summarize all of that, I've got three final thoughts that I would like to share with you. Performance tracking in testing laboratories is not a one-time job.
It's a process that might differ from instance to instance, depending on your specific workload, but it's something that I hope will be easier for you if you come across the same set of issues. It also requires frequent revisiting and adjusting according to the results you see. But even small changes can bring huge boosts in performance; that, though, is probably a topic for another talk. And that's all I have prepared for you today. Thanks for your attention. Do we have time for questions? If there are any questions, I will be happy to answer them.
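The dummy-database generator mentioned in this talk was built on factory_boy and Faker; as a hedged, stdlib-only sketch of the same idea, generating rows for a few key tables with plausible values (the table and field names below are invented for illustration, not LAVA's real schema):

```python
import random
from dataclasses import dataclass

# Reproducible runs matter for benchmarks, so fix the seed.
random.seed(42)

@dataclass
class TestJob:
    id: int
    device_type: str
    state: str

# Value pools standing in for what Faker/factory_boy would provide.
DEVICE_TYPES = ["qemu", "bcm2711-rpi-4-b", "x86_64"]
STATES = ["submitted", "running", "finished"]

def make_jobs(n, start_id=1):
    """Generate n dummy job rows with sequential primary keys."""
    return [
        TestJob(
            id=start_id + i,
            device_type=random.choice(DEVICE_TYPES),
            state=random.choice(STATES),
        )
        for i in range(n)
    ]

jobs = make_jobs(1000)
print(len(jobs))  # 1000
```

With factory_boy, the same shape would be a `factory.django.DjangoModelFactory` subclass with `factory.Sequence` for the ids and `factory.Faker` providers for the text fields, skewed to match the planner statistics of the production tables.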
Performance testing and why even the imperfect one is important
Hello everyone. (You have two more minutes.) My name is Andre. I've worked for Red Hat for several years now as a quality engineer, and today's talk is really about performance testing, but it's not about the testing itself; it's more about why we should do it. (You have six minutes.) Okay. (You're starting early, just so you know.) Yeah. Okay. So it's more about why we should do it and why there are benefits in it, even if you do it wrongly or imperfectly. That's the main point of the talk today. So, first things first: why should we do it? What are the benefits of doing performance testing even if we don't have an isolated environment and all that kind of stuff? Well, for me, the main benefit is that even if you don't have the environment you would want, you can still find the bottlenecks in your application, or whatever you are testing, and you can still optimize it even if you don't have everything ready. Because the truth is that performance testing is quite expensive, and for a good one, I don't think many companies will give you the resources you need to do it perfectly. So that's for me the main reason to do it anyway. And for me the second most important is that you will gain the knowledge you need about the product itself, because you will suddenly see things that you normally cannot see, even when you deploy things regularly. You will see the little things that are happening here and there, and the information you gain is quite nice to have. So those are probably the points you should look at if you are thinking about performance testing; this is what you will actually gain from it. This is only my opinion; you will see a lot of papers about performance testing and all the things that you have to take care of.
On GitHub, I know about two or three papers that have something like 40 pages about performance testing and all the criteria you have to fulfil. In my opinion, there are two variants of performance testing: the first is measurement and the second is testing. For me, measurement is when you are really looking for numbers, and you need those numbers for, I would say, legal reasons, or anything you have to declare to your customer. For example, for us, I work on Debezium: if we wanted to say that this connector is actually able to do 30k per second, we would need some kind of proof that we can do it. And getting this proof is very complicated; you have to do it in very specific ways, and even if you have everything, it's not always acceptable. The second variant is just testing. For me, testing is really just finding the bottlenecks in your product. And I think testing is even more important, because there you will find all the bottlenecks, you can really optimize your application, and you can see the flaws in your code, because you cannot see these things when you run it regularly and you don't have everything around the application tuned up, so you don't push your application to the maximum. These things usually happen when you go over the top, or near the maximum. So yeah, these are, in my opinion, the two ways to do performance testing, or two variants of it.
What is not really optimal about the testing I was talking about, just finding the bottlenecks, not the numbers, is that you need massive monitoring, and I will say more about that later in the talk. That's the main disadvantage: most of the time you will be dealing with tracing, monitoring and metrics, and you will find things that will really give you a hard time figuring them out, because you are going for performance and you are speaking in milliseconds. But most of the tools used for monitoring are not really prepared to handle millisecond resolution; they think it's okay to scrape metrics, for example, every 10 seconds, and this will give you massive headaches along the way. The goal of performance testing is, as I said, to find the bottlenecks, but there is much more to it. For example, the load types: if any of you have come across performance testing, the main point everyone talks about is what kind of load we are going to generate and how we are going to make it reproducible. For some applications you put a constant load on them, let's say 10k requests per second to the API for one hour, and that could be fine. But we have all seen that some websites, for example the systems where you buy tickets for concerts, need peak loads: you are going low at 5k per second and suddenly you spin it up to 100k per second or something like that. So you really need something that will generate the load for you and do it reproducibly. You need to have the same load so you can repeat the testing a couple of times and be consistent, because otherwise you will find all sorts of other things except the flaws in your code. So, the main problems you will find during performance testing. I have said that you don't need an isolated environment to do it.
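The two load shapes described above, constant and peak, can be expressed as reproducible schedules so that every test run replays exactly the same load. A minimal sketch (the rates and durations are arbitrary examples):

```python
# A load schedule here is a list of requests-per-second values, one
# entry per one-second slot, which a load generator can replay.

def constant_load(rate, duration):
    """Flat load: the same rate for every second of the test."""
    return [rate] * duration

def peak_load(base, peak, duration, peak_start, peak_len):
    """Baseline load with a sudden spike, like a ticket-shop rush."""
    schedule = [base] * duration
    for t in range(peak_start, min(peak_start + peak_len, duration)):
        schedule[t] = peak
    return schedule

# e.g. 5k req/s baseline with a 5-second 100k req/s spike at t=30
shape = peak_load(base=5_000, peak=100_000, duration=60,
                  peak_start=30, peak_len=5)
print(max(shape), min(shape))
```

Because the schedule is plain data, it can be stored alongside the test results, which is what makes run-to-run comparisons meaningful.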
And that's true; we don't have an isolated environment in our team when we do performance testing. But if you don't have an entirely isolated environment, you need to know your environment pretty well. You have to know your latencies, all the hardware specs, and these kinds of things. You really need all of that information, because if you see some very specific things happen during the test which are not common, you can then put the puzzle together with all the information you have about the environment, and at least decrease the number of silly mistakes you would otherwise have to chase down. The next very important thing is to have monitoring, which I have mentioned already. You need all the metrics you can get, because beforehand you don't even know what kind of information will be valuable for you, but you will need all of it. If you gather everything, you will surely use it; if you don't gather those metrics and then figure out you need them, you cannot get them from the past. So really, gather everything you can and it will be fine. And the last thing: you need to tune up all the systems that you depend on. We are working on databases, and we cannot really push our product to its limit if the database isn't optimized for the hardware it's running on, because if the database is not at full throttle, we are not at full throttle. So you basically need to have everything there at high spec so you don't bottleneck your application. That's one of the main points, because some things are quite problematic to tune up. I quite like this quote, because it's all about the metrics: if you have them, it's fine and it's nice; if you don't have them, it's massive problems. So I think that's really the quote you should keep in mind. So, again, monitoring. I have already talked about the problem with scraping.
We have mostly used Prometheus, and the maximum you can get from Prometheus is one-second scraping. That's fine for informational purposes, but not for performance metrics, because things happen within milliseconds. Maybe 10-millisecond resolution would be enough, but with one second you are really losing a lot of information; later on I have an example of what you can see when the scrapers are not fast enough. And that is a massive problem, because not every scraper, I would even say no scraper, can do it really fast, so you probably have to implement it yourself, and we are actually working on that. The second problem is that you will end up having a lot of systems in the field, because you need hardware metrics, JMX metrics, and I don't know what else; it really depends on your application. For us, we needed hardware metrics, JMX metrics, and some metrics from our test suite, and these three things all have different outputs. We used Netdata for the hardware metrics; it's a really nice tool, open source, fast, everything nice. But you cannot import JMX metrics into Netdata, and Netdata also has one problem: you cannot import anything that happened in the past, because it's strictly hard-coded to "now". So that's the problem, and then you say: okay, so I cannot have JMX metrics in Netdata, so I'll add Prometheus, that's fine. So now you have Netdata and Prometheus. And then you continue, because, at least in our experience, we still needed some place to store the metrics from our test suite, and you can't just import them anywhere. So then it happens that you deploy Postgres, because you can use Postgres as a backend for Prometheus for data storage. So now we have Netdata, Postgres and also Prometheus.
And last but not least, you add Grafana, because you need to visualize it. Getting all those things into shape means you have massive monitoring, and everything can go wrong. So if you can use the smallest number of tools possible, it's better, because once you have too many, it's a nightmare to keep it all in shape the whole time. Yeah. So, with performance testing you are not really looking for the numbers. The numbers are not that important in this case, because you don't just want to see that the throughput is like this or like that. You need to see the trends in the graphs, because there you can see whether you are constantly slowing down or going the optimized way. So you really have to look for the patterns and the trends in the graphs. I have an example from our testing where I will show you the patterns we have found. But before that: our system under test is Debezium. I don't know if you know Debezium, but it is effectively change data capture streaming, which means that we sit on top of the database, scan the transaction logs, and send all the events that happen there to Kafka. We effectively run in the Kafka Connect runtime, which makes the performance testing even more juicy, I would say, because the runtime is not ours, so it's a little tricky. So that's our system under test. And this is the first example I have put up. The graph on the top is basically our processing duration, and there are two things you can see on it. Most of the time we are oscillating between some values, around 170 to 220, and that's entirely fine; that's actually what you want to see if you are looking at response times: you want to oscillate around some value, like a sine wave or something like that. But what is not so nice is at the start, where we are constantly getting slower and slower and slower.
And we have some peaks there which have no reason to be there, because the data is the same all the time. So this is most likely a flaw in the code: there is something happening that shouldn't be happening. It can be the database flushing to disk; it can be basically anything, but you know that there might be a problem, and you have all the other metrics; you can have metrics from the databases that will show you that flushing was happening, or anything else. This is what you have to look for. It will certainly be different for your application than for ours, and you will have to define what you are looking for. But that's the main thing. And the funnier example is this one, here. These are JMX metrics from Debezium, and they show you the size of the internal queue of Debezium. That basically means that from the database we are reading into an internal queue, so once the queue is at zero, we are not reading. But we are still processing, right? So there must be some mistake. And this is actually the problem with the scrapers: the scraper samples every second, the database is pretty fast and can empty the queue during that time, and if the scraper hits just the wrong moment, it will give you zero. So from this point until the end, the graph is all wrong; it's not true. And it's all because of the speed of the scraper, because it basically sampled at the wrong time. That is something you have to be worried about, because it will surely happen. And these are some other graphs; these are, I would say, wilder, from the start of our playing with performance. But the top one is also pretty cool: it's not as constant as the previous one, but it's still within some bounds; we are somehow oscillating, although there is no really clear pattern. But the queue size is okay now.
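The scraper-aliasing effect described here can be shown with a toy simulation. Assuming a queue that fills and drains within milliseconds, a one-second scraper whose sample times happen to fall in the empty phase reports a flat zero, while a 10-millisecond scraper sees the real bursts (the timings below are invented for illustration):

```python
# Toy model: the queue gets a burst every 100 ms and is drained
# within 20 ms, so it sits at zero 80% of the time.

def queue_depth(ms):
    phase = ms % 100
    return 50 - phase * 2.5 if phase < 20 else 0

# 10 ms scraper: catches the bursts.
fast = [queue_depth(t) for t in range(0, 10_000, 10)]

# 1 s scraper whose samples happen to land mid-gap (offset 30 ms):
# every reading is zero, even though work is flowing the whole time.
slow = [queue_depth(t) for t in range(30, 10_000, 1_000)]

print(max(fast))  # 50.0
print(max(slow))  # 0
```

The opposite phase alignment is just as misleading: a one-second scraper whose samples always land on the burst would report a permanently full queue. Either way, the graph reflects the sampling phase, not the system.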
You can see that because you have some data there, not zero; so there was an issue with the scrapers, as I said. And this is really the thing: you will have to look for the patterns in graphs. You can see all the different ones, and there are a lot of papers on the Internet you can find about what to look for in your specific application. So yeah, don't look at the numbers; numbers don't tell you anything. You can usually get higher numbers just by upgrading the hardware you are running on, but if you can optimize on whatever hardware you have, you will surely get big numbers almost anywhere. Now some tips and tricks from me. Along the way, as we started playing with performance, we developed a lot of tools. The first is a database manipulation tool, which effectively gives you a JSON API, and with just that JSON API you can create DML for almost any database; we now have probably MySQL, Postgres, and Oracle there. So you don't need to have a lot of different JDBC connectors in your code; you just deploy this and it takes care of it. We have also implemented a load generator that can generate constant load, peaks, and all that kind of stuff. We also have some automation, and the other thing, the MySQL auto-tune, we are pretty proud of, because it can basically tune your MySQL to the whole VM or physical machine you are running it on. You would say that it's easy, but it's not; you know it's hard when you find yourself on the seventh or eighth page of Google results. At that point you know you are probably not in good shape, and this was one of those things. So please take a look if you are working with MySQL; we have the calculation of the parameters for the database there, and it will save you a lot of time if you want to tune up your database. We have spent the time for you.
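To give a flavour of what such parameter calculation looks like, here is a purely illustrative rule-of-thumb sketch; the percentages are common community heuristics, not the values the tool above actually uses, and any real setting should be checked against the MySQL documentation for your version:

```python
# Illustrative auto-tune heuristic: size InnoDB settings from the
# host's total RAM. All numbers here are assumptions for the sketch.

def suggest_innodb_settings(total_ram_mb, dedicated=True):
    # Classic heuristic: give InnoDB most of the RAM on a dedicated
    # database host, far less when the host runs other services too.
    fraction = 0.75 if dedicated else 0.25
    buffer_pool_mb = int(total_ram_mb * fraction)
    return {
        "innodb_buffer_pool_size": f"{buffer_pool_mb}M",
        # roughly one instance per GB of buffer pool, capped at 8
        "innodb_buffer_pool_instances": max(1, min(8, buffer_pool_mb // 1024)),
    }

print(suggest_innodb_settings(16_000))
```

A real tool would also account for connection counts, log file sizing and the storage behind the data directory, which is exactly why doing this by hand for every test machine gets tedious.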
And secondly, we have implemented a Netdata-to-Prometheus scraper, so you can get rid of one struggle point in your monitoring environment. And we are also starting work on the fast scraper for our monitoring stack, but it's not done yet, because it's quite a bit more complicated. So please take a look; the links are on our GitHub and everything is open source, so you can also contribute some code there if you want. Yeah, so I started quite a bit earlier than I should have, so I have some time now. But okay, we can just summarize everything now, and I hope for some discussion with those of you who have done performance testing. So, from me: don't be scared of performance testing. It's not some monster; people mostly just make a monster out of it. If you don't need it for legal reasons or anything like that, it's fine, you can play with it, it's fun, and you will gain a lot of knowledge about your product. And especially if you are in QE: a lot of QE folks don't have the necessary knowledge about the product itself, and this helps a lot to get through everything, because in the end you will just go through the code and look for the mistakes. So that really helps a lot. So, gather all the metrics you can. We are also writing our blog, and all the repositories will be linked there; I would be happy to hear from you. And yeah, that's probably it for my talk. As I said, I started a lot earlier, so thank you very much for listening. Do you have some questions? Yeah. So my question would be: what kind of experience do you have in your complex system when you see something happen there and say, okay, here is a latency spike or something like that?
What experience do you have with, let's say, finding the cause of the problem? When something happens randomly, you will see it in the graph and say, okay, something happened there, and that's annoying, especially when it happens randomly. So what kind of strategies do you use to find the cause of the problem in the complex chain? Yeah, okay. So the question was: if there are some changes in the environment, some latency issues or whatever, how can we deal with that and how can we find the causes of the problems? Surely this is the main problem of doing performance testing outside of an isolated environment. Well, you need the metrics from everything, because then they at least help you to line everything up on the right timeline, and you can see the dips and peaks and what could have happened. But if it's something really bad, you usually cannot find it; it will mostly just disappear in all the logs, because it can be something like, as happened to me once on some small machines, filling up the TCP queue; it was a funny thing, and you cannot find that anywhere in the logs. In that case you just repeat the testing, even if it takes long, and you see whether the same thing happens or not. I don't have any other recommendation for that, because this is really the main problem if you are doing it outside of an ideal environment. You will surely face this, but mostly it doesn't happen that often, I would say, because you can have observability and tracing for a lot of things, and most of the time you can correlate those things together, so you know exactly what is wrong, especially for the network: you can capture a lot of the network traffic, so you can see all the traffic and what is going on, especially on one line.
So then you can usually put those graphs together and you know the timing. Is that an okay answer for you? Yeah, yeah, yeah. Thanks for the talk. I was just wondering how you use the traces, analyzing the traces, because I've seen that you mentioned metrics and traces. Sorry, can you speak louder? Oh, yeah. Can you hear me now? Yes. Yeah, I was wondering how you use the traces for performance testing, because when you collect the traces, how do you deal with the sampling of the traces? And if you miss something because the sampling is bad, or you are not sampling everything, maybe you have to infer something from the metrics and the traces; I was wondering how you deal with the traces, and whether you use distributed tracing in a large project, collecting all that kind of stuff. I'm not sure I understand the question. The question is: I've seen that you are collecting the metrics and then analyzing the metrics. Yeah. And what about the traces? Yeah, so, well, Debezium does not really have that amount of traces we could get from it. We have mostly JMX metrics from the Java environment, so that's what we analyze. And I'm not sure how to answer the question further, so I'm sorry; we can discuss it later, I'll come to you. Okay, so my question is about long-running tests. Sometimes performance degradation is visible only after a long run, for example one week, a couple of weeks, sometimes even more. So how do you address this in your process, or how do you recommend addressing this problem? Yes.
So for this, colleagues of mine, as part of our open-source organization, are also developing a long-running cluster environment, something like that, because having a long-running thing is complicated in itself: you have to manage it a lot, especially on OpenShift or Kubernetes, and these kinds of things are a little problematic around upgrades and so on. So we haven't dealt with that yet, but we are planning that once we are okay, once we have everything prepared for the databases and everything, we want to get it up and running on the long-running clusters and regularly run the performance tests over, I would say, a month or a week; that is usually enough, especially when you set all the retention and memory numbers to low ranges. It doesn't take too long to fill everything up, and then you will start to see the retention and the flushes and everything. So yeah, that's our plan, but we haven't done it yet. But if you are interested in that, you should definitely look at the repository we have on GitHub, because it could be useful. Do you have any tips for running performance tests in the cloud? Because for me that's quite the opposite of dedicated test runners, but when the software finally runs in the cloud, you should probably also performance test it there. It's a problem. A big one. We have tried it and it is so inconsistent; the results are all over the place. If you have two identical clusters, Kubernetes or OpenShift, it doesn't matter, and you run the tests on both at the same time, with the clusters in different zones on AWS, you will get entirely different results, because of all the load balancers and these kinds of things. If you have only internal communication on the cluster, without anything from the outside,
it could actually be doable, I think. But if you have any communication during the test that goes outside the cluster, through the load balancers and that kind of stuff, I think that's not doable in any way, because you don't know what latency you will have for those requests and round trips. So I think that would be really problematic. But if you can mock up the external communication with some internal endpoint, it should be quite okay-ish, I would say, though you will not get really good results from that, I think, even if you try more and more. I think there are some special Kubernetes builds that should be used for these kinds of measurements, but I have never actually tried them, so I cannot recommend them until I have tried them. But yeah, this is definitely a problem. Okay, are there more questions? We have a few more minutes for questions. Come on. Otherwise I'm going to ask you to, you know, move your seats. Wait, wait. You said you would want to have a very small scrape interval, down in the milliseconds. Yeah. So doesn't that create problems of its own, something like noisy neighbors and so on? Yes, it does. It does. Right, but... how big of a problem is that? Well, that's the thing. We are really thinking about writing a scraper that is fast enough for this, and yes, you will probably generate some problems along the way, especially if you would like to send the metrics directly to Prometheus every millisecond. You will probably fill up the network line, or the TCP stack, or whatever, because it's really fast. It will strongly depend on the machine that you are running on: if you have space there, if you have a lot of RAM, you could actually batch all the metrics and send them as one package after the test is done. But yes, that's actually what we are now fighting with, and we are trying to figure out how we are going to aim for this.
But mostly we are thinking that we will make a configurable scraper that will either batch the requests or send them directly, something like that. I cannot tell you what problems it creates, because we haven't actually tried it, especially with batching, because I have counted it up, and the metrics aren't small, actually. So it will take a lot of space in memory. We will have to try it and somehow figure it out. But without the fast scraper, it will give you real headaches, because you will try to find something and fight something, and you will spend ages debugging it, and then you will find that the scraper hit at the wrong time every time. So we have to deal with this in some way. But it will be hard and problematic. I think we have time for one last question. No one? Tough crowd. Thank you very much.
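The batching idea the speaker describes — hold millisecond-resolution samples in RAM and push them in one batch after the test, instead of hammering the network once per sample — could be sketched roughly like this. The class, the `read_metric` callback, and the 1 ms interval are all illustrative assumptions, not the team's actual scraper:

```python
import time

class BatchingScraper:
    """Toy sketch: sample a metric at high frequency, keep samples in RAM,
    and push them in one batch after the test instead of per-sample."""

    def __init__(self, read_metric, interval_s=0.001):
        self.read_metric = read_metric   # callable returning the current value
        self.interval_s = interval_s
        self.samples = []                # (timestamp, value) pairs held in memory

    def run(self, duration_s):
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            self.samples.append((time.monotonic(), self.read_metric()))
            time.sleep(self.interval_s)

    def flush(self, send_batch):
        # One network call after the test, instead of one per millisecond.
        send_batch(self.samples)
        self.samples = []

counter = iter(range(10**6))
scraper = BatchingScraper(read_metric=lambda: next(counter))
scraper.run(duration_s=0.05)
batches = []
scraper.flush(batches.append)
print(len(batches), len(batches[0]))
```

The trade-off the speaker mentions is visible here: `samples` grows linearly with test duration and sampling rate, so RAM becomes the limiting factor for long tests.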
squash the flakes! - how to minimize the impact of flaky tests
Come on, people. Yeah, let's cheer for Daniel, because it's his first time speaking and everything's failing. And it's off to a good start. Yeah, come on, big applause. Thank you. You're doing awesome. And you know, the only certain thing about technology is that it's going to fail exactly when it doesn't need to. Yeah, like I think I said already, flakiness is not only happening in tests, obviously. So, while we're waiting for this thing to happen, I could ask a question: who actually has an idea of what a flake would be in testing? Okay, I should just repeat what you're going to say. Yeah, go ahead. So you have an idea, but you don't want to tell me. Exactly, exactly. So to me, or I think to most people who agree about this topic, a flaky test is a test that fails and passes in successive runs without any change to the code, neither the testing code nor the underlying production code. Okay. So yeah, this talk will be about flaky tests. Yeah, of course, flaky behavior is not caused only by the test being flaky, but also by the software, but I would divide those two kinds into different categories, and how they are handled is different. So, let's wait. Yeah, I'm going to start with the introduction. My name is Daniel Hiller, I'm working at Red Hat. I'm working on the upstream KubeVirt project, and there I'm maintaining the KubeVirt CI system. So this talk will be about flaky tests and how we should, or how we are actually going to, handle them in our community, for the KubeVirt contributors.
So I don't claim I have the silver bullet for handling that. I would be happy to have any input from you folks on how we can improve, and I would actually also want to have some kind of extended Q&A session, if there is still time somehow, so that you can talk about what you have experienced and how you handle it. Just as a quick overview of how I think this should go: I'm going to start with what a flake is, but you described it perfectly already, so that's fine; then what the impact of flakes is; then how we can actually find flakes; then how our flake process works and what tools we have that support it; and in the end I just want to describe what we're aiming to do in the future to improve this. I just don't have internet for some reason. Oh, no. My email, okay. Yeah, I think it's going really terribly wrong. Sorry for all that, by the way. A packed room, I didn't expect that, to be honest. So thank you all for coming, really great. I'm going to help you out, don't worry. So tell me a little bit more while we wait for the slides. Can you give us a hint as to what you wanted to show us, and just tell us the story about it? Yeah, without the slides I'm just going to open it up a bit. Pretend I'm stupid and I have no idea what flaky is, and just, you know, tell it to me. So, I told you already about the agenda, and the question of what flakes are was already answered, so I have two other questions. The first one is somewhat suggestive, I guess: who thinks handling flakes is important? Put your hand up. A few of you don't. Yeah, of course, everyone thinks handling flakes is important. Okay, I thought so. So... you saved my day, do you have a USB port? I hope so. Once again, you need to put it in presentation mode; on the right there should be a presentation option.
Yeah, that should be okay. Okay, so the questions we already had. And another question: who has to deal with flakes on a regular basis? Wow, okay. Yeah, I expected something like that. So yeah, like you correctly said already, flakes are caused either by production code, which is a bug of course, or by flaky test code. This is also a bug, but it's handled differently, like I already said. So we are using Prow for our CI system, which comes from the Kubernetes ecosystem. I'm not sure whether you're familiar with it, but it's pretty flexible, and it can start jobs from GitHub events, which is exactly what we want and what we need. This picture actually shows, at the top, for example, the commit ID. I can't even see it like that... there, this is the commit ID, and these are the job runs that are defined, the jobs on the CI system. This, of course, is a failed job, and these are successful jobs. So obviously you can see this is the PR history for one of our PRs inside the KubeVirt CI, and what you can see here is that, of course, the jobs all run on the same commit ID, but some failed and some succeeded, and that's exactly how we see where we have our flakiness. Oh, wait a second, that's the wrong direction. Okay. So, there is a really interesting survey, a major survey about flakiness in tests, which is just called "A Survey of Flaky Tests". Not a really impressive title, but great stuff inside. There you can read that 79% of the flakes were [inaudible], and more than 50% of flakes could not be reproduced in isolation, which of course leads us to the conclusion that ignoring a flaky test is okay, right?
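The detection signal Daniel describes — the same commit showing both failed and successful runs of the same job, with no code change in between — can be expressed as a small filter. The data shape below is an illustrative assumption, not Prow's actual API:

```python
from collections import defaultdict

def find_flaky_jobs(job_runs):
    """job_runs: iterable of (commit_id, job_name, passed) tuples.
    A job is suspected flaky if, for at least one commit, it both
    passed and failed without any code change (same commit)."""
    outcomes = defaultdict(set)
    for commit, job, passed in job_runs:
        outcomes[(commit, job)].add(passed)
    return sorted({job for (_, job), seen in outcomes.items()
                   if seen == {True, False}})

runs = [
    ("abc123", "e2e-network", False),  # failed ...
    ("abc123", "e2e-network", True),   # ... then passed on the same commit
    ("abc123", "unit", True),
    ("def456", "unit", True),
]
print(find_flaky_jobs(runs))
```

Note this only flags *suspects*: as the talk says, a same-commit pass/fail pair might come from flaky test code, flaky production code, or infrastructure trouble.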
It isn't, of course. So, when we're talking about CI, we want a reliable signal of stability, because of course we want to know whether we can ship our product or not. Any failed test run signals to us, as the CI maintainers, that the product is unstable and that we cannot ship it. So if we have flakes in our CI, they give us a false signal that the product is unstable and that we cannot ship, and we then have to go and verify in the test code what exactly went wrong, and then we notice it's a flaky test. This wastes a lot of time, of course. Not only does it waste the time of the developers themselves, who have to look at the test results and determine whether it is a flaky test or not, but also, when you have a CI that determines via the tests whether a PR can get merged, and then you have a failed test result, of course the merge will not go through. This causes friction for the developers, who then have to reissue another test run. If they see it's flaky, if there is nothing to fix, they just retest. And sometimes you would just think, okay, that was flakiness, I'm just going to retry, not even looking at the test results, which I would call the retest trap. And we have actually had retests... I mean, the highest number I've seen was something like 25 retests on the same commit. Do I have to... oh, I have to stay here. Okay. And another very bad thing: I guess any CI system has something like an acceleration system, where, for example, it tests multiple git commits at once so that it can merge them all together. And of course, if there is a flaky test, this acceleration effect will just be reversed; it will not be effective.
It will not be effective Yeah, like I said another wasted wasted thing so also flaky test Trust issues at the developers themselves because they of course lose the trust in automated testing Which is really sad because that's All that we want to do we want to trust the test But if we can then then of course we are just ignoring test results, which is not a good idea So how so we want to minimize the impact in our CI so that people don't Experience that much friction Time flies so What we do there is we quarantine those tests we put them out of the set of stable tests and Put them in another set so that they are not run during pull request runs But we only want to do that as we want to do that as early as possible when we Detect the flakiness, but only as long as necessary because tests on themselves of course have value So otherwise it wouldn't be there What do we need for that? We need some like mechanism where we can put stable test from the set of stable test to the set of quarantine test Of course, we also need a report over the flakiness So we can triage Which flaky test we need to act upon first if you have a lot of flaky test that matters so because the higher The flakiness of the test is of course the highest impact And yeah, lots of data because of course you need to somehow analyze whether a test is even flaky or not So as I already said I already described this this is like a The latest commit on a merge PR where we have some flaky test or some failing test runs Which later on got green on the same commit so no changes on the code So This is not of course not saying us that is it is actually flaky But it might might be flaky and like you said it could either be in production code or in the test code itself But that doesn't matter in the end the Problem that we have is the fiction NCI and the wasted resources there So our flake process is pretty well pretty pretty rough I'd say are pretty pretty easy So we have regular meetings where we look at the at the 
results and at the flakes, and then we decide what we want to do with those flakes. So first of all, of course, you have to know whether a test is flaky or not. You look at the test results and decide whom you should contact to fix it, because we don't fix the tests ourselves; we let the developers do that, because, yeah, they created their mess, they should clean it up. A problem, of course, is when people are gone from the project; then someone else has to care. So we hand the flaky tests to the developers, and when a test has been corrected, we bring it back in. The tooling we have is this: the main thing that decides whether a test is run for a pull request is just a note on the test itself. In the test name there is this "QUARANTINE" word, which is the keyword that makes the test get ignored for the pull request runs. We do still run those tests, to keep the stability signal, but not in the presubmits, which are required for the pull request merges; instead in the periodic runs, which run, I think, three times a day, so that we still have a signal telling us when we can take a test back in, in order to have its value added again. Another thing, of course, is that you need a report. This is a not really nice looking, but efficient, thing: a heat map, where you see where the action is going on. You see, the more reddish the colors get, the worse the problem is. This is in... oh, no, I can't go there. At the top you can see, for each day, how many failures occurred, and there is another axis, which is the per-lane failures, so that we can pretty much see which lane is flaky and where the biggest impact was. This is the first time I'm using this, sorry.
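In KubeVirt the quarantine marker is a keyword inside the test name, as described above. A name-based split like that can be sketched as follows; the exact marker string and the test names are assumptions for illustration:

```python
QUARANTINE_MARKER = "[QUARANTINE]"  # keyword in the test name; exact format is an assumption

def split_suites(test_names):
    """Partition tests into the presubmit (stable) set and the
    quarantined set that only runs in periodic lanes."""
    stable = [t for t in test_names if QUARANTINE_MARKER not in t]
    quarantined = [t for t in test_names if QUARANTINE_MARKER in t]
    return stable, quarantined

tests = [
    "migration succeeds with live traffic",
    "[QUARANTINE] hotplug volume survives restart",
    "vm boots with default network",
]
stable, quarantined = split_suites(tests)
print(len(stable), len(quarantined))
```

The appeal of a name-based marker is that quarantining is a one-line change in the test itself, visible in code review, and the periodic lanes can select the quarantined set with the same string match.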
I keep switching the directions. Okay, this is the detailed report about how flaky a test is, or how flaky those tests are, ordered by the number of failures occurring per test. This is a bit overwhelming, I think, but in the left column you see the test name, and in the top row you see the test lanes for the latest three versions. We have a lot of test lanes maintained by different SIGs, and this obviously creates a matrix of at least 12 really important lanes, which absolutely have to be stable. This helps us find which tests we should look at and quarantine, and which we shouldn't. We also have long-term metrics, where we can see how we were doing in the past, because everyone of course wants to know whether they are improving or getting worse at handling flakes. So we have long-term metrics where we can look at how we were doing: how many merges per day, for example, or how many merged PRs with zero retests, which is the number we currently measure against the most, because obviously that number should be 28 out of 28, but we seldom reach that, due to flakes. We also have a small report of the tests that are currently in quarantine, so that we can find them quickly; grepping over the code base is of course also doable, but it is easier to just have a report that we can look at straight away during our meetings. And then, finally, we have testgrid, which also collects all the periodic results, so that we can deduce whether the tests have been stable or not.
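Both reports described here boil down to counting failures under a grouping key: day and lane for the heat map, test name for the ordered report. A rough sketch of the aggregation (the record shape is an assumption):

```python
from collections import Counter

def heatmap_counts(failures):
    """failures: iterable of (day, lane) pairs, one per failed run.
    Returns a Counter keyed by (day, lane) -- the cell values of the heat map."""
    return Counter(failures)

failures = [
    ("2024-02-01", "e2e-network"),
    ("2024-02-01", "e2e-network"),
    ("2024-02-01", "e2e-storage"),
    ("2024-02-02", "e2e-network"),
]
cells = heatmap_counts(failures)
hottest = max(cells, key=cells.get)   # the "most reddish" cell
print(hottest, cells[hottest])
```

The same Counter, keyed by test name instead of `(day, lane)`, gives the failure-ordered table used for triage.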
So this is the tool; I guess the folks from the Kubernetes ecosystem know it, because Kubernetes also uses testgrid for collecting all the test results, so that you can quickly drill down. Yeah, and we have also established another lane that checks the tests for stability, which does things like making test dependencies visible. I guess you know what a test dependency is: some test hasn't cleaned up and left a mess for other tests, influencing them so that they might be failing; or, the other way around, it might have set up something that was sufficient for the following tests. If you just randomize the test order, you catch those cases, because you have to have isolated test cases, right? And it also tries to run each test five times, because, like I said before, in this meta report, a bit more than 80% of the flaky tests were failing within about five runs. It's not that you catch all of them, but the majority. Yeah, and that's just the CI search tool. So, in a nutshell, we just hold meetings at regular intervals where we look over the data, like I described before. What we want to do, of course, is collect even more data: we want to run the majority of tests in the same way as we are doing in the flake lane, running them five times in a row and also always randomizing the order, so that we have a better picture of how flaky our code base is. And yeah, of course, we want to avoid this retest problem where you blindly just retest your things, so we are looking for ways to directly detect that case. Yeah, so I've been running through it pretty quickly. Any questions? Yeah. So you've been talking about the responsibility of devs to fix the flakiness. This kind of assumes that the flakiness is introduced either by new tests, or by changes to tests, or changes to the code base. But what about flakiness that is introduced by
your infrastructure, actually, like network latency or things like that? Do you have those problems, or is it something that you... I didn't get it, could you repeat the last sentence? Sorry, sorry. So, you imply that flakiness can either be introduced by new tests, or changes in tests, or changes in the code base; but have you ever been confronted with flakiness introduced by your infrastructure, like network latency or something like that, and how do you detect it? Of course, of course, that is also a problem. But when you have flakiness in your test infrastructure, or even failures in the test infrastructure, that's an entirely different problem, and what we have observed there is that a lot of tests tend to fail then. That's what we look at first of all: when we have, as a rough estimate, more than 20 tests failing in one run, that is likely because the test infrastructure is failing, and actually we decided to just quickly verify that there is something going on in the infrastructure, and then just disregard that run. Yeah, in earlier days we had that problem pretty often, but in recent days it hasn't been happening anymore, or much less, let's put it like that. Of course, of course we look... So, what we have to test are our e2e tests. KubeVirt is a complex system; it's an addition on Kubernetes so that you can run virtual machines, and for testing that, for testing e2e, you need a full Kubernetes cluster on which you deploy KubeVirt, and that's what we're doing in the CI. So we are actually spinning up, I would say, a frozen cluster: virtualized nodes that have been frozen and that are spun up on demand. This takes around one and a half minutes to spin up such a cluster, and then you run all those tests, and we have, like, always three versions of the... Thank you very much, we are running out of time. Yeah, you can continue afterwards. Thank you.
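The stability-check lane described in this talk — rerunning the suite several times in a randomized order to surface both order-dependent and intermittent failures — could look roughly like this sketch. The `stability_lane` function, its five-run default, and the toy tests are all illustrative assumptions:

```python
import random

def stability_lane(tests, runs=5, seed=None):
    """Sketch of a stability lane: run the suite several times in a
    randomized order to surface order-dependent and intermittent failures.
    `tests` maps a test name to a callable that returns True on pass."""
    rng = random.Random(seed)
    failures = {name: 0 for name in tests}
    order = list(tests)
    for _ in range(runs):
        rng.shuffle(order)            # randomized order catches test dependencies
        for name in order:
            if not tests[name]():
                failures[name] += 1
    return {name: n for name, n in failures.items() if n}

# A toy intermittent test that fails on every other invocation.
calls = {"n": 0}
def sometimes_fails():
    calls["n"] += 1
    return calls["n"] % 2 == 0

suspects = stability_lane({"stable": lambda: True,
                           "intermittent": sometimes_fails}, runs=5, seed=1)
print(suspects)
```

Five reruns reflect the statistic quoted in the talk: a large majority of flaky tests manifest within about five runs, so the lane catches most of them without unbounded cost.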
Chaos Engineering in Action: Enhancing Resilience in Strimzi
This is nice. So, hello guys. Today we have prepared a presentation about chaos engineering in action. I am Maroš Orsák, and this is Henrik Srenching, and we both work as software engineers at Red Hat. Today we have also prepared a quiz. You can see the QR code; you can scan it with your phone, and if you are quick enough and you get the correct answers, you can win a prize. So, over to you, Henrik. Yeah, so the content of the presentation is as follows. We will begin with a brief explanation of chaos engineering. Then we will describe how the target systems may actually look in production. Then we will turn our focus to designing chaos. Afterwards there will be two brief demonstrations, and then a quick conclusion on how to actually work with chaos. So, when we are thinking about system resilience, or application resilience, we have to think about all the components our application depends upon: that means other components and other services. There is also a big dependency on the network and infrastructure. All of these things are most visible in distributed systems. There are many known fallacies about distributed systems, mostly concerning the network and bandwidth. When we then look at a system from the viewpoint of many instances and services which have to communicate with each other in order for the system to work, we come to the problem of complexity, and the fact that there is possibly no single person who can understand the system completely, and every state the system can get into. So, what can happen, and what will probably inevitably happen, in a system of such magnitude, is that one instance, or more, will crash. This is the story of Chaos Monkey, which I guess some of you may be familiar with; all we need to know so far is that it was one of the first chaos tools, which just randomly kills some instance in production, forcing engineers to take proactive action to make the system more resilient.
We can take this a step further and bring down not just a few instances but an availability zone, or a cluster, or break some kind of network traffic, and get the system into a state we are not so comfortable with in a production environment. So we get to the definition of chaos engineering: it's experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. This may sound weird, because why would anyone want to bring chaos into production? Isn't it something we should actually avoid? The real reason for doing so is the time difference: it's much easier to solve problems at 4 p.m. rather than at 4 a.m., when you are under high pressure from the customers to solve them. There are many principles which we have to abide by, or should abide by, in chaos engineering. The first and most important one is a minimal blast radius for each experiment you conduct. We should imagine some red button for each experiment, which should be able to stop it in case anything goes wrong. The other principles are mostly focused on the same thing: testing things the way they would happen in real life. We want to focus on how it actually works in production, we want to make sure it works correctly, and we want to introduce the problems that may happen in real life. The last principle is continuous runs, which is basically about running these tests, or experiments, as often as possible and as effortlessly as possible. Now, over to the target systems. It all started with the monolith architecture, where you get one box, one backend, one database, and one UI. In terms of complexity it was quite low: you simply get some user connections, and the load on the server was not so high.
Then, after some time, you add more and more customers, let's say four or five thousand, and the load gets pretty high, and the server might just crash, for instance. Such an architecture is really hard to scale horizontally, and one way to tackle this problem is to scale vertically, but you can't scale vertically forever. The second point is that the fault tolerance of such an architecture is really bad: you just target one node, the server immediately crashes, and the users are really sad because they don't get any response. So then Docker came, with the microservice architecture, where all of the previous improved: we got portability and isolation. We somehow got better horizontal scaling, but in the case where you have thousands of instances, it would be quite hard to manage all of these containers. On the other hand, the complexity also increased here, because of the network traffic and more. And so Kubernetes came, to solve scalability in the horizontal sense. In Kubernetes, if you want to have one replica of the system, you just type it in the YAML file, apply it, and Kubernetes will do it. Then, if you see your server crash, or get overloaded with requests, you simply set it to three, and Kubernetes will do it. The same with fault tolerance: if you inject some disruptions or something else into the pods, one will still be up if you only target two of them. But still, complexity increased again. And so we are at the operator stage, where no one can entirely grasp the system in terms of its behavior. And I want to present one such operator: Strimzi. Strimzi is basically Apache Kafka at its core, encapsulated in the Kubernetes system. On top of that, you get some operators which simplify upgrades and dynamic configuration; there is tracing, more security involved, and also Grafana dashboards. And it is part of the Cloud Native Computing Foundation.
But that's quite tough, too many unknowns, right? So let's break this down. Apache Kafka has a lot of buzzwords, as you can see: publish-subscribe model, messaging system, and so on, but that still doesn't help, right? So let's move to the basics of Kafka. We've got some producers — not these ones, but some clients, right? These clients send messages to the broker. They are happy because the connection is up. We could also scale the system: we could create another Kafka broker, set up some listeners, and another one. We've got a second set of clients, which are called consumers, and they simply receive this data. So we've got this simple example of the system, where you have producers and consumers, but we also need some representation of the data, which is Kafka topics. Also, each Kafka broker has its own configuration, and you can basically set up versions, set up in-sync replicas, but this is not important for this talk. So we've got a lot of buzzwords, as you can see, but unfortunately, or maybe fortunately, we don't have time for them. So we can stick with this model for now. We've got the producers, we've got the consumers, we've got some brokers, which are the servers. And what if we encapsulate the system in Kubernetes? On top of that we add some operators managing the Kafka ecosystem, and on top of those we have the cluster operators, and this is basically Strimzi. Really complex, right? So here we can see an example deployment of Strimzi, where you've got a lot of connections. These components are not really important now; the main idea here is that even with this small deployment, you get a lot of places where you can inject the chaos. So now I want to say that when we go to production, one such production environment is the scale job, and before I dig into it, I want to thank these guys, because without them we would be unable to run such chaos at such a massive scale.
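The producer–broker–consumer model just described can be illustrated with a tiny in-memory sketch. No real Kafka is involved here; the class and topic names are purely illustrative, and real Kafka adds partitions, replication, and consumer groups on top of this idea:

```python
from collections import defaultdict

class ToyBroker:
    """Minimal publish-subscribe sketch: producers append messages to a
    named topic, and each consumer reads them independently at its own offset."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> ordered log of messages
        self.offsets = defaultdict(int)   # (consumer, topic) -> next unread index

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, consumer, topic):
        """Return all messages this consumer has not seen yet."""
        log = self.topics[topic]
        start = self.offsets[(consumer, topic)]
        self.offsets[(consumer, topic)] = len(log)
        return log[start:]

broker = ToyBroker()
broker.produce("orders", "order-1")
broker.produce("orders", "order-2")
print(broker.consume("billing", "orders"))   # billing sees both messages
print(broker.consume("billing", "orders"))   # nothing new on the second read
```

The key property mirrored from Kafka is that the topic is an append-only log and each consumer tracks its own position, so producers and consumers are fully decoupled.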
So, as I said, the scale job is the production environment for Strimzi and other projects, and there are a lot of technologies involved, such as Tekton pipelines, Prometheus, Grafana, Loki logs, and more. And here you can see a basic example of how we induce the chaos. Here we have some Kafka clients, Streams producers and consumers, with some databases which are communicating with Kafka Connect. We have a MirrorMaker which transfers data from Kafka A to Kafka B, but that is not the intention of this slide: there are a lot of connections. So, I think, over to you, Henrik. Thanks. So the point of these slides was actually to show, or somehow explain, that when we come to a system and take a first look, it may look quite messy and quite complicated. We may not understand the whole underlying technology stack, or every single component, and we are in the position where we want to talk about how the system actually behaves when we introduce chaos, whereas we are not even sure how it should behave normally. That is made worse by the fact that the system doesn't behave how it does on paper; in actuality there are countless instances and connections, operators, clients, network traffic. We need to have some sort of observability, and some intuition about the system. As in other presentations before us, there were already some mentions of Prometheus and Grafana. They are quite famous for this purpose, so we will be using them as well. As mentioned, we need to have some intuition about the system and how it behaves. Without that, it is just a mess. So, when we actually want to introduce some chaos into the system, we start with a search for the problematic parts of the system, for what we actually want to focus on. It is a simple process, where we take a basically simple look at the system.
We look at what the critical components are, where some possible bottlenecks are, whether some parts of the network are really critical here, whether there are some real-world events that can make my system vulnerable for some time, like rolling updates or node restarts in the cloud, and things like that. What would be really helpful is to collaborate with all the people involved in the system. We definitely need some input from the devs; we need at least some basic information about the architectural components. What we may come up with is a simple document describing all the important parts, things that may occur there, or protocols that are involved, and we will naturally come to the important configuration parameters, and maybe even some proposals for simple chaos that could be included. So the output of this, in reality, is a first look at the part of the system which may actually be targeted for simple chaos. Now that we have at least some first insight into what could be our first guess to start the chaos with, we may focus on concrete chaos, and we may start with some simple experiments. Now, how to actually formulate some kind of hypothesis, or some sort of experiment? We will look at a specific thing, just a part of the system, or a few components. Say we decide to make sure that the core part of our system is actually capable of withstanding some instances being lost, or having some failures. Because this is still a production environment, and although it was even in the main principles of chaos engineering, we don't want to start with chaos in the production environment. I guess everyone here knows why: the first intern who tries to introduce some chaos will bring down all the instances, the service will not be available for the whole day, and good luck explaining that to your boss.
So we will probably start on a smaller scale, in a stage environment, with much smaller traffic and much smaller stakes; let's say there will be some clients, maybe just a random fraction. We will have a few instances and a few controllers. We start by making sure that the system is in a steady state: we have our instances up and running. When we are sure about it, we can introduce the chaos. When we introduce the chaos, instances go down, and afterwards the system stabilizes by bringing the instances back up. During all this time we are observing all the important metrics and parameters of the system; for example, it could be messages per second. Now that all that is set and done, we can actually implement our chaos. What can be really helpful for this are chaos tools. We will not describe all of them, but simply mention that there is Chaos Mesh, Kraken, or Litmus, or some other choices. They help with the definition, evaluation, execution, and all the other stuff. We end up with very simple YAML files to be executed. Now we can actually execute our chaos and see that everything went as expected: there was a small decrease in the traffic, but overall the system got to the desired state after a while. Okay, this was the first experiment in stage. Everything went great. We've got the good feeling of the resilience of our system being confirmed, but what we are supposed to do now is to repeat the experiment, scale it up a bit, and go into production, because it is this production environment where we will get the confidence. What may happen is that it will not go according to plan at all. It may fail miserably, and this is also the reason why we should scale these experiments up slowly, and also the reason why we eventually want to run them in production: because we want to really make sure that this environment, which is so important for us, is actually able to handle that problem.
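The experiment flow just described — verify steady state, inject chaos, observe, and confirm recovery, with a "red button" that always stops the chaos — can be sketched as a small harness. The callbacks and the toy self-healing system below are illustrative assumptions, not any particular chaos tool's API:

```python
import time

def run_experiment(is_steady, inject_chaos, stop_chaos, timeout_s=60.0):
    """Sketch of a chaos experiment: verify steady state, inject chaos,
    then wait for the system to return to steady state within a timeout."""
    if not is_steady():
        return "aborted: system was not steady before the experiment"
    inject_chaos()
    try:
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if is_steady():
                return "passed: system recovered"
            time.sleep(0.01)
        return "failed: system did not recover in time"
    finally:
        stop_chaos()   # the "red button": chaos is always stopped, pass or fail

# Toy self-healing system: a "controller" restores the lost replica
# after a few steady-state polls.
state = {"replicas": 3, "polls": 0}

def is_steady():
    state["polls"] += 1
    if state["replicas"] < 3 and state["polls"] > 3:
        state["replicas"] = 3      # simulate the controller recreating the pod
    return state["replicas"] == 3

result = run_experiment(is_steady,
                        inject_chaos=lambda: state.update(replicas=2),
                        stop_chaos=lambda: None,
                        timeout_s=1.0)
print(result)
```

The `finally` clause is the point: whatever the observed metrics say, the experiment's blast radius ends when the harness does.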
So as I said, no reason to despair, just keep on trying in stage, and definitely start slowly. So, to the demos. Okay, so today we prepared two demos for you. The first one is the brokers' failure. Here we will target the Kafka pods. We have seven replicas of Kafka and we will be targeting three of them. The observability, the metrics we would gather, would be things like throughput, CPU, memory and the traffic in the Kafka pods. Then we will also define a steady state, which is basically that all broker and client replicas are okay and the communication throughput is stable even when we inject the chaos. And if we define the hypothesis, it would be: we will be eliminating three of the Kafka pods, this will not eventually cause some cascading failure, and we will be okay, users will not be affected by this. And also we will have some checks on the producers and Kafka pods. So let's move on to the demo and hopefully it will somehow work. Okay, so here we have some setup. We have a Kafka cluster, we have some nodes, we have producers. Here the pod chaos is defined. We have mode fixed with a value of three; three means that we will be targeting three pods, which will be unable to run, and the duration for this will be three minutes. So let's try to inject the chaos with our script. Yeah, so now we are injecting the chaos and we see that three of the pods are not running. Let's move to the graphs on the Grafana dashboards where we have some metrics. Here are some really simple, not production ready, messages per second, as you can see. Now you can see here the immediate decrease of connections. There is also a decrease in the average number of messages, but Kafka will recover even while the pods are down. So here's the decrease, but after a time we see that it eventually recovers somehow. Yeah, and as we can see, we now have only four brokers online, which is correct, and there are also some under-replicated partitions.
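The pod chaos from the demo might look roughly like the following as a Chaos Mesh definition. This is a sketch, not the speakers' actual file: the namespace and the Strimzi broker label selector are assumptions, so adjust them to your cluster.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kafka-broker-failure
  namespace: kafka                        # assumed namespace
spec:
  action: pod-failure                     # make the selected pods unable to run
  mode: fixed
  value: "3"                              # target exactly three of the seven brokers
  duration: "3m"
  selector:
    labelSelectors:
      strimzi.io/name: my-cluster-kafka   # assumed Strimzi broker pod label
```

Applying this with `kubectl apply -f` injects the failure for three minutes, after which Chaos Mesh restores the pods and you can watch the recovery in the dashboards.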
Yeah, Kafka is okay now, and after this the experiment will be done. I think it is done now. Yeah, so now we do the checks. We are checking the Strimzi pods and the Kafka resource, which is just an internal custom resource of Strimzi, and now the Kafka pods are ready, we're complete. And in the Grafana dashboards we will see that the brokers go back online and the under-replicated partitions all go to zero. And here it is. Okay, so this was the first demo, and we also have the second one. This is basically a worker node crash, and to quickly describe it, the topology is that we have the producer, we have Kafka A and Kafka B with some consumer, and in the middle there is Kafka MirrorMaker, which basically just transfers data from Kafka A to Kafka B. The steady state again is that all services are fully available and ready to accept traffic. We made the hypothesis that eliminating one of the Kubernetes worker nodes will not bring down any services, and also the producers and consumers will not be affected; they will simply keep sending messages without any harm. So let's move to demo two. I will show you the important things. So we have the source Kafka cluster, the target Kafka cluster, MirrorMaker, we have some worker nodes, and we inject the chaos. We also create continuous clients, a producer and a consumer, to check for correctness: that all messages are sent and also received without any harm, with no connection refused or anything like that. So now we reset, or crash, the worker node. We will see that the worker node moves from the ready state to not ready. Here it is, it's not ready, but the clients are successfully and happily sending and receiving messages. The script is just checking that the worker node is still not ready, and we are waiting for recovery. It will take some time; it should be back in a while. And now the worker node has just moved back to the ready state.
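The middle piece of that topology could be declared with Strimzi roughly as follows. A rough sketch only: the cluster names, bootstrap addresses and topic pattern are illustrative, and your Strimzi version may require further fields such as a Kafka version.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: kafka-a-to-b
spec:
  replicas: 1
  connectCluster: "kafka-b"              # MirrorMaker 2 runs against the target cluster
  clusters:
  - alias: "kafka-a"
    bootstrapServers: kafka-a-kafka-bootstrap:9092
  - alias: "kafka-b"
    bootstrapServers: kafka-b-kafka-bootstrap:9092
  mirrors:
  - sourceCluster: "kafka-a"
    targetCluster: "kafka-b"
    sourceConnector: {}
    topicsPattern: ".*"                  # mirror every topic from A to B
```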
We can see that all the containers which were affected on that specific node are being created again, and the producers and consumers are still sending and receiving messages. We do some checks again on the stateful sets: yes, this is okay. The target cluster recovered as well. We're also doing checks on MirrorMaker, and the script just finishes successfully and we are happy. Okay, so those were the two demos, and for the last words, over to Henry. Yeah. So as you could see in the demonstration, the benefits of the chaos, or the execution of the chaos, are a bit different from the testing we are used to. There was quite a big hype about chaos engineering and the possible benefits it can bring to an organization. Yes, it can definitely reveal bugs in production; you can drastically improve the actual performance there, or the situation in the cluster regarding the resilience of the system. But the main benefit of doing such a thing is getting confidence in the system and finding misconfigurations. Those of you who have tried running an application in Kubernetes know how important it is to have all the volumes, all the liveness and readiness checks set correctly, and the overall infrastructure set in place. The greatest benefit is in fact getting experience and new knowledge about the system and really understanding how it is supposed to work. This is not a holy grail, as I said, and it can be a bit disappointing for some. But if we think about chaos engineering as a natural step above the other testing, and not their replacement, we can see a great benefit in it. So how can we actually embrace it in our organization? A very well-known concept is game days, when we put together a lot of roles and a lot of people from our organization, introduce some kind of chaos, and let them handle it in some reasonable manner, where they can all communicate, all contribute, and fix the problem in a reasonable time. That's a friendly way to start with it.
Know your tools. I know it can be overwhelming; you could see even in the demo that we had to introduce quite a lot of tools in order to run even simple experiments. But once you know the basics and have some confidence in them, you can really start to make some kind of chaos. We can really recommend some great books about chaos engineering, and Kafka if you want. But still, there are a lot of tools, and what is most important is to definitely start small. Don't be afraid to set up some stage environment where you can actually practice and confirm your hypothesis before you actually go into production and start doing mayhem. Thank you for your attention, really appreciate it. Questions? No time? One question. Question? Yeah? Yes, there are. It actually depends. In practical terms, it mostly starts to make sense only when we are talking not about some kind of monolithic application, but about something actually deployed on a cloud, some kind of microservices architecture. I would say that it does not depend as much on the size of the system as on how much you depend on the customer experience, in a sense: when will it really be detrimental for your system to get into a chaotic condition. But yeah, thank you as well.
Progressive Delivery Made Easy with Argo Rollouts
Thank you for being here. I'm going to talk about progressive delivery, and hopefully by the end of this talk you're going to know how to easily do canary deployments on Kubernetes. Who is using Kubernetes today? Raise your hand, please. Everybody. I'm not asking if everybody knows what Kubernetes is, because otherwise you're in the wrong place. I'm a principal scientist on the Adobe Experience Manager Cloud Service; this is a content management system. I'm a long-time open source contributor to Maven, Jenkins, Puppet, a few other things. I'm also part of the Google Developer Experts program. But probably most of you know me because of what I did with Jenkins on Kubernetes. Some people will love it, some people will hate me; we'll talk about that later. Actually, just before this talk, 15 minutes before, I realized: oh, this was 10 years ago. Time flies. Back then people didn't know what Kubernetes was. So, what is progressive delivery? This was August 2018; this is when the term was coined on the LaunchDarkly blog, and it was also picked up by RedMonk. And I said, this is a great name for these things that everybody already knows about, but the name sums up very well what we're trying to do. So I said, I'm going to steal this. That's the gist of it. So, it includes deployment strategies that avoid this: I'm going to push this new version to all my nodes, all my containers, all my files, whatever it is that you're running, I'm going to push it to all of them, and if it breaks, it breaks for everybody. We want to avoid that. So, with progressive delivery, you have new versions that do not immediately replace the existing versions, and you have both old and new versions running in parallel for an amount of time. But the interesting part is that this is happening in production, and you can evaluate both the old version and the new version during a period of time, whatever you figure out is the best window for you.
And before saying that this is a successful thing, that I need to roll it out to everybody, to all my customers. So, continuous delivery is hard. I like to say that progressive delivery makes continuous delivery easier to adopt, because it reduces a lot of the risk associated with continuous delivery. Yeah, it's great that you commit something to main and it gets pushed to everybody, but what if that breaks in production? Then you have these methods behind progressive delivery that will prevent you from breaking things and give you guardrails that will protect your users. The key points: avoiding downtime; limiting the blast radius, so you deploy something and it only affects a subset of your users, not all of them; and shortening the time from your idea to production. So, from the time you create a commit until you push it to production, you can use these techniques to shorten that time as much as possible. And since it's not affecting your live customers, it could affect maybe your internal customers, employees, something like that, you can confidently push things to production. The name is great, but all the techniques have already existed for a long time. We have rolling updates on Kubernetes. This is the standard way: when you change something on your deployments, you just get a new pod with the new version, and when that pod comes up, the old pods start going away. And you can configure that easily on Kubernetes: you can configure how many pods you want to come up, whether you want them to come up little by little or all at once, and they will start rolling. So that has been around. Blue-green deployments, same thing, they've been around forever, well, for some definition of forever. You have what you consider the old version, which is green, and the new version, which is blue, or the other way around, I don't know. And you have both running at the same time.
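Those rolling-update knobs live in the Deployment's update strategy. A minimal sketch with hypothetical names and image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app               # hypothetical name
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2            # at most two extra pods may come up during the update
      maxUnavailable: 1      # at most one pod below the desired count at any time
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: example/my-app:2.0   # the new version being rolled out
```

Setting `maxSurge: 100%` with `maxUnavailable: 0` approximates the "all at once" behaviour; small values give the little-by-little roll.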
You evaluate, or you start sending traffic to, the new version, and if something happens, you just have to flip the switch to go back to the old version. Canary deployments are a variation of this. The difference is that with canary deployments you don't need to have all the machines, all the containers, running at the same time; with blue-green you need to have room for both versions running at the same time. Canary deployment is one of the most interesting ones: you send a small percentage of the traffic, or a small percentage of your users, to the new version, and you keep growing. I mean, you could just stay at a small percentage, or you could keep growing that canary percentage. A lot of companies do this: first, a change gets deployed to internal employees only, then to some countries, somewhere like New Zealand, or to a percentage of users depending on some characteristics of theirs. And they keep growing this canary pool over time until they reach 100%. Feature flags are another interesting one, which allows you to push things to production behind a flag, so you can test them in production, and also disable them after you deploy them. You push something and realize it breaks, either for a lot of users or for a percentage of users, and you can switch that feature off using some tool, or using something as simple as environment variables. There are tools that allow you to manage feature flags so you don't have to deal with environment variables and things like that. Monitoring is the new testing: the goal is to know when users are experiencing issues in production, and the other characteristic, I think, is to react to the issues automatically. So if you deploy something that is bad, how can you automatically roll it back before some human has to go and figure out what happened? So, did you know that 90% of outages could be solved?
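As a sketch of the environment-variable flavour of feature flags mentioned above (the variable name, image and key are all hypothetical), a flag can be wired into the pod spec so the application reads it at startup:

```yaml
# Fragment of a Deployment pod template
containers:
- name: my-app
  image: example/my-app:2.0
  env:
  - name: FEATURE_NEW_CHECKOUT    # hypothetical flag the application checks
    value: "false"                # flip to "true" to enable; changing it rolls the pods,
                                  # but no code change or new image is needed
```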
There's a study that says 90% of outages could be solved with progressive delivery. Did you know that? No? Because I just made that up. And one thing you need, one requirement, is having a good amount of metrics: you need to know what's happening in your production system before you can react, which users are seeing the new version, which users are breaking with the new version, what's happening there. So you need to have this visibility. And I always love to plug DevOps Borat, who disappeared from the Twitter servers: to make error is human, to propagate error to all server in automatic way is DevOps. Raise your hand if you have broken a lot of servers by doing things automatically. So yeah, what I love to say is: if you haven't broken something automatically, you haven't automated enough. When you get there, it's like, okay, maybe I should step back a little bit; until you get there, you keep automating things. Now, more to the practical side: how can I do this in Kubernetes? An introduction: who's familiar with Ingress? Ingress in Kubernetes, okay, yes. Ten years ago, this was not like this. So on Kubernetes, you had the load balancer, and you could have services, and the load balancer would send traffic to one service or another. But this was kind of the old way. The new way is that you have Ingress controllers, where the Ingress controller is running inside Kubernetes. Typically you route by domain names, but you could also route by headers and all sorts of things. And that Ingress sends the traffic to whatever service you're running. So you can have one service A and one service B, with their pods, and the Ingress is where you configure: this domain goes to this service, this other domain goes to this other service.
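That domain-based routing is what a plain Ingress expresses. A minimal sketch with made-up hostnames and service names:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  rules:
  - host: a.example.com            # traffic for this domain goes to service A
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: service-a
            port:
              number: 80
  - host: b.example.com            # and this domain goes to service B
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: service-b
            port:
              number: 80
```

Which controller actually serves this (NGINX, a cloud load balancer, etc.) depends on the ingress class configured in the cluster.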
And there are a lot of Ingress controllers out there. If you run on a cloud provider, you're going to have the AWS one, the GCP one, whatever. And then you can have your own: NGINX, Ambassador, Istio, Traefik, there are a lot of them. And Argo Rollouts, anybody using Argo? Wow, okay. What are you doing here? I mean, you already know this. So it provides advanced deployment capabilities. All the things that I mentioned: blue-green, canary, canary analysis, experimentation, they are variations on the same thing. Argo Rollouts provides that to you and makes it very easy to do. And the good thing is you don't need to use Argo CD to use Argo Rollouts; you don't actually need to use anything else. You can run Argo Rollouts just with Kubernetes, nothing else, no external dependencies. And yeah, it allows you to do this very easily. A bit on the architecture of Argo Rollouts: we have the controller that is watching a new object called a Rollout. So Argo Rollouts has this object that can replace or complement your existing Deployments; I'll get to that in a bit. This Rollout manages the ReplicaSets. Typically you would have your Deployments with their ReplicaSets, and now they become part of the Rollout. And it has the concept of an AnalysisRun that will check metrics or any other external source, and this analysis decides whether the rollout is successful or not; based on that, it's going to cancel the rollout or keep it going. So you get the traffic coming from the Ingress into your services, and you can tell Argo: okay, send traffic to this new canary ReplicaSet, or send it to the old one. For the percentage-based one, you need a service mesh.
So if you need to do something fancy like, oh, I want to send 1% of the traffic, or I want some traffic that matches this header, then you need something like a service mesh, or the integration between Argo Rollouts and the Ingress controller. But if you use bare Kubernetes without that integration, you can still do it: basically, it will use the number of pods in the ReplicaSets. So if you have 10 pods, you can tell Argo, okay, one new pod is going to run the new version, and now you have a 10%/90% sort of split, more or less. You cannot do the fancy things that require support from the Ingress controller or a service mesh, but you can still do things. The Rollout object: you have two ways of defining the rollout. One is you replace the Deployment with the Rollout and add extra fields, or you create a Rollout that points to a Deployment. I don't much like the way of replacing the Deployment, because then people that are not aware of the Rollout objects may go and see: oh, there are no Deployments, what's going on here? So it requires you to change things, and for us it also means you have to change runbooks, you have to change commands, people need to search the documentation, and all that. So I don't know why the decision was made that way, but it's not something that I'm too happy about. Of course, you need all the YAML tooling to write these things. And let's go to the demo now. So I have here... I'm running the Argo Rollouts demo. This is hitting the backend, and it's returning one color or another depending on what is running on the backend. So right now I have the blue one. Let me see how I can do this more easily. What I'm going to do is change, update my deployment to use a new image that is going to be green. And... I lost the terminal. Okay, so it updated the image; let me make this big to show what it's doing. Okay, I think I pushed twice, and now I see two rollouts happening at the same time.
Otherwise it's not working. Here it is. Okay, so I have the green one. The one that shows as stable is the one that was running. So I think I have five pods running, and I pushed a new change, which is the canary, and this should be using the green image. Okay, there it is. So around 20% of the traffic is getting green, right? And how do I define this rollout? This part at the bottom is just the standard deployment configuration: what image do I want, what ports do I want to expose, and so on. But at the top I have the strategy configuration from Argo Rollouts. So I can say: point to this analysis template. This is what defines what is successful and what is not, and I'll show you that in a bit. And I have several steps: set weight 20 and then pause; set weight 40, pause for 10 seconds; set weight 60, pause for 10 seconds; set weight 80. These are percentages. So this is my definition of a rollout: 20%, then wait for me to manually do something. I only do that for demos; in real life that's a bit harder to do, but you could still do it. So right now it's waiting, because I set a pause, and it's waiting for me to give it the okay. I look at it, and it looks okay, so I can do the promote. And this is going to continue through the rest of the steps, so hopefully we'll see this in like 60 seconds. It should continue the progression until everybody receives the green color when they call the API. So this shows that just by creating a Rollout object with this small section defining what your rollout is, you can do this. There's nothing else you need; well, you need to install Argo. And what else can you do? Oh yes, you can also have a preview version. So you can have another Ingress pointing to your preview version. So even if I say I want zero traffic to go to the new version, all the existing traffic should go to the old version, but I want to see the new version in a new place.
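The strategy section described above might look roughly like this as a Rollout object. The names, image and analysis template name are illustrative, not the speaker's actual manifest:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-rollout
spec:
  replicas: 5
  strategy:
    canary:
      analysis:
        templates:
        - templateName: success-check    # assumed name of an AnalysisTemplate
      steps:
      - setWeight: 20
      - pause: {}                        # wait indefinitely for a manual promote
      - setWeight: 40
      - pause: {duration: 10s}
      - setWeight: 60
      - pause: {duration: 10s}
      - setWeight: 80
      - pause: {duration: 10s}
  selector:
    matchLabels:
      app: demo
  template:                              # the standard deployment part "at the bottom"
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: example/demo:green        # hypothetical image
        ports:
        - containerPort: 8080
```

With only pods and no mesh integration, the weights are approximated by pod counts; `kubectl argo rollouts promote demo-rollout` resumes past the indefinite pause.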
I can do that too. So that's very useful for preview-environment sorts of things. So if I go back, okay. While this continues running: this is running on Google Kubernetes Engine Autopilot clusters, but you can run it on any Kubernetes. Autopilot is pretty cool because you only pay for what you use, so if you scale things to zero, then you don't pay anything. What does it say here? Okay, so now green is the stable one; it says "stable" here. What if I want to do... I was talking about how this protects me, right? What if I do a rollout that is broken? Let's see. This works. Right. Okay. So now I push an image that is bad. So I'm changing the deployment. Of course, you would do this with GitOps; you would never push straight to production, but YOLO. So I'm pushing the red image, but this red image is returning 500 errors. And now Argo realizes: oh, this is giving errors, based on my analysis template that I'll show you. And this is in the degraded status. And it went down, it scaled it down, and my canary was marked as failed. And you see that only a small percentage of traffic got the red dots, and then it was automatically rolled back. So I think this is the power of doing progressive delivery. Of course, this is very easy when your application is exploding; it's very easy to see. People ask me: oh, can we detect it if a button doesn't work? Can we do this? Well, it depends on the button. Imagine you're Amazon and you break the button that adds things to the cart, and you get a metric that says nobody is adding things to the cart; then you're like, oh, something is really bad, right? So let me show you the analysis template. Is this the one? Yeah. In my case, my analysis template is a very complicated call that fails if this endpoint doesn't return a 200. But again, you can integrate this with whatever you want, with metrics. Argo Rollouts also gives you a nice dashboard.
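An AnalysisTemplate along those lines, a periodic call that must keep succeeding, could be sketched with the web metrics provider. The template name, URL, JSON path and condition are all assumptions about the demo, not the actual file:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-check
spec:
  metrics:
  - name: http-health
    interval: 10s
    failureLimit: 1                     # one failed measurement degrades the rollout
    provider:
      web:
        url: http://demo.default.svc.cluster.local/health   # hypothetical endpoint
        jsonPath: "{$.status}"
    successCondition: result == "ok"    # assumes the endpoint returns {"status": "ok"}
```

The same metric slot can instead point at Prometheus or other providers when you want to judge the canary on real production metrics.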
If you are not into the command line, you can come here, where I can see the status of my rollout and what the strategy is. As I said, Argo Rollouts supports multiple strategies, including some of the more complex ones. I can see my steps that I showed you before in the YAML: 20, 40, 60, 80. And I can see what was the last image that I pushed, and I could click here and do the clickety-click instead of writing YAML. Okay. So, yeah, as I mentioned before, if you're using a service mesh like Istio, it integrates with a bunch of service meshes and Ingress providers. So you could go and say, I want 1% of traffic, because Istio supports doing those things, instead of something more approximate: when you are using only pods, you don't have finer control, a pod is either going to receive the traffic or not, so it's more of an approximation. But with Istio and other advanced things, you can do more complex setups. It also hooks up with Prometheus, and there is support for multiple other sources to get metrics from. And yeah, hopefully you've learned how to do a progressive delivery canary deployment very easily; you just need to write some YAML here and there. Let me see. On here, this one. So you can add labels to the existing, stable version and labels to the new version, so you can do other things with services on Kubernetes. You can set what analysis you want to run and what steps to run, and everything else is just the template, the deployment template. If you don't want to put the deployment template in the Rollout object, there's another option where you just point to an existing Deployment. The only problem with that is that, when you're migrating, Rollouts is not going to scale down the Deployment. A colleague of mine submitted a PR to Argo Rollouts, which is going to be in the next version.
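That "point to an existing Deployment" variant uses the Rollout's `workloadRef`. A sketch with hypothetical names; the commented scale-down field is the opt-in behaviour that newer releases add:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-rollout
spec:
  replicas: 5
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-deployment       # existing Deployment whose pod template is reused
    # scaleDown: onsuccess      # newer releases: scale the Deployment down once migrated
  strategy:
    canary:
      steps:
      - setWeight: 20
      - pause: {}
  selector:
    matchLabels:
      app: demo                 # must match the Deployment's pod labels
```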
So if you have, like, thousands of Deployments, when you spin up a Rollout that points to a Deployment, once the rollout is successful it will automatically scale down the Deployment. So that fixes how it works today. Okay, so, yeah, and what's that thing? I lost my... did I close it? Yeah. Okay, so, just a quick summary, you saw everything, and I hope that this helped you and you can try it and do it at home if you like it. And I have time for two questions. Two questions. No questions. One question. I was wondering if you've been testing using the Gateway API instead of Ingress for the weighting? So the question is if I have tested using the Gateway API instead of Ingress. No, I have not been using that API yet, but I'm guessing that if there's no support already, there will be. We did not. Yeah. Hello. So my question is: in the case of a buggy rollout, for the particular traffic which is forwarded to the buggy instances, is it possible to automatically replicate it and send it to the stable versions after the failure? To ensure that even the traffic which hits the buggy rollout instances is served later by stable versions? So it's possible to roll it back automatically, but also... Yeah, the individual traffic, individual requests. So you don't want any user to see the errors? Yes. One thing you could do, if you use a service mesh probably, is clone the traffic and send the clone to the new version, while the actual traffic goes to the old version, and then you can see if the new version is breaking or not. But that's tricky, because you need to make sure that it's not changing your state: if you are only doing GETs, it's fine; not if you are changing state. That's my point: don't do the duplication in advance, because it will go to the parallel execution, but do it only when the first execution failed, because it went to the canary instance. Yeah, I think you could do that.
You send traffic to the new version, but it's a copy of the traffic that is not seen by any user, and then at some point you could say, okay, this is good, I'm promoting this. I think it's doable. Yeah, thank you. Okay. Thank you. Thank you.
Own your CI with Nix
Okay, all right. Hello, everyone. Meet Bob. Bob is a software engineer, and Bob just had the idea of the century for a new startup: GrayCat. GrayCat is a service that, given the picture of a cat, will return the same picture, but grayscale. Bob is really excited about that and just got some funding to start working on it. So Bob gets started. He chooses to write it in Rust, because that's cool and trendy, and uses GitHub, because that's the standard. So he writes the initial Rust boilerplate and the initial boilerplate for having that built by GitHub Actions, then git commit, git push. The first CI run is green. That's wonderful. Champagne. Now Bob decides to do something useful with that code, so he pulls in image2, which is a Rust library for doing image manipulation, uses it in the code, and builds it. The build is just fine locally. Wonderful. Now git commit, git push. The CI runs. It's not green. It's complaining about some missing data files somewhere. Okay, so Bob, no big deal, like all software engineers he knows how to use Google. So he searches and finds out that this image2 library is actually mostly a wrapper around a C++ library, which is OpenImageIO. And it turns out that Bob had that installed on his laptop, but the CI runners don't, which is why it's failing on the CI. But no big deal: Bob just tweaks the CI config a bit to install OpenImageIO before running the build, and now the CI is green. Wonderful. Fast forward one year later: BobCorp has grown quite a bit, and so has the tech stack of GrayCat. It's getting a bit complex, but no big deal; it's just a matter of having the right CI config file to make sure that everything gets installed. So the config file has grown a bit out of hand, with 5,000 lines, but it's no real problem. People just treat it as append-only: whenever they need a new piece of software, they just add a few lines to the config file to install it.
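Bob's tweak probably amounted to one extra step in the workflow; a sketch, where the apt package name is an assumption:

```yaml
# Fragment of .github/workflows/ci.yml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Install native dependencies
      # libopenimageio-dev is assumed to be the right Ubuntu package
      run: sudo apt-get update && sudo apt-get install -y libopenimageio-dev
    - name: Build
      run: cargo build --verbose
```

It is exactly this kind of line that accumulates over the year into the 5,000-line append-only config.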
And they generally forget to remove it when they don't need it anymore. But, I mean, no big deal; it works, right? And it's not like anyone would want to maintain that, because the feedback loop of having to make a change, push, wait for the CI, get the results... yeah, no. Okay, better keep it like that; we need to move fast anyway. It does cause some troubles now and then, like for instance when GitHub decides to update the base image of the runners, because given the complexity of the setup, obviously something breaks here and there. No big deal; it takes a couple of days sometimes to fix, sometimes a bit more, blocking things a bit. That's annoying, but, I mean, we have to move fast. Then one day the big deal happens. It's now 2025, Microsoft is in a slightly difficult financial situation and decides that this GitHub Actions thing, yeah, it's wasting money on it, so it just decides to shut it down. Well, not such a big deal. I mean, it's not like BobCorp is married to GitHub Actions. They just have to migrate this little config file and use another CI provider. That's what they do, and three months later they have actually managed to migrate it to a new CI config for a new provider. But by that time the competition has caught up, GrayCat is definitely behind, BobCorp is bleeding money everywhere, and this is the end of the GrayCat dream. So, a very sad story for Bob. But could he have avoided that? There's a bunch of things that went wrong, as you might have noticed. But most of these are just natural consequences of wanting to do things as quickly as possible; we can't take care of everything. But there's one practical choice that Bob made at the very beginning which caused the ultimate failure, and that was being stuck on one single service provider and being at its mercy. And Bob could have avoided that, hopefully.
The first option, the elephant in the room if I may say, is the blue whale, Docker, which would have given Bob an agnostic way of defining this CI environment that doesn't depend on GitHub. Bob could just have written a Dockerfile instead of the CI config file. Now, some things that wouldn't have solved: the feedback loop for a big Dockerfile is not that much better than that of the actual CI. Docker's layering is great for caching when you only need to touch the last lines of the Dockerfile; if you touch things at the top, you're pretty much screwed. The other thing it would not have solved is that, unless you're very, very careful, it's easy to have lines here and there that might break at any time because upstream decides to change something. But, I mean, this is okay. The big problem that Bob would still have is the libOpenImageIO issue we had at the beginning. I mean, Bob has his laptop, he's working on that. On his laptop, he has the code for the project, and he has the toolchains needed to build the project. Then the CI or the container has the same code for the project, checked out from the same commit, and also has toolchains to build it, but not exactly the same ones. They are provided by different means, so obviously there are going to be some differences that might break things down the line. If you're lucky, it breaks your build. If you're unlucky, your build still succeeds, but then something underlying behaves slightly differently and you have absolutely no clue why. So what would have been nice would have been to have Bob's laptop and whatever is running the code in the CI use exactly the same toolchains. And there's an obvious solution to that: just ask Bob to do all his development in the Docker container. That is great.
The thing that's not great here is that Bob's laptop doesn't only have toolchains and code; it also has his text editor, his config, his whole development environment, fine-tuned for years to make Bob as efficient and productive as can be. And if Bob has to develop in the container, he mostly can't access that easily. And now we get a very sad and angry Bob, and a very inefficient Bob. Now, there's one bit of the infrastructure that I only mentioned in passing and haven't paid much attention to. That bit is Cargo, the Rust package manager. The reason I haven't really talked about it much is not that it's not important; it's probably the most crucial part of the infrastructure, because it's the thing that pulls in the bulk of the dependencies of Graycat. The reason I didn't talk about it much is that it just worked. I was talking about broken things, because it's always funnier to talk about broken things, and Cargo was not broken, not at all. And the reason Cargo just worked, I think there are two reasons. The first one is that Cargo has been very transparent in its role. The CI runs Cargo to provide the dependencies for the build, and that's fast enough for the CI, so that just works fine. Bob on his machine runs Cargo for that too, and that works without preventing him from using all the rest of his tooling. So Bob is happy using it. And beyond that, Cargo is also declarative. There's one file, two files, that exactly define the set of Rust dependencies that your code has. When Bob runs Cargo on his laptop, Cargo just reads that file and provides the exact environment needed. When the CI clones the project and runs Cargo, it reads that same file and provides exactly the same environment. And that's why it works. Now, there's one thing that Cargo doesn't do properly, and that thing is everything except Rust packages. And yeah, that's a problem, because that's why we have this libOpenImageIO problem.
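The two files mentioned here are Cargo.toml and Cargo.lock. A minimal, hypothetical manifest might look like this (the package name and dependencies are just for illustration, not from the talk):

```toml
# Cargo.toml -- declares the direct Rust dependencies of the project
[package]
name = "graycat"
version = "0.1.0"
edition = "2021"

[dependencies]
# exact versions of these crates, and of all their transitive
# dependencies, are then pinned in the generated Cargo.lock
image = "0.24"
serde = { version = "1", features = ["derive"] }
```

Running `cargo build` anywhere, laptop or CI, reads these two files and reproduces the same set of Rust dependencies.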
So it means that actually the declarative aspect of Cargo is limited; it's declarative only up to a point. You really have to read the terms and conditions for that. And at that point, wouldn't it be great if we could have something a bit like Cargo, but more generic? That would be so awesome, if only such a tool could exist. Okay, so meet my friend, Nix. If you don't know about it, in this context you can think of Nix as something exactly like Cargo or npm or whatever you want, except that it's fully generic. You can use it to package and provide your Rust crates if you want, but you can also use it to provide the C libraries that your Rust crates depend on, and the C compiler used to compile these C libraries, and the server you're using to run the tests, or the PostgreSQL database you're using for your deployment server. So now "declarative" is not just a vain word: it is fully declarative, down to the lowest level you might want to think of. And what could happen for Bob, if he were using Nix, is that he has his laptop with everything set up, and then he can just use Nix to provide the toolchains. And because Nix is transparent, it won't prevent Bob from still using his editor with all his tools; he just has, on top, the required toolchains to build the code. And that makes a very, very happy Bob. And then the CI system can just use the same Nix, with the same Nix config files, to get the toolchain. The CI then builds exactly the same thing as Bob on his laptop, and the world is wonderful. So now, assuming Bob is convinced that Nix is the great thing, and probably you are too, right? What would that look like in practice for Bob to use Nix? Bob would essentially drop a shell.nix file at the root of his repository saying: hey, I want a shell.
So he calls the mkShell function to get a development shell, saying: I want in my shell this set of packages: Cargo, Rust, OpenImageIO, whatever you want. And the little bit of magic here is this pkgs thing from which everything comes, which you import, and which points to the Nix package collection, nixpkgs, a big repository with recipes for all the packages that exist in Nix. I've hidden that, but you can import it from a nixpkgs.nix file pointing to a very specific commit of the nixpkgs repository, which will pin down every single version of all your transitive dependencies. And now, if Bob wants to use that, he can just run the nix-shell command and be dropped into a new shell in which, for instance, Cargo will be available at a path that is managed by Nix. And once Bob exits the shell, no Cargo anymore; that's what we wanted. But then, I mean, Docker also does that. The bit that Nix has in terms of transparency, the extra coolness, is that the shell doesn't say anything about Vim, Bob's editor. But Bob can still access his Vim, installed globally on his laptop, even inside the shell, which is what you want for development, because it's much, much, much nicer. And when Bob wants some extra guarantees that his shell really is complete, that he's not just accidentally leaking things from his computer, there's an extra pure mode that you can use for building things with more guarantees. So that's Bob's machine, but this talk is about CI, so let's look at the CI side of things. On the CI, Bob would still be using GitHub's CI, because that's the standard at that time. But beyond the initial mandatory boilerplate to fetch the repository and all that, there are only two things that Bob needs in his CI config file. The first one is: install Nix, because it's not yet part of the default GitHub image.
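A minimal shell.nix along the lines described above might look like this (the pinning file and the exact package names are illustrative, not taken from the slides):

```nix
# shell.nix -- a development shell pinned to a specific nixpkgs commit
let
  # nixpkgs.nix would import a fixed revision of the nixpkgs repository,
  # pinning every transitive dependency to an exact version
  pkgs = import ./nixpkgs.nix {};
in
pkgs.mkShell {
  packages = [
    pkgs.cargo
    pkgs.rustc
    pkgs.openimageio
  ];
}
```

Running `nix-shell` in this directory drops you into a shell where these tools are on the PATH; exiting the shell removes them again.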
And the second one is: run the build within a Nix shell, a pure one, because you really want to be strict at that point. Then, if Bob needs to migrate, all he needs to do on his new CI system is install Nix again and copy that exact nix-shell command. And now Bob is fine, and we can all send great cat pictures over the internet. So this was scratching the tip of the iceberg. We could go much further, although I only had 15 minutes, so I won't cover all of that. The first thing we could do to go a bit further is to improve the pinning situation. I hand-waved this: oh, you have this file that pins things down to a very specific version. There are ways to make that much nicer, using a proper lock file like all modern package managers do, so that you have full control over when you want to upgrade, but upgrading is also trivial. More interestingly, you can also take Nix a bit further and not only use it to provide a development or CI environment: you can build your thing fully with Nix, which gives you, first of all, extra guarantees that this really is the right thing, built in the right environment, with the right set of dependencies. But more interestingly, you can then integrate that and use Nix further, for instance to build OCI images on top of it with only a few extra lines of Nix code, or to build AMIs for whichever cloud provider you want to use. At that point, you probably want to care about caching. Nix is pretty great at caching things. If you have a no-op build, it really is going to be a no-op and take you a few seconds, rather than whatever time it takes to build the project. And you can also get that cache to be distributed, meaning that if something is built on the CI, your developers can just reuse the pre-built results, which makes their life quite a bit nicer.
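The two CI steps described earlier, install Nix and run the build in a pure shell, could be sketched as a GitHub Actions job like this (the action names and versions are an assumption for illustration, not from the talk):

```yaml
# .github/workflows/ci.yml -- the only CI-specific parts are the checkout
# and the Nix installation; the build command itself is fully portable
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # step 1: install Nix (not part of the default runner image)
      - uses: cachix/install-nix-action@v24
      # step 2: run the build inside a pure Nix shell
      - run: nix-shell --pure --run "cargo build --release"
```

Migrating to another CI provider then means reinstalling Nix there and copying that one `nix-shell` line.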
And the last thing that could be done to go further is to use NixOS, which is a Nix-based Linux distribution that follows the same philosophy of being purely declarative. That means you have one config file that describes the whole system, and you just rebuild the system from that, which is useful for deployment, because that's infrastructure as code down to the deepest level, not just scratched on top of something that existed before and was never meant to be that way. You can also use that for testing things further. There's in particular a really nice testing framework that allows you to declaratively describe a whole network of virtual machines that you can spawn, run commands on, and then just read the results. And that's really useful as soon as you start wanting to test some weird multi-tenant applications. So I've been talking about Bob all along, but maybe a few words about me, so you know who's talking to you. I'm Théophane. I'm the leader of the Nix tech group at Tweag, which is the open source program office of Modus Create, and it's pretty big on Nix, as you might have guessed. I'm also a maintainer of Nix and a member of the NixOS Foundation board. You can reach me in all these places, and more concretely, you can also meet me right here in the AW building, where we have the Nix stand. And that's all for me.
Testing Go command line programs with `go-internal/testscript`
Good afternoon everybody. Who is a Go developer? Very well. Very nice to meet you. My name is Giuseppe Maxia. I work at VMware. I am not the creator of the thing that I'm presenting today, and my company is not involved in it. I'm just a user, and since it makes my life easier, I decided to share with you what I do with it. So in theory what I want to do is the things that you see on the screen, but in practice what I want to do is to make you curious about testscript, so that you will try it and eventually see how good it is and what you can do with it. The important thing is that you will learn a few basic things. I could talk about testscript for three hours and probably wouldn't exhaust the topic, but since we have only 20 minutes, this is what we are going to do: we are going to cover the basics of testscript so you get to know it. We start with why. Why do we need this kind of tool? Because we have a problem. When we write a command line program, we need to test it, and to test it, we need to build it. Then we need to do something with this build, to shake it up and check that it's doing what it's supposed to do. You can do a lot of things short of testing the command line program directly in the shell. For example, you could test the single functions that are inside the program, and you should do that. But this is not the same as testing the program. To test the program, you need to make sure that the function that works well in your tests is actually linked to the command line command or option that you hope it is linked to, which is not always the case. Also, the input that you put in the function may work beautifully, but since it has spaces in it, it doesn't work on the command line. You really need to test the real thing.
The problem is that to achieve this goal, you need first to compile the program, and second to find a way of testing that program such that it works well with your Go code and is checked in the right way. By checking in the right way, I mean being sure that what you hope to achieve is exactly what happens. Doing this kind of thing in a shell script is not always easy. So let's talk about testscript. What is it? It's a Go library. It's also a standalone tool, and the best thing is that it is developed directly by the Go team: they use it to test the Go tool itself and all the tools that come with the Go distribution. A few years ago it was released in the go-internal package, so you can use it separately from the Go source code. You can use it for mostly anything: if you are developing a command line program in Go, it's much better, but you could also use it to test mostly any command line program, even one that's not written in Go. Of course, if it's written in Go, it helps. Let's see a first example. To test something with testscript, you need two components. The first one is a script that says what you want to run and what you want to get. In the script that you see in the upper part of the screen, there is an echo of "hello world" with a keyword, exec, before it. exec is an internal command of testscript that runs something. Then there is a stdout line with a confirmation: you should receive something that says "hello world" and a newline. Then there is a line with an exclamation point, stderr, and a dot. This line means: I don't want anything on standard error. More about this later. Then you need a component in your Go code. That component just calls testscript.Run, with at least one piece of information: the directory where it finds the scripts.
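Reconstructed from the description (not copied from the slides), the script would look something like this:

```txt
# run the program and check its standard output
exec echo 'hello world'
stdout 'hello world\n'
# the line below asserts that nothing is written to standard error
! stderr .
```

On the Go side, a single test function along the lines of `testscript.Run(t, testscript.Params{Dir: "testdata/script"})` is enough, where the `Dir` field points at the directory containing the scripts.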
I say scripts, plural, because in that directory you can have one or a thousand scripts that do different things to your program. Let's modify the first script a little bit. Instead of expecting "hello world" with stdout, we expect this strange thing that is a regular expression. If you know regular expressions, what we are saying here is: I just want two words, one that starts with h and one that starts with w. Like before, I want standard error to be empty. This stderr with a dot suggests that what we are expecting here is not a dumb piece of text but a regular expression. You can use a dumb piece of text if it suits you, but you can have a much more powerful kind of matching. For example, you can use several statements to describe better what you expect from the output of the program. In this case, instead of putting everything in one line, I put it in two lines. This is often useful if you want to make your test more readable, to express exactly what you are expecting. More importantly, the testscript environment includes something called txtar. txtar is a very simple way of encoding files. To encode a file, you just put the name of the file between double-dash markers and then you put the content of the file below. The file will be magically created in the environment where the test is executed. What happens is that testscript uses a different temporary directory for each script. Every script can run in parallel, and it will be more or less isolated from the rest. This data.txt will exist only in the temporary directory created for this script. You have some built-in commands that you can use directly in your tests, like the exec we have already seen. Then stdout and stderr check what happens after you have run your command, and stdin sets the input for the next command.
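A script using the txtar encoding to create a file might look like this (a reconstruction of the kind of example shown, with illustrative content):

```txt
# data.txt below is materialized in this script's own temp directory
exec cat data.txt
stdout 'hello world'

-- data.txt --
hello world
```

Everything after the `-- data.txt --` marker becomes the content of that file before the script runs.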
Then there is the command exists, which checks that a file exists, and stop and skip, which interrupt the test. If you put an exclamation point before a command, it negates the command, meaning that you expect that command to fail. Other commands are cmp and cmpenv, for comparing with or without environment variable expansion, and then you have env, which sets variables, and this can be useful. Then you have things that are also available in shell scripts, like cat, cd, cp, chmod, mkdir, mv and rm, which work like in a shell. Then you have conditions. A condition is like a command but within square brackets, and you are telling the program that you expect something to hold. For example, with [exec:filename] you are saying: I want to make sure that this program is in the path. [unix] will only be true if you are running on a Unix system. And after the condition, you put a command that will run only if the condition is true. You can check other things, like whether you have a network, whether you are running a specific version of Go, and so on. There are some specific environment variables: WORK is where you are running the test in practice. HOME doesn't exist, but you can set it if you want. And then there is a temporary directory created for each script, but you can change it if you need something different. If you run the test with verbosity, you get a lot of information that tells you what the environment is where you are running, and everything that is executed. If you don't pass verbose, the test is silent; it will just succeed silently, and you will see output only if the test fails. Let's see some more examples with commands and conditions. The first line says: if it's not Unix, skip. The second line says: if it's Linux, exec an echo of "good choice". [exec:sed] means: if sed exists in the path, run an echo saying the command sed was found. And so on.
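The conditions example just described could be written like this (reconstructed from the description; the exact messages on the slide may differ):

```txt
# skip the whole script on non-Unix systems
[!unix] skip 'Unix only'
# run a command only on Linux
[linux] exec echo 'good choice'
# run a command only if sed is available in the path
[exec:sed] exec echo 'the command sed was found'
```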
Remember I mentioned something about compiling the executable. One thing that testscript can do for you is provide a transparent executable. How does it work? Let's say I have this word-count command that I have created in Go and I want to test it. So I run exec word-count, and this command may fail or succeed depending on whether word-count exists or not: if word-count is in the path, it will succeed; if it's not, it will fail. But we want it to always succeed, so we need to tell testscript: this word-count, I not only want it to exist, I want it to be the one I have created, for which I have the code, and to be fresh, not stale. How do we do that? In the test we use TestMain. In case you don't remember, TestMain is something that you put in your test code, and it runs before any test function in that package. The TestMain contains a call to testscript.RunMain, which takes a map of functions that you can associate with names. In this case we have a name, wordcount, and we associate it with a function, runMain, that returns an integer. In the main code, the main function doesn't run the logic directly but calls runMain, and exits with the integer that runMain returns. So what happens here is that the word-count in your script is in reality a call to this function, and the funny thing is that there is no separate executable. Remember, Go is a compiled language, so whenever you run a test, nothing is interpreted like in Python: there is a compilation, the compilation happens in a hidden place, and you get a compiled binary that is available for your test. The good thing is that there is no additional binary: it is the same binary that is used for the test. I'm going to show you an example later. So let's see something more.
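Put together, the wiring might look roughly like this, a sketch based on the description above; it assumes the github.com/rogpeppe/go-internal/testscript module, and the function names are illustrative:

```go
// main.go -- the real entry point just delegates to runMain
package main

import "os"

func main() {
	os.Exit(runMain())
}

// runMain contains the actual program logic and returns the exit code.
func runMain() int {
	// ... word counting logic would go here ...
	return 0
}
```

```go
// main_test.go -- registers the same runMain as the "wordcount" command
package main

import (
	"os"
	"testing"

	"github.com/rogpeppe/go-internal/testscript"
)

func TestMain(m *testing.M) {
	// "wordcount" in the scripts now resolves to runMain inside the
	// test binary itself; no separate executable is built or installed.
	os.Exit(testscript.RunMain(m, map[string]func() int{
		"wordcount": runMain,
	}))
}

func TestScripts(t *testing.T) {
	testscript.Run(t, testscript.Params{Dir: "testdata"})
}
```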
I said before that we have built-in commands, but we may want something more, like custom commands. For example, I want a command that will sleep for, let's say, three seconds. This command is not one of the built-in commands, but I can build it. I can also have a command that checks that all the given files in a directory exist: a checkfiles command whose first argument is the name of the directory, and whose remaining arguments are the file names. How do I do this? When I call testscript's Run, I can pass a map of functions that implement these custom commands. Each function gets a testscript object, a negation flag in case I put a bang before the command, and a list of string arguments. If the command succeeds, I do nothing; if it doesn't succeed, I call the testscript's Fatalf and the command fails. So, for example, this is how I implement the custom commands in my word-count: I register checkfiles and sleepfor. If we look at the implementation of each one, you see that sleepfor is a function that accepts a testscript object, a negation flag and a list of arguments; it takes the argument to determine how many seconds to sleep for, and then calls time.Sleep. If the first argument is not a number, it fails and the command does not succeed. A similar thing for checkfiles: the first argument is the directory, and then I check that a file exists for each of the remaining arguments. Something similar can be done for custom conditions: in addition to the built-in conditions, I can implement conditions that suit my environment better. For example, I may want a condition that says the version of this particular program must be at least 0.2. How do I do that? I cannot do it with the built-in syntax of testscript, so I implement a custom condition.
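A sketch of such a custom command, following the signature described above (a testscript object, a negation flag, and the arguments); the command name and messages are illustrative:

```go
package main_test

import (
	"strconv"
	"time"

	"github.com/rogpeppe/go-internal/testscript"
)

// cmdSleepFor implements a custom "sleepfor" command: sleepfor <seconds>.
func cmdSleepFor(ts *testscript.TestScript, neg bool, args []string) {
	if len(args) != 1 {
		ts.Fatalf("usage: sleepfor seconds")
	}
	secs, err := strconv.Atoi(args[0])
	if err != nil {
		// not a number: the command fails, as described in the talk
		ts.Fatalf("sleepfor: %q is not a number", args[0])
	}
	time.Sleep(time.Duration(secs) * time.Second)
}

// The command is registered through the Cmds field of testscript.Params:
//
//	testscript.Run(t, testscript.Params{
//		Dir: "testdata",
//		Cmds: map[string]func(*testscript.TestScript, bool, []string){
//			"sleepfor": cmdSleepFor,
//		},
//	})
```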
Doing a custom condition is similar to what we do for custom commands. I have a function that receives one string, and it parses that string to determine what to do with the condition. In addition to that, testscript allows us to pass values that depend on the environment: for example, I can pass the current version of the program to the test, or the home directory, and I can do that with a setup function that sets environment variables. So, back to custom conditions: we receive a string and return a boolean and an error. Inside that function, we parse the string, which can be a simple condition or a condition with some arguments that we need to parse to see whether the condition is true. In this case we have a version_is_at_least condition, and I have created a function that checks it, with the elements parsed in the first line of the function: I assume the arguments are separated by a colon, and I use them. For example, version_is_at_least checks that we have at least two arguments, where the first argument is the version and the second is the version to compare with. In the same way, I can have a condition exists_within_seconds that checks that a file exists after at most a given number of seconds of waiting. This is useful, for example, when I test a database system that is supposed to create something but doesn't create it instantly. So I say: I want to see this log file at most 10 seconds after the database starts, and if not, I get an error. Now I'm going to show you a quick demo of something that happens when we run testscript. If I run go help testflag, I get a bunch of options, and these are always used by the tests.
Now, if I run word-count -h on the real executable that I built with Go, you see I have these options, which are just the options of the program. But if I run something a little bit different, a test that I have here, where I run word-count -h like I did on the command line, you see that I get the options defined by the executable, but in addition I also get the options that belong to the test. This shows you that what we are running here is an executable, but it is the executable that Go builds for the test itself, and the side effect is that it contains the command line options that belong to the test itself. Now, back to the presentation. What we have learned today is that using testscript you can simplify the testing of any command line program, and programs that manipulate text are extremely suitable for this kind of testing, because testscript was created for the Go tools, which manipulate text. You don't need a separate executable, because the testscript environment creates one for you, and you can build your own commands and conditions if the built-in ones are not suitable. If you want to see the slides, and a full example of how to use testscript to test a common command created with Go, you can go to github.com/datacharmer/word-count: there is the code for this word-count, all the examples that I have shown here, and a lot more tests that exercise the word-count in most of the conditions that you may have. So you can see how to test this kind of program in reality. Well, in reality I could show a lot more than that, but it would take too long. This is the beginning of a project that I have, to illustrate all the characteristics of testscript using code, and the first step is to show a simple command line program and all the tests that are needed.
There, and on GitHub, you will also find all the resources that you can use to learn more, and if you want to learn more right now, you can ask me outside and I may show you some more examples. Do we have time for questions? Three minutes. Any questions? Yes. Would you still use it when you want to do unit tests, or is it only for CLI input, or would you always create custom commands for that? So the question was whether you can use testscript for unit tests. You can use testscript for mostly anything. I use it for unit tests and I use it for integration tests. For integration tests, I just put some logic before the test to create the environment, so it will run a little bit slower, but mostly... I mean, doing unit tests with it is the easiest thing in the world. If you look at the Go code, most of the unit tests for the Go tool itself are run with testscript, but the integration tests can be run with it too. More questions? Okay, thanks a lot.
How mutation testing got practical
Thank you. We can edit the video later. Let's start very lighthearted then. Maybe a show of hands: who here has heard of mutation testing? Amazing. I can go very quickly through some slides then. Who has never heard of mutation testing? For whom is this a completely new concept? I will cover it for you guys. That's nice. Of course I'm here, but I'd like to promote Stryker a little bit. Who of you is actually using Stryker already? Nobody. One person. Well, that's my colleague; he's also working on it. Cool. You guys are definitely going to learn some stuff today then. I can just start, right? Sure. Welcome everybody to the talk, How Mutation Testing Got Practical. I'm really focusing on the "got practical" in this talk. I will be explaining mutation testing a bit, but I'm really looking deep into the internals, and into why the idea is really old while its practical use is relatively new. So I'll get into that. But first, a little introduction. My name is Jan-Jelle Kester. I'm a software engineering consultant at Info Support, a consultancy organization in the Netherlands. I'm also a trainer there, and a research supervisor, and that last one is very relevant today; you'll hear why soon. If you want to contact me afterwards, you can use the links, I guess, but you can also find me on GitHub. And as I said, I'm here on behalf of Stryker. Stryker is a mutation testing framework for JavaScript and TypeScript, C#, Scala, and hopefully at some point Kotlin; there's a partial implementation there already, and we're working hard on it. You can find us at stryker-mutator.io, and of course I have all these nice socks for you guys. So if you're really good at asking questions and reacting to my questions, you might get some. And otherwise, you'll see us after; we'll have some more.
So in this talk, in the next 25 minutes, because I am going to try to leave room for questions as well, I first want to talk about why we actually need to understand our tests: why is just writing a test not good enough? I also want to go into what mutation testing is, for the people that don't know yet. And finally, and that's the major part, hopefully if I don't run out of time, I'm going to go deep into how we got to practical applicability, how we got to this state where we can actually run mutation testing on our real projects. And that means talking about some state-of-the-art performance improvements. But first, we have to talk about the false sense of security. This is a promotional image I copied shamelessly from the SonarQube website. They show this nice dashboard where they say: well, everything is good, there are no issues, no bugs, it's all fine. And there's even 76% test coverage. Who would be happy with that? Okay, who wants to say why? Why is 76% good, according to you? Lots of green. Lots of green. Large or small socks? I don't wear small socks. You get some anyway. Sorry about that; they're hard to throw in this room. I would say I would not be happy with that. Because, I mean, when we're running our tests, they apparently reach 76% of our code. I don't think that's enough, because more than 20% of our code is not even getting executed by the tests. That's a problem. But even 100% code coverage doesn't actually say much, because coverage only means that code is being executed. In the worst-case scenario, we are only testing that the program does not crash. What you actually want to know is whether your tests check something, and I can very easily get 100% code coverage on a program without writing any assertions, just checking that the test execution does not crash. So we need a way of testing our tests.
And no, we're not going to write unit tests to test our test logic, because that would be stupid; it would never end. We need to be smart about it, and that's where mutation testing comes in. What we're going to do is introduce changes into the production code automatically; there's a tool doing that. We then run the tests again to see whether they start failing, because when the tests start failing, at that point you know that your tests are actually able to catch the bug that we purposefully introduced. It's a form of white-box testing, because we really have to know about the internals of the code to change it and see whether the tests are good enough to catch that. And this is really not a new idea, because there's this nice paper from '79 already, where they talk about a new type of software test. You can actually find it on Google and read it if you want. Even back then, 45 years ago, they were already talking about it. But only recently, and I mean recently in a very broad sense here, because I wasn't even in high school, I think, in the period I'm talking about, did it get more traction. The problem is that in the late 70s it was just a good idea: we did not have the resources to actually apply it in practice. What you see here, the dark-colored bars, are research publications about practical applicability, and they really spike early this millennium. There are reasons for that, mostly, I think, because our computers got fast enough. And why that is important is because of how mutation testing works, what the process behind it is. We start with our source code, and we are feeling very happy about it, of course, because we made all this nice code, we even wrote tests for it, so we're very confident. What the tool then does is introduce mutants into your code. And mutants are just changes.
And for every change that is made, the tests are executed again, and we can have two results. Either the tests start failing, which in this case is good, because we then detected that mutant, we found the bug, so we say that the mutant is killed; or the mutant survived, and that means that your test suite is not complete. When you do that for everything, in the end we get a nice report out of it, like a coverage report, except a bit more detailed. The way that process actually works is that there are operators for it. An operator is basically a transformation: given a certain construct in your code, what kind of changes can we make that might fail your tests? Some examples are here. There are way more, and one piece of original source code on the left can result in multiple mutants. You could, for example, just switch an operator, or throw away a whole block of code. And when we do that for every one of these mutants, we measure something. I already talked about killed and survived, but that's only the ideal scenario, because in practice there might be code that is not even reached by a test. So you can say: well, we have no coverage. Or we have a timeout, and a timeout basically means that the mutant caused an infinite loop; we consider that, okay, with an infinite loop the tests actually failed, so that's kind of killed. But you can also get runtime errors, or compile errors, because we're just introducing weird code changes without looking at what the code is actually supposed to be doing. And finally, mutants can also be ignored, because a developer said: I don't want a test for this, so I don't want to see it in the report anymore. It's just like suppressing warnings in your code projects. Who here does that very often? I do, actually, but... And then, just like with the code coverage score, we want to have a nice metric. We want to know how well we are doing.
And for that, we can compute a score, and we call that the mutation score. And basically what we say here is: we want to express, in a nice number on a scale of 0 to 100%, how many mutants did you actually manage to kill? So how many unexpected changes in your code are your tests actually catching? And that's this nice formula, but basically everything above the line is what you consider killed, and we divide that by everything below the line, which is everything that was actually a valid change. So we exclude the crashes, for example. And that gives you, for your whole code base, an indication of how good your tests actually are. But what if you don't have that many tests yet? Well, we can also compute a variant of the mutation score where we just look at the code that's actually being tested. So you see here that we do not include the mutants without coverage anymore. So one might think, just like with code coverage, that we should strive for a high number. We should maybe have 100% mutation coverage, a 100% mutation score. That would be nice. But there we actually run into a problem, because we cannot actually kill all the mutants. It's very easy, well, relatively easy at least, to make sure all your code is getting executed. So getting 100% code coverage is relatively simple. But because you're still calling functions from the outside, you might not be able to test every single operation happening inside of those functions. And some mutants, no matter how you split up your code base, you can never kill. And one category of those is equivalent mutants, which is also a problem. So given this code, we have a nice for loop, and we say, well, we want to iterate 10 times. You can also write it like this, and it will still work. So this mutant we cannot kill, because even though we changed the code, semantically it's doing the same thing.
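The score formula described above can be written out as follows. This is a toy Python rendering based on the talk's description; the exact categories a real tool counts above and below the line may differ:

```python
# Mutation score as described: detected mutants (killed, plus timeouts,
# which we treat as kind-of-killed) divided by all valid mutants.
# Crashes (compile/runtime errors) and ignored mutants are excluded.

def mutation_score(killed, timeout, survived, no_coverage):
    detected = killed + timeout                      # above the line
    valid = detected + survived + no_coverage        # below the line
    return 100 * detected / valid

def mutation_score_covered(killed, timeout, survived):
    # The variant for sparse test suites: only mutants actually reached
    # by a test are considered, so no-coverage mutants drop out.
    detected = killed + timeout
    return 100 * detected / (detected + survived)

print(mutation_score(killed=80, timeout=5, survived=10, no_coverage=5))  # 85.0
print(mutation_score_covered(killed=80, timeout=5, survived=10))
```

With the same raw numbers, the covered variant is always at least as high, because the denominator shrinks.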
So that's one you might want to ignore, basically. And mutation testing is also very challenging. That's where that practical application problem comes in, because you can imagine that mutation testing, basically changing your code and then running all the tests again, takes a lot of time. And if you have a very large code base, that might actually not finish in a reasonable time. You also need a lot of configuration. The mutation testing tool needs to know stuff. It needs to know how it can run tests. It needs to know how it can verify whether those tests completed successfully or not. It also needs to know things about your programming language, for example, in order to make sure that it rewrites the code in a correct way. So there also needs to be a lot of tooling support to make it work. And for a long time, mutation testing was simply not feasible, or not easy to do. But we're bridging the gap. Not specifically at Stryker — a lot of stuff has already been done, luckily for us. But when we're looking at performance, this is basically the worst-case scenario. The time it takes to analyze a single mutant is basically the time it takes to run all your test cases. So we can approximate it by just counting the number of tests that you have, and then the time it takes to mutate your whole program is the sum of that. It basically means that you multiply the number of test cases that you have times the number of mutants that you need to check, because mutants need to be checked in isolation, because they can influence each other. So you can imagine that already with a very small program this number, the time approximation, can get really, really large. So we need to be smarter about that. We want to make sure that the total time is a lot less, not just a bit less, a lot less than just that multiplication. And basically there are three approaches to get there. It's either to do it faster.
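The equivalent mutant from the loop example just discussed can be reproduced in a couple of lines. This is a Python rendering; the slide's own snippet isn't in the transcript:

```python
# Mutating "<" to "!=" here yields an EQUIVALENT mutant: for a counter that
# only ever increments by 1, "i < 10" and "i != 10" stop at the same point.

def count_original():
    total, i = 0, 0
    while i < 10:
        total += 1
        i += 1
    return total

def count_mutant():
    total, i = 0, 0
    while i != 10:  # mutated condition, identical behaviour
        total += 1
        i += 1
    return total

# No test input can ever distinguish them, so this mutant is unkillable.
assert count_original() == count_mutant() == 10
```

This is why a 100% mutation score is generally out of reach: some surviving mutants are not missing tests, they are semantically identical programs.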
And doing it faster means, for example, that we're going to parallelize it. We're going to use more cores. Take a big machine — nowadays it's relatively simple to get a machine with 128 cores, so we can do 128 things at the same time. You can try to do fewer. So you can maybe try to make smart choices and say, well, certain stuff we maybe don't need to analyze, because we kind of know that it's probably fine. Or you can try to do it smarter. And the study I referenced here really did an analysis of this, it's a literature review. And most of the studies are actually focusing on fewer or smarter. Some common techniques are here — I won't have time to go into detail on all of these. But you can think of random mutation, where we're just randomly picking some mutation that we're going to do. But that's not deterministic, so that might not give you the best knowledge about what the quality of the tests actually is. Parallel execution I already mentioned. You can also do stuff like data flow and control flow analysis and try to reduce the set that way. Or maybe look at AI to try to pick smarter sets of stuff that you're actually going to check. But if you want to use that mutation score as a benchmark, as a comparison — for example, with a pull request, did you actually improve it or not? — or to give you a good indication of how good your tests currently are, you actually need to execute everything. So the approach of just running less doesn't always work. And one big way this process can be sped up is by looking at how we actually change the code. So a very naive approach would, for example, be just changing the source code, running the compiler, running the tests, and then making another change in the source code, and then running the compiler again, running the tests again.
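The worst-case multiplication mentioned a moment ago is easy to make concrete. A toy estimate, with made-up numbers, just to show how fast it blows up:

```python
# Worst case: every mutant is checked in isolation, and checking one mutant
# costs roughly a full test-suite run. Total ≈ mutants × tests × per-test cost.

def naive_total_runtime(num_mutants, num_tests, seconds_per_test=0.1):
    return num_mutants * num_tests * seconds_per_test

# Even a small project blows up quickly: 1,000 mutants against 500 tests
# at 0.1 s per test is already roughly 14 hours.
hours = naive_total_runtime(1_000, 500) / 3600
assert 13 < hours < 14
```

Hence the three families of fixes: run it faster (parallelism), run fewer mutants, or be smarter about which test runs each mutant actually needs.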
And if you have a compiled language with a fairly slow compiler, that's quite problematic, because then it gets really, really slow. A bit better might be bytecode mutation. So, for example, the JVM languages have bytecode, they have an intermediate step, and maybe you can mutate that. Then you only have to compile the source code once; you just change the bytecode and run it. And while that is a lot faster, it's also a lot more complicated, and it has one big downside: not every change that is possible in the bytecode is necessarily something you can do in your source code. Especially beyond Java — if you write Kotlin or Scala, a very simple thing in Scala, for example, can result in a lot of bytecode, and if you're trying to mutate that with the assumption that the Java compiler compiled it, it might come up with mutants that you did not kill, but you cannot actually kill them, because they don't exist in your source code. So, who has an idea how we can do this smarter? Any ideas? Yes. Sorry, what did you say? You compile all the different mutants at the same time, and then you select which mutant is active from the outside. Exactly. Large or small socks? Large. Sorry about that. Thank you, Niko, for testing the throwing device — I didn't test my own system, no, I should have written software to do this. But, yeah, basically the answer that was given was this, and we call that mutant schemata, and this just makes sure that you compile all the mutants into your code once, and then use an environment variable to switch them on or off. So, if you do this, your compiled code is just full of if statements that each check a certain number. And that is complicated, but it is manageable, it's not that hard.
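The mutant-schemata idea can be sketched even in an interpreted language. A toy Python version, with the environment variable standing in for the switch a real tool would generate (variable and function names are invented):

```python
import os

# Mutant schemata in miniature: all mutants are baked into the program once
# and selected at run time via an environment variable, instead of
# recompiling the whole program for every single mutant.

def add(a, b):
    active = os.environ.get("ACTIVE_MUTANT")
    if active == "1":
        return a - b      # mutant 1: "+" replaced by "-"
    if active == "2":
        return a * b      # mutant 2: "+" replaced by "*"
    return a + b          # original code

# Run the "test suite" once per mutant, flipping only the env variable.
os.environ.pop("ACTIVE_MUTANT", None)
assert add(2, 3) == 5          # original passes the test
os.environ["ACTIVE_MUTANT"] = "1"
assert add(2, 3) != 5          # mutant 1 is killed by this test
os.environ["ACTIVE_MUTANT"] = "2"
assert add(2, 3) != 5          # mutant 2 is killed too (2 * 3 == 6)
os.environ.pop("ACTIVE_MUTANT", None)
```

For a compiled language the payoff is exactly what the speaker says: one compile, many mutant runs, at the cost of code full of numbered if statements.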
The main problem is in keeping track of it, but if you assign every mutant a unique number, it should be fine. And this really helps with compiled languages, especially with stuff that's a bit slower, like Scala. And this is actually relatively new. In the world of mutation testing, this is relatively new. It is from 1993, though, so it's the same age as me. I wouldn't say that I am relatively new. But something else that you can do is coverage analysis. That's also something that has been part of Stryker for a long time already. So we actually do an initial run where we just check which tests are reaching which code. So we know, if you change one part of the code, which tests actually need to run and which don't. That way you can also get the number of test cases down a lot, depending on where you mutate the code. For some code you don't really know — if you have something static, for example, defined somewhere, you might not be able to figure out how it is used. Then you might still have to run the whole test suite, but you try not to. Something else you can do is incremental analysis. So you try to diff against some previously stored state and try to guess which mutants you actually need to check. This is very hard to do fully foolproof, fully complete, but you can get to like 99%. And that means that if you make a small change, a small pull request, checking whether your changes are tested properly is relatively fast. Nico here in front gave a talk yesterday in the JavaScript dev room, and he actually showed this feature, and I think there was a difference between like 30 seconds and 3 seconds, something like that, on a small project. Another cool thing is mutation levels, and that's where you actually give the user a choice: do you care about testing it fully, or do you care about performance?
And the choice that the user wants to make can depend on the type of project or the domain. Do you have code where it's really important that every single thing is tested, or isn't it that important actually? Do you actually want to spend the time? Or maybe you want to do a quick and dirty but pretty good analysis for every pull request, while you test it fully in the nightly build. There are different approaches here, and this is actually something that was researched by one of my colleagues at Info Support, who did his master's thesis on it. So it's really cool. But what could be the downside of this approach? Any ideas? Remember, you can get some socks. Yes? So the answer was: the feedback loop gets longer, basically. It might take time to find out that your tests are not that great. That's one. I have another slide, but yeah, another guess. Sorry? It's still very useful because you run the full analysis nightly. Yeah. What size socks do you have? Large. That's too far to throw, just come get them later, I will put them aside for you. What's your size? Small. Ah, damn it. I'm not good at throwing. But yeah, the mutation score that you compute, if you choose not to run all the mutants, might not be comparable. So you really need to take care of that. And the tool that my colleague created analyzes a code base so that you can do this for a specific project. It analyzes the code base and the tests, and it tries to find a nice balance between accuracy and the number of test executions that you need to do. So it tries to see if there are mutants we can exclude that will gain a lot of performance. It speeds things up massively, but doesn't lose a lot of accuracy. So that's really nice. And you can actually find his thesis online. If you go to the FOSDEM page for this talk, you'll find a link there as well.
And this is very hot off the press, actually, so there is not even documentation for it and it's not even merged. It's project Xavier, and that is actually implementing that idea — because it was very theoretical — implementing that idea in Stryker for JavaScript. So if you're really interested in how that all works and what decisions they made, I honestly don't know yet myself, go look at the pull request. And it's also a very cool example — mic is dead. Oh, it's back again. — of a project group from a university, in this case the University of Twente, contributing to Stryker in that way, and they actually built this. Yeah, documentation, as I said, still to follow. And this is also a very new thing that a student is currently working on: doing more static analysis on the code to figure out whether we can analyze multiple mutants in one run of the test cases. But in order to do that, we actually need to make sure that these mutants do not influence each other. So this only works if you know for sure that they don't cancel each other out, or if, given the tests that failed, you can still say with confidence which mutant it was. So this is really, really complicated. Again, I'm not entirely sure of its progress yet — but, question. So the question is whether modularizing your application would help. And it would help, because if your modules are smaller, then the test runs are also smaller; they would take less time. But it only works if a normal pull request, a normal change that you want to do, is contained to one or two modules. If your change instead touches all modules, then it doesn't help. So it might help. But that's also the general advice of "I want to make my CI pipeline quicker": make more repos or smaller modules. So yeah, that would definitely work. Come grab a pair of socks later.
And now it really is time to also start testing your tests. So if you're not using mutation testing in your projects already: it's really usable now. There's been a lot of progress in 45 years. We have better hardware, we have process improvements, and there's actually a lot of research still going on to make it faster all the time. We also have production-ready tooling. There are many great libraries out there. Some of them are more mature than others, some of them are faster than others, and not all of them integrate the same process improvements, for example. But in general, for most popular programming languages there is a tool available, and you can run it in your pipeline. And most of these tools just integrate with the build tool that you'd expect; they use information that the test runner already gives you. So that's great. Here's an overview of some suggestions, but if you just Google your favorite programming language plus mutation testing, I'm sure the first result will probably be the right one. So in summary, when we're talking about mutation testing, we're really talking about testing your tests, making sure that your tests actually test what you expect. And if there's only one thing that you take away from this talk: don't rely on code coverage, please, because it doesn't say anything. A lot of research has gone into performance improvements, and there's lots of research still being done. There are still always students coming to us who are interested in contributing to an open source project with research, so there are plenty of open research questions. And it's also applicable now. So if you're maintaining an open source project, at least consider mutation testing, because especially in open source, where there are many contributors, it's a really good metric to get an idea about the quality of the tests that somebody wrote for a pull request.
If you want to know more of the implementation details, as I said, my colleague gave a talk yesterday in the JavaScript dev room; it will probably be online at some point, so you can go check that out. And that was my talk, so thank you for listening. APPLAUSE Exactly 25 minutes as well, so that went great. Any questions? Yes? How do you determine which expressions to mutate? OK, the question is how we determine which expressions to mutate. Basically, there's a lookup. It does abstract syntax tree analysis, it checks a certain node, and there's a lookup table that says: OK, if we have this kind of operation, these are the mutations that we do. So there's basically a big mapping file with all the options, and that is probably not complete for every standard library out there, but for a lot of the logic, comparisons and stuff like that, you can make it pretty complete. Yes? I couldn't hear the last part, can you repeat that, please? What is the baseline when you start with mutation testing? OK, so the question is what the baseline is when you start with mutation testing. Right, so if you have a new code base, if you have green fields and you implement mutation testing from the start, it's actually relatively easy to get a high-90s mutation score. On an existing project it's usually very hard. So actually, Stryker for JavaScript itself has a mutation score of around 80%, which is actually pretty good. It's really hard to get very high scores. So it's not like coverage: if you're anywhere close to 80%, you're actually doing pretty well, I think. Yes? Yeah? Yeah, so the question is: when the purpose of mutation testing is to make sure your tests are good, and you're doing selective mutation, how do you know you're not missing something? Actually, you don't. You might miss stuff.
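The lookup-table answer above can be sketched with Python's own AST module. The table contents here are illustrative, not Stryker's actual mapping file:

```python
import ast

# A miniature mutation lookup table: for each AST operator node type,
# the replacement operators we could mutate it into.
MUTATIONS = {
    ast.Add: [ast.Sub()],               # "+"  ->  "-"
    ast.Lt:  [ast.LtE(), ast.NotEq()],  # "<"  ->  "<=", "!="
    ast.And: [ast.Or()],                # "and" -> "or"
}

def possible_mutants(source):
    # Walk the syntax tree and count how many mutants the table would yield.
    count = 0
    for node in ast.walk(ast.parse(source)):
        for op_type, replacements in MUTATIONS.items():
            if isinstance(node, op_type):
                count += len(replacements)
    return count

assert possible_mutants("a + b") == 1              # one way to mutate "+"
assert possible_mutants("a < b and c < d") == 5    # 2 + 2 for "<", 1 for "and"
```

A real tool's table is much larger, but the shape is the same: node kind in, list of candidate rewrites out.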
That's the point: because it can take a really long time, you at least have the option of saying, OK, I'm OK with 80% accuracy if it takes half the time, because for me that's a good balance for some use cases. But yeah, you have to accept that you're missing stuff, because you're just not running all the mutants. Then a question about the combination of mutation testing and test-driven development. Well, you have to write your tests first then, but you can only mutate once the implementation is there. So once your tests are green, you can check whether you actually did a good job before writing your implementation, which is kind of strange, actually. Yeah, but that's very nice, actually: if you have to change your tests because of changing requirements and you re-implement part of your code, then mutation testing will check whether your tests are still complete. So that's actually very good, very nice. Yeah, and if you really want to go into that, with property-based testing you're going to test a lot of possible inputs, ideally until you know for sure that it is correct, but that's not feasible yet. Property-based testing is really hard, too. Do we have time for more questions? Four minutes. All the time in the world — up front. The question is, from your experience, if, for example, you have normal unit tests and they run in, let's say, one minute, how many minutes will they run using the framework? So the question is: if I know how long my unit tests run for, how do I know how long mutation testing will take? And there's only one answer: it really depends. It really depends on why your tests take a minute, but it's going to be a lot longer. It's not going to be four minutes or five minutes; it's way more than that. The only way to find out is to just actually run it, because the problem is it really depends on how many mutations can be generated for your specific code, because that is what makes it slow.
And because of all these optimizations, you cannot really predict how long it will take. I didn't hear that one. So the question is how we report it. For Stryker — and you can go to the talk by my colleague, he went into this in a bit more detail — we have a standardized JSON format for that, and there's a nice dashboard. But go watch his talk and you will know more. Up front, yes. So you already run one mutation at a time? Yeah, you have to. If you want to do more, you first have to prove that they will not influence each other, so that you know, if one test fails, because of which mutant it is. Can you reach that with coverage? No, you really have to do data flow and control flow analysis and stuff like that. So that's very, very difficult. But yeah, there's somebody working on it right now. So maybe in half a year's time we'll have some more to talk about. Yeah, Stryker.NET already has an implementation for it, but it's not scientifically proven, so we do not know 100% whether it's correct, but it's at least 95% there. So if that was the last question — if there are any more questions or you want to talk about stuff, I'll be outside in the hall. And if you asked a question, feel free to come grab your socks here up front. There's plenty. Thank you. APPLAUSE
Running systemd integration tests with mkosi
I'm Dan, I work on systemd and I also maintain the mkosi tool, which is a sister tool of systemd, and in my day job I work on the Linux user space team at Meta. So specifically, why do we want to do this? Systemd is a pretty low-level user space project, so running its integration tests is not as trivial as it is for a regular project. Specifically, we want to make sure that we don't accidentally break the host machine, which, when you're running something like systemd, becomes rather easy. We also want to minimize the requirements that are needed to run the systemd integration tests, so that regardless of which machine you actually run them on, or regardless of which machine you're hacking on, you can still run the tests. This is especially important for new contributors, because at the moment the barrier for writing a new integration test is pretty high, and we want to make that lower. We don't want any host system details to leak into the integration tests. Currently that actually happens quite a bit, and it means that you often get a failure, for example on a CI machine, that you can't reproduce locally. And when that happens, it's usually a huge pain to figure out what's going wrong and how to fix it. So we want to try and make these tests more reproducible regardless of the machine that they're running on, so that we avoid issues like this. We want to be able to parallelize them as much as possible, and again, the isolation from the host helps here, because it allows you to run more instances of tests without having to fear that they are fighting over the same resources that might be leaking in from the host. We want to make them easy to run, of course, as I said, for new contributors, and we also want to make them easy to write. So before I go further with the integration tests, I'll give a little overview of mkosi. mkosi is basically systemd's tool to hack on systemd.
So because systemd is such a low-level user space project, you can't just build it from source and then run it, especially not if you're working on the init system itself, because you're very likely already running a systemd on your laptop and you can't simply replace it with another one. And even if you could, if you write a bug and it crashes systemd, then your laptop is suddenly unusable. So we need another solution, and specifically we need to run it in a virtual machine, so that if something goes wrong and it crashes, you can simply kill the virtual machine and it's like nothing ever happened. And this is where we use mkosi. So we use it to build a Linux image that contains systemd compiled from source and installed into the image, along with all the other packages from your Linux distribution that you would need for development. You can then boot it in QEMU, do whatever testing you need, shut down the machine, and then you can submit your patch. So it does a few things, but the primary thing it does is simply run your distribution's package manager to install whatever packages are needed, and then run various tools to configure the image, most of them coming from systemd but also a few from Linux itself. It builds an initrd where necessary, it generates a unified kernel image if you want it to, and then it packages it all up and boots it in QEMU. And we can generate a few different outputs, but the most important ones are probably the disk images and just a plain directory. So what does this look like? If you want to build an Arch Linux distribution image, install systemd and Linux, and then enable autologin, that's how you do it. And this will build that and then boot into it with QEMU. So you eventually end up in a root shell in a virtual machine with systemd installed. You don't need root privileges for any of this, which is another thing we want for the integration tests.
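The slide's configuration isn't in the transcript, but an mkosi config along these lines would match the description. Option names follow mkosi's INI-style format; exact spelling and available options vary by mkosi version, so treat this as a sketch:

```ini
; mkosi.conf — sketch of the described image: Arch Linux, systemd and a
; kernel installed, autologin enabled (illustrative; check the mkosi docs
; for your version)
[Distribution]
Distribution=arch

[Content]
Packages=systemd
         linux
Autologin=yes
```

Building and booting would then be a single mkosi invocation with its QEMU verb, dropping you into the root shell the speaker mentions.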
Currently you need root privileges, so files that get written are owned by the root user in your home directory, which means that you run into weird issues when you try to delete files and stuff like that. So we want to try and do it all without root privileges. As for how you configure mkosi: it's a systemd project, so we do the usual INI file stuff. You can conditionally include stuff with a match section, to only apply something to the Fedora distribution, for example. So we already use this for hacking, but we don't use this for the integration tests. So we use mkosi for manual testing, which is not exactly great, because the automated testing still runs outside of mkosi. This is because the integration tests existed before mkosi was there, and the way this was implemented — they still wanted, of course, that you could run them in a virtual machine. But instead of assembling the virtual machine from distribution packages, the implementation decided to use the files from the host. So similarly to the first-generation tools like dracut, which is where the approach came from, they pick various files from the host when building the integration test image, and then that becomes the image, and in the image you run the test. The problem is that this is completely independent from mkosi, so we have two very different environments: one for hacking, and then another for running the integration tests, which isn't great. Even if you manage to do some manual testing inside mkosi, you then have to somehow translate that to the existing integration tests, which is very hard sometimes. We have a custom test runner using make, so it's all implemented with make and bash and shell scripts. We don't really use any off-the-shelf tooling here, so it can get very nasty. The tests themselves — this is one part that does work well. The tests themselves that run inside the image are implemented as systemd services. So how does this work?
We start the image, and then we pull in the systemd unit, and the unit executes the test. If the unit succeeds, the test has succeeded; if it fails, the test failed. Of course, all the test-specific dependencies have to be added to the image, so this ends up being, I think, a two or three thousand line bash file by now, which is responsible for making sure all the dependencies get picked up from the host file system and put into the image. So it's very complex and I don't think anyone fully understands it. Any customization that you want to do to these test images also requires writing a lot of bash, which again is very hard, especially for new contributors, to figure out. As you can see, this is roughly what you currently do to run a test. So as I said, the files get picked up from the host for the current images, but of course we do need the latest systemd built from source. So you build systemd from source on the host as well, and then what the three thousand line bash file does is it basically takes files from the host, takes files from the build directory, combines them, and you end up with this franken-image that contains God knows what: half systemd built from source, half from the host. That's the image the test runs in, and as you can imagine, figuring out what's going on in this environment can be rather complicated. So what do we want to do instead? We want to reuse as much of our existing tooling as possible. One part is mkosi, which we already use for the environment, and the other part is systemd's build system, which is Meson, which already has test targets that will execute the tests. The primary goal of these was actually unit tests for C or C++ projects, where the test macro in Meson simply executes the unit test.
But there's nothing really specific about it that says it can only be used for unit tests, since all it does is really just run a command and check whether it returns a zero or non-zero exit status. So it's perfectly possible to run integration tests with it as well. So I wanted to make use of that, so that we can simply add a Meson suite specifically for the integration tests, and then running them is exactly the same as running the unit tests. So you make things more similar, and we hope it will generally lower the barrier for running the integration tests for newcomers to systemd. We want to make sure that all the tests reuse the same image. Currently the image gets rebuilt quite often for individual tests, which makes the whole thing a lot slower. We want to get to a point where we can ideally reuse the same image, even the same one that we use for hacking, for the integration tests as well. So we can make use of caching and we avoid having to rebuild the image. And for customization, instead of writing a whole pile of bash, you can just reuse all the settings that mkosi provides to customize the image. And we hope that running an integration test will look roughly like this. A proof-of-concept PR is already available on the systemd GitHub repo where we more or less have it like this, so that an integration test can be executed simply by running meson test, specifying the individual test if you want to run one, or specifying the entire suite if you want to run all of them. Meson supports running tests in parallel, so we want to make use of that as well, to be able to run multiple integration tests in parallel. Of course, since these tests are quite heavy, because they spawn a virtual machine, we can't do as much parallelization as we would with unit tests, but we can probably still run more than one. So how do we run an integration test in a virtual machine with systemd?
There are a few interesting things about running a test in a virtual machine that make it interesting to get the results out. So, for example, if Meson runs a unit test, then the process simply exits with its exit status, either zero or non-zero, where non-zero means that the test has failed. But if you're running an integration test in a virtual machine, when that integration test unit fails in the virtual machine, that doesn't mean that your virtual machine is suddenly going to exit with exactly the same exit status. So you're not able to use that, without some effort, to determine whether the test failed or not. You need to somehow get the exit status of the test out of the virtual machine and to the host, so that it can be interpreted by Meson. So the way we do this in systemd is by using what's called the AF_VSOCK socket family. This is a socket family, like TCP or UDP sockets or Unix sockets, but this one is specifically intended for communication between virtual machines and the host. So you can assign a virtual machine an AF_VSOCK device, and it has a connection ID which identifies the virtual machine. And then you can bind to ports on that in the virtual machine, and you can connect to it from the host. So we use this for passing data from the guest to the host. systemd has the notify protocol, which basically means it can send messages about its status over a socket. And we extended this with support for AF_VSOCK, so that we can send information about the virtual machine to the host, if someone is listening. The most basic use case of this is to tell the host when the machine has finished booting — we send READY=1 then. But it turns out that we can also simply send EXIT_STATUS= whatever the exit status is. And that's how you can get an exit status out of the VM. So this is the exit status of systemd itself. So how do we make this exit status of systemd the exit status of our integration test?
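The notify protocol messages described here are just plain-text KEY=VALUE datagrams, which makes them easy to sketch. A toy Python builder/parser (the transport over an actual AF_VSOCK socket is omitted; READY=1 is a standard sd_notify field, and EXIT_STATUS= is the extension described in the talk):

```python
# systemd's notify protocol is newline-separated KEY=VALUE text sent as a
# datagram. Sketch of building and parsing the messages a VM would send
# over AF_VSOCK to the listening host.

def build_notify(**fields):
    return "\n".join(f"{key}={value}" for key, value in fields.items())

def parse_notify(message):
    return dict(line.split("=", 1) for line in message.splitlines() if line)

# "The machine finished booting":
ready = build_notify(READY=1)
# "systemd is exiting with this status":
done = build_notify(EXIT_STATUS=0)

assert parse_notify(ready) == {"READY": "1"}
assert parse_notify(done)["EXIT_STATUS"] == "0"
```

On the host side, mkosi listens on the VM's connection ID, parses these fields, and can then mirror the reported exit status.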
Well, we have two unit settings for this: SuccessAction=exit and FailureAction=exit. What these two settings say is that when this unit exits, systemd should also exit, specifically with the exit status of that service. This gives us a way to pipe the exit status from the integration test to systemd, which then exits with the same one. It sends it over VSOCK to mkosi, which is listening; mkosi reads the exit status and exits with that same exit status. So you get this whole flow of data through to the host, and mkosi is able to exit with the same exit status as the test. Of course, just getting the exit status isn't really sufficient. If you had to debug a failing test just by looking at its exit status, you'd have a pretty bad experience. So you also need the logs, ideally. Because we run on a serial console, the serial console output is already displayed, so you get that automatically. But we also wanted a way to get the systemd journal off the virtual machine and onto the host. Normally you would just mount the disk image after the virtual machine has finished executing and get the journal out that way. But remember that we wanted to be able to support running these integration tests without needing root privileges, and if you don't have root privileges, then you can't mount any file system in Linux. So we can't mount the disk image after the virtual machine has shut down; we need to get the logs out while the virtual machine is running. How do we do this? Well, again with AF_VSOCK. In the next version of systemd, most likely, we're going to add another forwarding mode to systemd-journald so that it can forward its logs over an AF_VSOCK socket. So again you can have something listening on the host on AF_VSOCK, configure journald to send its logs over this AF_VSOCK socket, and then simply store them on the host instead of in the virtual machine itself.
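The service side of the exit-status flow might look roughly like the following unit (a sketch; the ExecStart path and unit name are hypothetical, while SuccessAction=/FailureAction= are the real settings named above):

```ini
# integration-test.service (sketch): when the test service exits,
# systemd itself exits with the service's exit status, which is then
# reported to the host over AF_VSOCK via the notify protocol.
[Unit]
Description=Run one integration test
SuccessAction=exit
FailureAction=exit

[Service]
Type=oneshot
ExecStart=/usr/lib/systemd/tests/run-integration-test.sh
```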
Or do both, because having the logs available in the virtual machine as well can be useful for debugging. To listen on the host, we have this little program called systemd-journal-remote. You can configure it to listen on any address; this can also be a Unix socket and so on. And it will simply store the logs in the directory that you specify. Once it's done, you simply run journalctl, specify the directory that the logs are stored in, and you get the logs of the virtual machine. You can access them, you can read them, you can debug what's going on. Or you can simply store them in whatever CI system you're running the tests in. Then of course we need to be able to debug any failing tests. The test is started via the serial console, but when meson is running a test, it doesn't give you interactive access to the serial console. So we need a way to get into the VM without needing the serial console. The regular solution for this is SSH, of course. So we want to provide SSH access to the VM, but we don't want to tie this to the network of the VM, because we might be running very specific networking tests. These might involve multiple VMs that need a very particular networking setup, and that network setup might not allow access to the VM via SSH. So we want to use a different transport, and again we can just use AF_VSOCK for this. This just merged; it will be in the next release of systemd. When systemd is started with an AF_VSOCK device, it can now detect this during early boot via a new generator, and it will bind port 22 in the AF_VSOCK family to a socket unit, which will start sshd when connected to. This allows you to use sshd with VSOCK, so you can connect to the connection ID of the virtual machine from the host using SSH, and you will get an SSH session in the VM without needing to configure the network.
To provision your public key, we use systemd credentials, which can be provided to the VM using SMBIOS, to put your SSH public key into the VM in the correct location, in .ssh/authorized_keys. So you don't need to do anything: you don't need to enter a password or anything, just SSH in, it will do the usual public-key authentication, and you get your root shell in the VM and you can debug whatever you want. To make this nice to use on the host, we can drop in an SSH config file that configures a proxy command for SSH. We take ownership of the unix/ and vsock/ host name prefixes, so you can do ssh vsock/<connection ID> to get an SSH session into that virtual machine. This is what we're going to try and use to be able to debug any tests that are going wrong. That was all I had to say. I'll put up a link to the project; go take a look. We want to use this for the integration tests, but mkosi is of course useful for a lot of other things as well. If you need it for building Linux images, please take a look. I'm always happy to add new features, or you can join the Matrix channel, which is linked in the readme, and ask any questions, and I'll be happy to answer them. Thank you for listening.
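The SSH drop-in described above might look roughly like this (a sketch; the helper binary name and path are assumptions based on what the talk describes, not verified against the shipped file):

```
# /etc/ssh/ssh_config.d/20-systemd-ssh-proxy.conf (sketch)
# Take ownership of the unix/ and vsock/ host-name prefixes and route
# them through a proxy helper instead of TCP networking.
Host unix/* vsock/*
    # Helper that translates the host name into an AF_UNIX or AF_VSOCK
    # connection; the exact path is an assumption here.
    ProxyCommand /usr/lib/systemd/systemd-ssh-proxy %h %p
    ProxyUseFdpass yes
    CheckHostIP no
```

After that, something like `ssh vsock/42` would open a session in the VM whose connection ID is 42, with no guest networking configured at all.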
Making it easy to get to SLSA level 2
Hello, hey everyone. Welcome to Making It Easy to Get to SLSA Level 2. Thanks for sticking around; it's the last day of the conference and nearly the last talk. So, yeah, today we're going to be talking about SLSA and compliance, and hopefully how you can meet those compliance requirements in practice. My name is Theophilus, and I'm going to be talking about Chalk, an open source framework we developed at Crash Override. I come from a security background, and every time I hear the word compliance I get bored to death. It's kind of like a box-ticking exercise. But hopefully we can discuss this today and see how you can do this in your own organization easily while also getting value for your org. Before jumping into the topic, let me quickly set the scene and talk a little bit about software supply chain attacks. In a software supply chain attack, the attackers compromise the build system or the package registry and get a foothold there. And over the past years we've been seeing an increase in these types of attacks. There was a report from Sonatype that said that since 2019, year after year, we've been seeing a sevenfold increase in this type of attack. A report came out in 2022 that said supply chain attacks had surpassed malware-based attacks by 40%. And last year, around two out of three US businesses were impacted by a supply chain attack. You can take these numbers with a grain of salt, but the fact of the matter is there is a surge in these types of attacks. And this popularity on the attack side drives policy changes. So in May 2021, there was an executive order from the White House that said software must be provided and purchased with a software bill of materials and provenance information. Quick show of hands: how many are familiar with the terms SBOMs or provenance? Cool. How many of you have been deploying these in your pipelines or your organizations? Okay, great. There's also an SBOM devroom.
So today we're going to jump into these topics real quick. We're going to discuss some concepts and then talk about the challenges people face when trying to deploy these things to production. Then we're going to talk about Chalk and how Chalk can help you solve these problems, but also achieve many, many more things, and hopefully have a discussion at the end. For those of you who are not familiar with software bills of materials, or SBOMs, you can think of one like a list of ingredients for software. You go to the supermarket, you see a package, you read the label, and you get a list of all the ingredients that are in there. An SBOM is pretty much the same thing, but for your software applications. You get either an XML or a JSON document, and from that you can get a list of the packages, their versions, etc., etc. When we're talking about provenance, what we're really talking about is: how did the artifact get here? Who created it, who packaged it, how was it modified along the way until it actually reaches the user, basically. That is all good, but if we think about a list of ingredients, what are the guarantees that what we get is actually what we're promised? For instance, you could have an NPM package, and you can generate an SBOM for your NPM package saying that these are the ingredients that are there, but then an attacker could get a foothold somewhere in your build pipeline and inject something that was not originally there. So another key component here, besides generating the SBOM and the provenance information, is really having some attestation around the integrity of the generated artifacts. Anyone should be able to cryptographically verify that at least what we're promised has not been tampered with, and that the contents of the SBOM came from the original author, etc., etc.
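As a concrete illustration of the "list of ingredients" idea, here is a minimal sketch of reading the components out of a CycloneDX-style JSON SBOM (the document below is a hand-made toy, not the output of any real tool):

```python
import json

# A toy CycloneDX-style SBOM document (hand-made for illustration).
sbom_json = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "components": [
    {"type": "library", "name": "express", "version": "4.18.2"},
    {"type": "library", "name": "lodash", "version": "4.17.21"}
  ]
}
"""

def list_ingredients(doc: str) -> list:
    """Return (name, version) pairs for every component in the SBOM."""
    sbom = json.loads(doc)
    return [(c["name"], c["version"]) for c in sbom.get("components", [])]

packages = list_ingredients(sbom_json)
```

Note that this only reads the claimed ingredients; as the talk goes on to say, you still need attestation to know the claims weren't tampered with.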
And what's really important here is that we need to have some clear assumptions around the threat model, aka what can an attacker compromise and what are the security guarantees we're getting depending on that. Do we require the attacker, say, to compromise our build pipeline, or do we require the attacker to get a foothold on developers' boxes? What's our threat model? That's really important, because if we think about DevOps pipelines in practice, you have many components: developers are pushing code, that code ends up with some provider like GitHub or GitLab, you have open source packages, you have container images, you have infrastructure code that modifies this code and pushes it out, and then somehow it ends up on the server or in the cloud. And as we're building out this whole graph of components, attackers could get a foothold at various places. So this is where SLSA comes in. SLSA is an OpenSSF project, and it essentially gives us a framework to talk about the security posture of our applications, with different levels for the supply chain security of our artifacts. At level one, essentially all we're doing is providing information about how the package was built, so we have a report, but we don't really have guarantees about whether the report has been tampered with or not. At level two, we get signed provenance. Essentially, at this point we can say that once the report has been generated, the artifact has not been tampered with, but you don't get guarantees around the build platform, etc. And as you move up the levels, you get stronger and stronger security guarantees. So today we're going to be talking about Chalk and how easy it is to get to SLSA level two if you deploy Chalk in your build pipelines. So how does one start to do this? This is good, we all want to improve the security posture of our applications, we want to deploy these things in our organization, so how does one start?
One could think that, okay, that's surely a solved problem, there must be tools for this already, and you're right to some extent, but the tooling ecosystem is really in its infancy and it's largely fragmented at this point. So it's not necessarily obvious to a newcomer which tool or framework they should pick, and even if you select a single space, like SBOMs, the outputs of different tools are inconsistent with each other. One tool gives you a certain report, another tool gives you a different report, and there might be assumptions around how these things should be set up and how you should be deploying them. So it's not straightforward, and what's really, really hard is thinking about how you can do this at scale. If you have a large organization with multiple repositories and different providers, how do you make it easy for your teams to just set this up and let it run, and have it be easy to consume the data, and also generate the data that is of interest to you? So yeah, it's not an easy problem, and hopefully Chalk will help here. The main idea behind Chalk is really that we have some metadata that we care about, and we want to embed that metadata, which we call chalk marks, into the artifact. The artifact could be a binary or it could be a Docker image, and you want to embed this metadata into the artifact at build time or post build time. So you could have an ELF file on a box and then you can inject metadata into that ELF file and say, okay, this was indeed here; you can include information that you care about, like the security settings on that box, for example whether AppArmor is enabled, or what the users or the network connections are. You embed that metadata in it, and now that artifact is tagged, and once you have that tagged artifact you basically let it go and it gets deployed somewhere in production.
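To make the embed-then-extract idea concrete, here is a toy sketch of "marking" an artifact. This is not how Chalk itself encodes marks on disk; it just appends a JSON blob behind a magic separator and finds it again later:

```python
import json

# Toy illustration of a "mark": append JSON metadata to an artifact's
# bytes behind a magic separator. Chalk's real on-disk encoding differs;
# this only demonstrates the embed-then-extract idea.
MAGIC = b"\n#CHALK_MARK#"

def add_mark(artifact: bytes, metadata: dict) -> bytes:
    """Return a copy of the artifact with the metadata appended as a mark."""
    return artifact + MAGIC + json.dumps(metadata, sort_keys=True).encode()

def extract_mark(artifact: bytes):
    """Return the embedded metadata dict, or None for unmarked artifacts."""
    if MAGIC not in artifact:
        return None
    _, _, raw = artifact.rpartition(MAGIC)
    return json.loads(raw)

binary = b"\x7fELF...original artifact bytes..."
marked = add_mark(binary, {"commit": "abc123", "builder": "ci-runner-1"})
mark = extract_mark(marked)
```

The point is that the metadata travels with the artifact itself, so anything that can read the file later (a scanner, the entry-point wrapper) can recover it without consulting the build system.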
So think of Chalk pretty much like AirTags for your code: you embed the tags and then you're tracking them across the ecosystem of your infrastructure, and once the artifact actually gets executed, what's interesting is you can get back reports with the metadata that you configured there. So essentially you can scan what has been out there in production, you can grep for all this metadata that has been embedded in the artifacts, or you could configure the artifacts in some cases to phone home and give you the report themselves. And you could do this once, or you could do it periodically, for instance configuring Chalk to send you heartbeat reports. So let's see this in action. I have set up here a very, very basic git repository, and all this repository does is deploy a lambda function. We have the main code of the lambda function here, and as you can see there's nothing really special to it: we just sleep and return a 200 OK. We're building this lambda function using a Dockerfile, and there's nothing specific to Chalk in this Dockerfile; we're pulling from a well-known image. And we're actually building the lambda using a GitHub action. During the GitHub action, we check out the code, we set up the build environment, and then here we're setting up Chalk. So we're telling our build ecosystem that Chalk should wrap this build of the image and embed metadata in it, and what sort of metadata we choose to embed is completely up to us; Chalk comes with defaults. These are the only lines of code we ever need to add to our build pipeline so that Chalk can embed SBOMs and, you know, provide cryptographic guarantees around the integrity of the generated reports. We're also creating attestation manifests using Sigstore, for those of you who are aware of that framework. So cool, let's go ahead and trigger this.
I'm going to go here into the actions, re-trigger the action once more, and what we're doing here is building a Docker image and telling Chalk to encapsulate the whole build and inject metadata into it. And that metadata, we can choose how we want to emit it. We can choose to emit a report on the CLI or to some destination that we care about, like S3 or some server. I have here a dummy server that's running and waiting for reports; there's nothing here currently. And I'm going to go back into one of the previous actions and show you a report that was emitted by Chalk on the CLI. During the build, after we've actually pushed the image, you can see down here we have a Chalk report, and this is basically a JSON file that has the metadata we care about. Here we know that an image was built, the datetime, the Dockerfile path, the exact contents of the Dockerfile, the commit ID, the author and the committer, but you also get a cryptographic signature attesting to the integrity of this report, essentially. You get interesting things like the environment variables and arguments; you can configure this to be however you like it. This is generated on the CLI, but we can send the exact same report, or variants of that report, to other destinations. So going back here to the action we just triggered, hopefully once this completes, we will see a report populated to our server. Not only will we see a report here on the CLI, but we'll also get the metadata at the endpoint we configured. What could possibly go wrong? It's just a live demo here. And you can make this as fine-grained as you like. Chalk supports plugins, so if you want to run, say, your static analysis tools like Semgrep or CodeQL, you can embed their output into the report alongside the other metadata that you're tracking. So it looks like this finished; we did get a report here.
And if I go here, essentially we see that we got a build operation, so that got sent over to our server. This is essentially just a pre-defined rendering of the JSON, right? You can send it wherever you see fit and render it however you would like. But we get some interesting information. We get a signal that we collected SBOM and signing data, and indeed, if I scroll down here, I do see that I have the full SBOM, and I can fetch information about the attestation of the artifact. But I also get a bunch of interesting metadata that might not have been obvious just by watching the build. I see here that the cloud provider is Azure, and we have information about the actual Azure instance metadata in which the build happened. Essentially what happened here is that GitHub runs their machines on Azure in this particular instance, and so the build was triggered on one of the Azure instances. So that's nice. We can also see the build command, and you can see here how the normal build command is now wrapped under Chalk. So Chalk is in charge of the build and embeds the metadata into your image. That's nice. What we did here is we pushed this demo lambda, essentially; you can see this was modified just now. So I'm going to go ahead and execute the image, and hopefully, if things work as expected, the lambda will execute and I'm going to get a second report here. And that second report is an exec. If I zoom into the exec, you now see that the command that got executed is actually running within the lambda environment. So Chalk is wrapping the entry point of the execution for that Docker image and tells you, hey, this chalk mark that you inserted, the metadata that you've captured, is still here, but now I'm executing in lambda. And indeed, if I go here and look at the cloud metadata, you can see the region, the role ARN, the account ID, et cetera, et cetera.
So with this, we can basically say that the metadata we injected in our build pipeline is still present wherever we deploy the image, and we can keep track of where the thing actually executes. So if I tap into this chalk mark, sorry, let me zoom out here, I can see that there are two reports associated with it. One was a build and the other one was an exec, for the exact same hash. So the exact same hash that I built on one machine has been executed on the other machine. So what did we do here? First of all, with four lines of YAML in our GitHub action, we generate and distribute SBOMs. We also have provenance information, because we can track where the build happened and where the actual image got executed. And we also get artifact integrity; in our case we're using cosign, but you could use different frameworks to do this. So essentially, we're meeting the basic requirements here, we're checking those boxes, and that's with minimal effort, in my opinion. All you need to do is configure whatever destinations you want these reports to be sent to. So you say, okay, that's cool, what more can you do? Let's think about typical scenarios that happen in live production environments. You might be on call for a given service, and you get a page in the middle of the night, and there is some issue. There's a bug, there's a vulnerability, something is off, and you want to figure out, okay, what's the component that's responsible for this? You could have, say, a pretty complex application with multiple teams pushing code. For large organizations, usually the pattern for resolving these issues is that you cut tickets to the teams, you wait for somebody to look at it and say, hey, that's the responsibility of that other person. Potentially you grep through the code and ask, what was the last commit? Or you have metrics, and you try to tell from your metrics what changed.
And you try to correlate it to something else. If you're using Chalk in your build pipelines, it's much, much easier to correlate which exact version of an image is running where, what the components are, and potentially who the code owners are, et cetera. Because if we go back here, you see that we have things like the committer and the commit ID. Since we have the commit ID, you can start building these profiles about ownership incrementally as you go. So instead of having a process that could potentially take a couple of hours to determine the root cause of an outage or an issue, you can now have this in a few clicks, hopefully. Another common use case is application inventory and change management. Say, for instance, you're part of a large organization and you want to deprecate a framework, say AngularJS. AngularJS is running in production, and you want to figure out: okay, what is the impact? How many teams are using it? Is the code even live? When was the last time it got executed? You can get reports around these things. More importantly, you can see how applications change over time. Many of the people we've been talking to have processes where, for instance, they do a sort of change management meeting: once a week, they say, okay, what has changed? What has been deployed? Do we need to go through a security review? What's the exact list of changes? And that process is manual to a large extent. Using Chalk, you can automate this, because you can generate an exact report of the diff, and you can get integrity guarantees around that report. But more importantly, besides these things, you can do much, much more. It's not only containers that you can chalk; you can run essentially tools of your choosing, or you can submit custom plugins for metadata surfacing.
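The correlation described above can be sketched like this: matching build and exec reports that share an artifact hash to answer "where is this commit running?" (the report fields here are simplified stand-ins, not Chalk's real key names):

```python
# Toy reports, loosely modeled on the demo: one build and one exec
# sharing an artifact hash. Field names are simplified stand-ins.
reports = [
    {"op": "build", "hash": "sha256:aa11", "commit": "abc123",
     "committer": "dev@example.com", "host": "github-runner (Azure)"},
    {"op": "exec", "hash": "sha256:aa11", "host": "aws-lambda eu-west-1"},
    {"op": "exec", "hash": "sha256:bb22", "host": "k8s-node-7"},
]

def where_is_it_running(reports, commit):
    """Map a commit to the hosts currently executing its artifact."""
    hashes = {r["hash"] for r in reports
              if r["op"] == "build" and r.get("commit") == commit}
    return [r["host"] for r in reports
            if r["op"] == "exec" and r["hash"] in hashes]

hosts = where_is_it_running(reports, "abc123")
```

Because the build report also carries the committer, the same join gives you an owner to page, which is the few-clicks root-cause flow the talk describes.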
Currently, the open source implementation that we have on GitHub only supports entry point wrapping for containers, but we're working to expand Chalk's functionality with more and more features. You can still chalk ELF files and PYC files and JARs, et cetera. So yeah, the framework is out there. It's written in Nim; Nim is a very, very cool statically compiled, type-safe language. So any fans of Nim here, feel free to contribute, and we're welcoming feature requests. And I think that's my talk. I'm happy to take questions or discuss this. Thank you. Thank you. Yep. You talked about large organizations? Yes. So the question here is that I brought up large organizations, but can I give a concrete example of what use cases this would apply to, right? So just to make this clear, this does not only apply to a large or a small organization; it applies to everyone. It's just that if you have a single application with a single repository, you pretty much know exactly what version is deployed where. The complexity of these situations starts to be amplified the bigger and bigger you get, right? So if you have, say, a web application, and that web application has multiple components that are live at any given time, or say you have a distributed service and you have microservices running, you have multiple teams committing different versions of their components at any given time. And potentially some of these teams change, so you could end up with a repository having outdated code, right? There's an incident now, something has failed, and you go into the code and ask, what was the last commit? It was six months ago. The committer of that application has left the team, potentially has left the company. Who do you contact? How do you know that part has actually gone outdated?
But if you keep track of your builds and your executions, you now have the ability to tap into the whole history, all the provenance of a certain artifact, and surface metrics that you care about. So if you cared about, say, show me all the components that haven't been updated in the last month, or that haven't been executed in the last month, it's way, way easier to do this. I'm not sure if I answered the question. Yeah. Yeah. So you showed how to do it in a GitHub action, but could you generalize and do this manually, on-prem or in a different pipeline environment as well? Yes, yes. If you go now and visit the GitHub repo or the website, you can just download Chalk; it's a binary that runs. You can run it locally and embed metadata into any artifact that you care about on your machine. So you can download it on your laptop and scan all the ELF files in your system, or the JAR files or whatever, or even scan a whole directory. You can specify whatever you want, and then you can configure the metadata that you care about, and it will be embedded there, and you can then extract it. So you don't necessarily need to have Chalk report back to you or run it in a GitHub action; you can just use it to embed information and then surface it. So you can both insert and extract, if that makes sense. Yep. So that's a great question. I think one of the big benefits of Chalk is that you can embed information even in generated images or artifacts, right? So say you have some third-party software, like a library that you're consuming. Perhaps you don't know where it came from, but you know that you saw it on a certain machine with a certain hash, and then you can use Chalk to encapsulate that information for your artifact. And basically, if you run a query across all your applications that are, say, importing a given library, you can see all the versions of that library that are running.
So you can start building these application inventories very easily, even if it's third-party software. What about a third-party container at the bottom of the stack? It's still the same premise, right? Because if you have a container, you have several layers, so you can start saying, okay, these are the layers I have seen here. Potentially you don't have the full information, but you can at least attest that, okay, this is the hash that I have seen. We are starting to add support to actually wrap entry points of different layers if you'd like to, so you should be able to interpose yourself in another layer should you like to, but that's not currently available yet; it's not in the open source implementation. Yeah? How does Chalk play together with reproducible builds? Do you need to include the compiler in the binary? That's a great question. No, you don't need to include any compiler; all you need is the binary. And if you have a reproducible build in your pipeline, you should still be able to achieve the same guarantees. For instance, if you have, say, an ELF file, we'll embed metadata into a section, and that will survive stripping and all that. So once you have a build, then assuming you know that you're running with Chalk, right, and you don't modify the thing later on inappropriately, you would at least know that you're running with Chalk, so that if you're getting a report, that report has not been tampered with. Yep? Let's imagine I have a JAR which I have chalked, right? Then I modify it, and the zip changes, and then I chalk it again. At which point, how do you follow the code? Right. So the question is: suppose you have a JAR, you chalk it, then you modify it, and then you chalk it again. How does the tool help you here? Chalk does not limit you to a single chalk mark within a binary; you can wrap chalk marks within chalk marks within chalk marks, essentially.
So if you're making modifications and you want to, you can maintain information about past chalk marks. Or if you're building a JAR, say, out of other JARs and those have chalk marks, you can use this information and embed it into your final JAR, if that makes sense. So you can wrap and encapsulate all the metadata from all the components. Do I need to do something special for this? Well, it wouldn't be more complex than just saying chalk insert. Chalk would take care of all the build dependencies and make sure it injects them automatically. At least that's where we're heading. It might not be fully featured for all the flavors of what can be chalked currently, but that's where we want to go, for sure. Cool. Thank you.
Perl at PayProp
Thank you. This is a QR code for the slides, and also all of the talks I reference in this talk. And yeah, thank you Theo for organizing the Perl and Raku devroom. Can you all hear me okay? Yeah, perfect. I'm going to talk about Perl at PayProp, which is the company I work for, an established company, been around for almost 25 years now. And briefly about me; I don't normally do this, but I see a few faces I don't recognize, and I'm sure people don't recognize me as well, so I thought I would do this. I'm a senior dev and head of processing at PayProp. I've been there for 10 years. I've been a developer for just over 20 years. I've worked with various languages, but mostly Perl. But I've only worked for three companies in those 20 years, so I've kind of seen tech stacks grow and shrink and change. I'm a CPAN contributor, LEEJO on CPAN and MetaCPAN. And I'm a London and Swiss Perl and Raku workshop organizer, so come and talk to me if you're interested in any of those. We're searching for a venue for this year's London Perl workshop, so if you have any ideas, come and talk to me. And I'm a regular speaker at other Perl workshops and conferences, and often I'm helping out with the video. I occasionally blog on Perl; I prefer to do long-form articles rather than technical, this-is-how-you-use-this-module kind of posts. And I run the Perl events Instagram account, but that's about the limit of my social media interaction. And I'm a FOSDEM newbie, so it's my first time here. We usually have a work event that runs at this time of year, so it always clashes with FOSDEM, and I've never managed to make it; this is the first time it hasn't clashed. So, about PayProp. That's kind of what we look like, the public-facing part of the site at least. We're a payment and reconciliation platform for the residential letting industry.
And our core business value is that we turn things like this, and this is one of the nicer ones to deal with, this is a SWIFT format, into things like this. So we put interfaces and automation on top of time-consuming payment reconciliation flows. And this literally saves our customers hours, days, weeks of time, so we're really, really useful to them. The keen-eyed of you might see cgi-bin/script.cgi in that URL. So yeah, we've been around for over 20 years, so we have some old code, bit of an understatement in places. But the Perl we are using is relatively modern, 5.32. We build our own Perl, and we don't use the vendor-supplied Perl or the system Perl. We don't do anything special with it; we could in theory compile it with different flags, but we don't do that. So we get the defaults, which means we don't get things like ithreads, because if you use a vendor-supplied Perl, you get things you probably don't need. Yeah, the key is that it's not the system Perl, so we're not tied to any particular version of an OS or package or whatever, and we can apply updates and patches as necessary. We should be on 5.38 by now; we tend to trail by a major version. I've been spread a bit thin, so we haven't managed to get to the latest, but that's on the roadmap for this year. Yeah, and it gives us dependency control, which is critical. If you've been paying attention the last couple of weeks, there have been a couple of critical CVEs against a couple of spreadsheet parsing modules, so we could get those updates out quite quickly. Loose coupling, so yeah, like I said, not tied to the OS or anything like that. And the key is it's everywhere: we have the same version of Perl and the same versions of modules from dev through CI, staging, and demo, all the way to production. Otherwise you get interesting debugging problems.
And the issues and challenges around that? Well, probably the ones you've all heard: "you still use Perl?" or even "what is Perl?". And the bus factor, which is becoming a problem with some of the Perl dependencies. So yeah, it's a 20-year-old, 22-year-old app, so we are in the process of migrating from CGI.pm to Mojolicious. A 20-year-old app has some legacy, a bit of an understatement really. This is an ongoing task, and we're about two-thirds complete in terms of the number of requests to the site. We have a lot more pages than we really use after 20 years. It kind of inevitably happens that people write features and functionality that end up not being used, and we've got hundreds of pages, and really only 20% of them are actively used. So a lot of them will never actually end up getting converted. And one of the ways we did this in one of our other apps is using the CGI plugin for Mojolicious. We decided not to do that with PayProp because we're using Apache on the front end anyway, so we can proxy to a Mojolicious server, or just run ExecCGI if it's a CGI script. So we're not serving the CGI scripts from Mojolicious using a plugin; there's no real value there, to be honest. So that's kind of what the setup is. I actually gave a talk about this almost a decade ago, so there's a link there to that talk, which has some suggestions for how you can do this if you're using CGI and you want to use Mojolicious, what the options are. But it was 10 years ago, so it's a little bit out of date now, because Mojolicious moves fast, and that is one of the challenges in using it: they say that you can always count on backwards compatibility, but they will deprecate and remove features within a three-month window, which is not really backwards compatibility. So you just have to be aware that if you haven't done an update in a while, things might break. And we're adding an ORM.
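The Apache-fronted setup described above might look roughly like this; the paths, port and virtual-host details are illustrative assumptions, not PayProp's actual configuration:

```apache
# Hypothetical sketch: converted routes go to a Mojolicious server,
# the remaining legacy CGI scripts keep running under Apache as before.
<VirtualHost *:443>
    # Legacy pages still served directly as CGI
    ScriptAlias /cgi-bin/ /var/www/app/cgi-bin/
    <Directory /var/www/app/cgi-bin>
        Options +ExecCGI
        AddHandler cgi-script .cgi
    </Directory>

    # Migrated routes proxied to the Mojolicious app (e.g. hypnotoad on :8080)
    ProxyPass        /app http://127.0.0.1:8080/app
    ProxyPassReverse /app http://127.0.0.1:8080/app
</VirtualHost>
```

This is why the Mojolicious CGI-serving plugin adds no value in their case: Apache already decides per-URL whether a request is legacy CGI or a migrated Mojolicious route.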
And I know this can be a contentious issue, which I kind of find surprising. I'm just tired of writing this kind of stuff. And this is simplified, about as simple a query as you can do. You select some columns from the table, prepare the query, make sure you have the error handling, execute it, grab a hashref. I just want to write that more declaratively. All the stuff we can get for free is there. And we can still drop down to vanilla SQL if we want, and we do do that. We have some pretty hairy reporting queries, and we're not writing them in ORM-speak, because they're big enough already; if you start using the DSL of your ORM for those, they become an obfuscation. And the other reason we're doing this is it allows us to kind of isolate some of the legacy issues in the schema. Again, 20-year-old app, organically grown schema, you can have some issues like this, and we can nicely abstract them away in the ORM that we're using. Post this sort of stuff on Hacker News and people say, you know, just fix your schema. And things will break if you do that, and you might not even see it. And it's like, no, we're not going to risk the business by breaking stuff. We don't move fast and break things; we want to keep our customers happy. And then another suggestion is, well, why don't you write your own? But why would you do that? We could abstract all our logic into a homegrown ORM, but it would be a half-done one, full of the bugs that all of the available ones have already ironed out anyway. And yeah, we're using DBIx::Class. Very feature-rich, but not dogmatic about its use; like I say, you can use it in the ways you want to use it. Some of the issues and challenges around that: well, there's a learning curve, a big learning curve, especially if you haven't used an ORM before. But the manual is very good, and there's lots of stuff on the web you can find about how to do quite bespoke things with it. Currently it's listed as unmaintained; I would say stable rather than unmaintained.
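The slide being described isn't reproduced in the transcript, so this is an illustrative sketch of the contrast, not PayProp's actual code: `$dbh`, `$schema`, and the table and column names are all assumed, and the fragment won't run without a configured database.

```perl
# The hand-rolled DBI version: prepare, error-check, execute, fetch.
my $sth = $dbh->prepare('SELECT id, amount, status FROM payment WHERE id = ?')
    or die $dbh->errstr;
$sth->execute($payment_id) or die $sth->errstr;
my $row = $sth->fetchrow_hashref;

# The DBIx::Class equivalent: one declarative call; quoting, error
# handling and statement reuse come for free.
my $payment = $schema->resultset('Payment')->find($payment_id);

# And you can still drop down to vanilla SQL for the hairy reporting
# queries rather than writing them in ORM-speak:
my $report = $dbh->selectall_arrayref($big_reporting_sql, { Slice => {} });
```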
There are talks happening to address this, because there's a backlog of patches that could be applied, that kind of thing. And I did talk about this, I want to say, six years ago: using a legacy schema with DBIx::Class and how you can address some of those issues that you might have in your schema. Business objects, the model. So the older code is kind of a procedural mashup of business logic, database access, view logic and so on; it's all smushed into the same layer. The newer code we're factoring into business objects. And the key is that the business objects are our model; our ORM is not our model. People often conflate the two. And the reason we're doing it is to get all of this stuff. If you're doing object-oriented coding properly, you get all of this really nice stuff. It's not just calling a method on an instance of a class; you get really powerful, useful things. And we're using Moose. We were previously using Mouse, but we're moving to Moose for reasons that I won't go into here. Corinna is one to eventually look at; an early version of that has been added to the core in 5.38. Ovid's going to talk about that a bit later, so I won't go into it too much. But just a quick example, this is the kind of thing we're doing. We're dealing with payments, so we have this incoming payment class, and it has an attribute that references a bank statement, and we're using type constraints. So we can properly constrain that it has to be an object of this type with an ID, and we can throw a useful exception if we try to put something in there that shouldn't be in there. And then we can use the tell-don't-ask principle. We can say: fail that payment. And then the logic is in one place. We're throwing exceptions if things aren't in the right state, and then we're delegating to the bank statement object to then fail its payment. So it's all nicely isolated, easy to test. So yeah, Moose. Again, what are the issues and the challenges?
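The incoming-payment example can be approximated in plain core Perl; the real code described here uses Moose with proper type constraints, so the hand-rolled check and all of the class, attribute and method names below are stand-ins, not PayProp's actual API:

```perl
#!/usr/bin/perl
# Plain-Perl approximation of the tell-don't-ask example from the talk.
use strict;
use warnings;

package BankStatement;
sub new { my ($class, %a) = @_; bless { id => $a{id}, failed => [] }, $class }
sub id  { $_[0]{id} }
sub fail_payment { my ($self, $payment) = @_; push @{ $self->{failed} }, $payment }

package IncomingPayment;
sub new {
    my ($class, %args) = @_;
    my $stmt = $args{bank_statement};
    # Hand-rolled stand-in for a Moose type constraint: must be a
    # BankStatement object that has an ID, or we throw a useful exception.
    die "bank_statement must be a BankStatement with an id"
        unless ref $stmt && $stmt->isa('BankStatement') && defined $stmt->id;
    bless { bank_statement => $stmt, status => 'pending' }, $class;
}
sub fail {
    my ($self) = @_;
    # Tell, don't ask: the failure logic lives in one place, and we
    # throw if the object isn't in the right state.
    die "can only fail a pending payment" unless $self->{status} eq 'pending';
    $self->{status} = 'failed';
    $self->{bank_statement}->fail_payment($self);   # delegate to the statement
}
sub status { $_[0]{status} }

package main;
my $stmt    = BankStatement->new(id => 42);
my $payment = IncomingPayment->new(bank_statement => $stmt);
$payment->fail;
print $payment->status, "\n";   # prints "failed"
```

Callers say `$payment->fail` and never reach inside the object to flip flags themselves, which is what keeps the logic isolated and easy to test.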
Well, again, the learning curve. If you've not used much object-oriented programming, this is a big paradigm shift. But I think it's worth it, because I think Moose is one of the best object systems available across any language. And then you add the MOP, meta-object programming, and you can use introspection and everything; Perl is very powerful when it comes to introspection. And there have been multi-day courses at the Perl Conference just on Moose, so it's impossible for me to even scratch the surface in a small section of a 20-minute talk. People often talk about the slow startup if you're using some of these frameworks and systems, but if it's in a persistent process, a Mojolicious server, that's not an issue: you load it once, it's loaded. If it's on the command line, well, yeah, it used to be slow, but things have caught up now, and you're probably running those command-line scripts once in a blue moon anyway. CGI scripts, we do use some of this, but we lazy-load. These are pages that are taking a couple of seconds to run their commands anyway, so the compile time of loading some of those objects is a tiny percentage of that. Yeah, mutable state, that's my technical debt. It's one of the things you learn: mutable state is bad. So in all our new code the objects are immutable objects. Refactoring and regression testing, and I'm talking about beyond unit and integration testing, because that's kind of the easy stuff. We're adding those for all new code, and any time we do refactoring we're making sure there's test coverage there and addressing any gaps. But what about those critical business scripts that have existed forever, have no test coverage, and basically run the business? How do you address the bootstrapping problem: you want to refactor them so you can work easily with them, but there are no tests, and you don't want to refactor them because there are no tests. It's kind of a catch-22 situation.
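The lazy-loading mentioned for CGI scripts amounts to using `require` inside the code path instead of `use` at the top of the file. A runnable sketch, with the core module List::Util standing in for a genuinely heavy dependency (the function name is invented):

```perl
#!/usr/bin/perl
# Lazy loading: the module is compiled only when the code path that
# needs it actually runs, so requests that never hit this path never
# pay the compile cost.
use strict;
use warnings;

sub biggest_payment {
    require List::Util;            # loaded on first call, not at startup
    return List::Util::max(@_);
}

# At startup the module has not been compiled yet:
print 'List::Util loaded at startup? ',
      ($INC{'List/Util.pm'} ? 'yes' : 'no'), "\n";

print 'max: ', biggest_payment(3, 9, 5), "\n";
```

For a page that already takes a couple of seconds to do its work, the one-off compile cost on first use is a tiny percentage of the request time, which is the trade-off described above.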
Well, this is Perl, so we've got some useful features we can use to work around that. One of the frameworks we've come up with is override libraries that we pass into scripts, which allow us to override various functions at various times in the lifecycle of that script's run. So here we are overriding the call to File::Slurper's read_text function, by saying: run this script with this override library path. And then we have these various blocks that will override calls, so we can kind of monkey-patch things. So we can add as much test coverage as we need and then start changing the script. So that's an example of how we do it, a bit down in the weeds, but I would encourage you to watch this talk by Nick. He talked about this at the Perl and Raku Conference last year. It goes into all the details of how you can do this: which blocks you can use to run when, how it works, and some of the issues around doing it, because you're actually adding technical debt when you do this. But we need that test coverage there. So the aim is: get the test coverage in place, refactor the scripts, refactor out the test scaffolding, and we're in a better place. This has been critical for some of the scripts we have, because they literally run the business and they literally had no test coverage. Well, they have test coverage now. Like I said, we don't move fast and break things. Contributing to CPAN. So yeah, we actively encourage contributions to CPAN. These are all the distributions that we've either written or taken over maintenance of in the last decade, which is the time I've been at PayProp. Stuff like some Mojolicious plugins. So there's this plugin for Mojolicious that allows you to profile your routes using NYTProf. It's really useful. I wrote some of this OAuth2 server stuff. If you've ever used OAuth2 and tried to implement the server side, it's a fun game; that hopefully makes it a bit easier. Third-party payment libraries.
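The monkey-patching idea can be shown in miniature with core Perl's glob assignment; the talk's real framework overrides functions like File::Slurper's read_text from a separate override library passed in on the command line, so the package and function below are invented stand-ins:

```perl
#!/usr/bin/perl
# Sketch: override a legacy script's dependency so tests can run
# without touching the script itself or the real filesystem.
use strict;
use warnings;

package Legacy::Config;
# Pretend this lives deep inside a 20-year-old business script.
sub read_config_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "can't read $path: $!";
    local $/;
    return <$fh>;
}

package main;

# What an override library does, in essence: replace the function that
# touches the outside world with a canned response.
{
    no warnings 'redefine';
    *Legacy::Config::read_config_file = sub { "canned config\n" };
}

# The legacy code path now runs against the override; no real file needed.
print Legacy::Config::read_config_file('/etc/whatever.conf');
```

With overrides like this in place you can add as much coverage as you need around a script that "runs the business", and only then start refactoring it, which breaks the catch-22 described earlier.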
We interact with third-party payment providers, so we've written some stuff. GoCardless do direct debit in the UK. TrueLayer is a newcomer; they're using the open banking spec, so I think they're going to get quite big in the coming years. And other stuff: we maintain CGI.pm, because we still have CGI scripts, and we've taken over maintenance of unmaintained libraries, Google Maps stuff and that kind of thing. The issues and challenges around that? Well, the pool of contributors to CPAN is shrinking. Libraries for newer services and APIs don't exist; often you'll find third-party libraries for every language except Perl, which is a shame. But modern APIs are RESTful and easy to create a third-party library for, and we're happy to throw somebody at it for a week or two, which is what we did with the TrueLayer one. They threw me at it for a week, and now there's one on CPAN. Navigating around IP issues, well, that encourages us to decouple our code, so that's actually quite a good thing. And finally, hiring devs new to Perl. I'd say Perl has been on the plateau of productivity for quite a while. Those that left it a long time ago don't know the current ecosystem; some are more than a generation removed from even Perl 5. Perl 1 was released in 1987, and Larry was probably prototyping it a long time before that. 5.10 can be considered the start of modern Perl, and there are people starting university now who were born after 5.10 came out. But it's still in a lot of places, and I know that because we've interviewed people. Some of these users can't talk about it: banks, the FAANGs, I won't emphasize which letter in FAANG, but we know there are people using Perl in these places. So I think the rumours of Perl's demise are greatly exaggerated, but it's kind of a known unknown at this point. And it's still being used in greenfield projects: the system that FOSDEM uses to review, annotate, cut, process, transcode and publish all of their videos runs on modern Perl.
So over a thousand videos this weekend are going through a modern Perl system. And its popularity has kind of normalized over the last two decades, I think. So it's hard to find Perl developers, but newcomers don't have preconceptions. That's my experience of interviewing, anyway. I think those under 30 either haven't heard of the language or haven't used it. And those who don't want to use it self-select out of the process anyway, because we are explicit that we use Perl in our job specs; we just don't require it unless we're hiring a senior Perl developer. And I find modern Perl is an interesting and enjoyable language to work with. Working with legacy code is not specifically a Perl thing. And we make sure to do all of this stuff, because you should be doing all of this stuff, and we're finding that in a distributed work environment you need to do all of this stuff. I've not really talked about this much in the past, but I have written blog posts, so check out the blog posts if you're interested. And the key is that you can be very experienced but still a newcomer, and that's absolutely fine. I think it's actually beneficial to the ecosystem and the community. So if you are, please speak up. We want to hear from you. And that's it. I don't think I have time for questions. So thank you very much. Thank you.
Open Food Facts: Learning and using Perl in 2024 to transform the food system!
I'd like to welcome Pierre. I've forgotten your last name, Pierre. Pierre Slamich. All right. That's, oh yeah. I think it's one of the more recent Perl projects started, isn't it? We created Open Food Facts in 2012, so it's a just-over-10-year-old project. It's basically a teenager. Right, let's welcome Pierre. And thank you, Lee, by the way; we use, we depend on your work. So I'm going to talk about Open Food Facts, and it's not going to be a very technical talk, but more like the experiences of people getting into Perl in 2024 to contribute to food transparency and to transform the food system. So I'm, yeah, I'm Pierre. I'm one of the co-founders of Open Food Facts. I'm not the technical guy; I'm the product manager, but I dabble in Product Opener, which is our Perl backend. So on the menu: I'm going to briefly introduce Open Food Facts for those of you who don't know it yet, then I'll have a part on starting Perl in 2024, so some portraits of our contributors, how you can have an impact on the food system with Perl, and finally some Q&A. So, about Open Food Facts: it's the answer to a very simple problem. How do you pick a product in the supermarket? You have many products and a lot of information; it's hard. If you want to pick one for your children, it's very hard. And then when you look at the packaging, you have this long ingredient list that you sometimes can't read, and the nutrition table, which personally I have never managed to make sense out of. And you have to make decisions every day to get food. So Open Food Facts is all about empowering users to have an impact on their own health, but also on the environment and the food system at large. So we kind of have this slogan: when you're in the supermarket, don't panic, organize. Trying to get together and have an impact on the food system. We've been nicknamed by the media the Wikipedia of food products. We have over 3 million products in 160 countries and languages.
Our data sources are crowdsourcing, so using your mobile phone you can actually add photos and add data, manually and with machine-learning help, and the food industry, which is beginning to realize that closing off their data doesn't make any sense. So we want transparency to become the norm. So I'm going to show you how Perl code in production is having an impact every day for millions. The first thing is the Nutri-Score, which you may have seen in Belgium, in France and in other countries. We started computing the Nutri-Score in 2015. It was just a scientific formula at the time, so we decided, okay, let's compute it on all the products we have and show it to people in the app. And we helped democratize the Nutri-Score before it passed into law. So this is a screenshot of something one of our contributors had done at the time: he pasted the Nutri-Score onto all the products using image-editing software. Fast forward a couple of years, and you go from digital to actually seeing a whole supermarket aisle full of Nutri-Scores, which shows that you go from digital to real-life impact. So I mean, not only the people who run the code, who use the software, but everyone can benefit, even people who don't care about it. So, from Perl code to real-life impact. And it goes even beyond just displaying the score. We started to realize that producers are actually changing the nutritional composition of their food products. So it's a systemic impact; code can have a systemic impact on the food system. It's absolutely bananas. What you can also do with a platform like this is compare products at a very large scale. So for instance, we are able to monitor the composition of Fanta, and as you see, it's not the same in every country. So basically we can show what the industry is trying to hide from us. We also help producers improve their products. One part of our software stack is the producer platform.
And we do some computations based on the nutrition table to actually provide reformulation opportunities: if you reduce sugar by 20 milligrams, you can actually go from Nutri-Score B to Nutri-Score A. So computing helps change the products as well. And yeah, brands are starting to... oh, sorry, I went a little bit too far. Yeah, brands are starting to... all those brands have actually started to share data and use the import systems, the mapping and import systems that are in Open Food Facts, which involve some kind of hairy XML parsing and all of that. And so yeah, they are sharing data in many countries at large scale. And to code this stack, we have Stéphane, the founder of the Open Food Facts project, but we also managed to get more coders on board, people who picked up Perl just to be able to contribute to food system transparency. They started learning Perl in 2022, '23, '24 just to be able to have an impact. And Lee, I can confirm that newcomers don't have any preconceptions. So for instance, Yukti picked up Perl in 2022 and she's improving the backend code quality. She's very serious about food transparency: she doesn't look at the front of the pack, she looks at the back, where the nutrition tables are. She wrote a lot of tests and bug fixes, and she's into Perl correctness. And she's obviously always trying to convert all the people she meets into Open Food Facts users. Stéphane, who coded much of the code, learned Perl in 1998 when he was at Yahoo. He likes to do origami in his free time. And some of the code base is things that he coded perhaps a little bit too quickly 10 years ago when he launched the Open Food Facts project. And he recently paddled in 10-degree water. Monalika picked up Perl in 2023 to improve the UI, the tests and the code. She was part of a program funded by the Perl Foundation to include more people in computing.
So she worked on product image protection to ensure that data quality stays constant, and on misuse and user management, email misuse and user management. Alex, who's a Python person but took up the Perl camel two years ago to contribute to Open Food Facts, is part of the permanent team and is using some of the tools you code in this room, so Proxmox, Sanoid and many, many more. Benoît picked up Perl in 2023 to improve the data quality system, and he's learning nutrition science almost as fast as he's picking up Perl. And John, who didn't do much Perl before, started learning Perl in 2022, and he's spending one day a week levelling up in Perl to be able to contribute to Open Food Facts. So I'm going to go a bit faster, but as you see, the dynamic of people picking up Perl is actually very much alive: young people, girls, etc. are actually learning Perl to be able to contribute to Open Food Facts. So John, he introduced Perl::Critic to the pipeline, and we thank him for that, somehow. So, a bit more technical. Our backend is Product Opener, so it's the backend for the web version. It's a monolithic system for the web, so there's no front-end/backend split, but it's also providing the API of Open Food Facts. So it provides the database, the read-write API, the website, the producers' platform I talked about, the analysis and enrichment of product data, so a lot of regexes in every direction, and the computation of scores from the data in the nutrition table. We are then able to calculate the Nutri-Score for nutrition, NOVA for ultra-processing, and the Eco-Score, which is even more complex to compute, for the environment. So a lot of ingredient parsing, very hairy stuff. And this is what the architecture looks like: we use mod_perl and Apache to basically query the products, which are stored as Storable files on the filesystem. We are then able to fulfil the user queries, and for aggregated queries we store everything in a MongoDB database for more complex queries.
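An aggregated query of the kind described a little further on, "organic orange juices sold in Belgium with no additives", might look roughly like this in the mongo shell. The `*_tags` field names follow Open Food Facts' normalized-tag convention, but the exact values here are illustrative, not taken from the talk:

```js
// Hypothetical mongo-shell sketch of a tags-based aggregated query.
db.products.find({
  categories_tags: "en:orange-juices",
  labels_tags:     "en:organic",
  countries_tags:  "en:belgium",
  additives_n:     0
})
```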
So the data structure is very hairy; food is a very complex matter. As the years passed, the data structure became more and more complex. You're probably seeing one-tenth of the data structure here. And we store all the... so this is the old interface, and we store everything. We store all the revisions of the food products as well, to see the evolution of food products over time. I told you that producers were evolving products to make them better, so we are able to basically go back in time by storing revisions of the products at a given time. So when people scan, it will read the product's Storable file, but only the last revision. We are also exploring, for aggregated queries, the possibility of migrating to PostgreSQL. So yeah, that's how we do a MongoDB query. The tags are the normalized version of the data, and then we are able to return products that match the specific query. It's very powerful. You can do very powerful stuff, like asking for orange juices that are organic and are sold in Belgium and possibly contain no additives, etc. So the website is in Perl. The business logic is in Perl: ingredient parsing and data analysis. We have taxonomies to structure the data and data navigation. The score computations as well, and importing the data from producers, and even a GS1 connection; GS1 is the standard way to share product data. And we also have a knowledge panel system, which is basically driving the app completely remotely: rich content, images and all of that. One thing we realized is that we have to make contribution as easy as possible, so we dockerized the project and we started adding documentation. We are also working on a better API; it's not a very RESTful API right now. And we refactor as we go when we add features, because the food system is constantly evolving. We also want to have a more service-based approach as opposed to the monolith, so we have introduced openfoodfacts-query for aggregated queries.
The Folksonomy Engine for additional properties. And our machine-learning stack is called Robotoff. And we are currently revamping search with Search-a-licious, and introducing Keycloak for identification. We are also trying to better document the API with OpenAPI, and adding more tests and integration tests, because stuff breaks, and stuff breaks often. Things we'd like to do on the technical side: the API v3; lowering the barrier to contribution, so probably using a modern web framework, as we don't use any. I saw that there was a Corinna talk; we are also considering Corinna instead of hash maps, anonymous hash maps, so that our data structures could be more documented. And globally, refactoring the code into smaller chunks, like something for the Nutri-Score, something for the Eco-Score. But one thing: we are not giving up at all. The core of Open Food Facts is and will remain in Perl. And then, yeah, also more design-ish stuff, because our interface is still monolithic and people need to be comfortable with Perl to actually do front-end stuff. So what's next for 2024? We'll go perhaps a bit faster. We are going to improve the mobile app and do some more machine learning, and also do something on Open Products Facts. The Nutri-Score is going to evolve this year, so a lot of the computation is going to change; we basically have to change the algorithm. It's still a very controversial thing at the European level; Italy is trying to block the Nutri-Score. And once we compute it, we will make it available to everyone. We also have the question of prices: we are launching into price collection due to inflation, because we want people to be able to compare prices and make sense of the ongoing situation, and scientists too. And the last and probably most interesting thing, Perl-wise, is the fact that we are going to merge all of our projects together. We currently have Open Food Facts for food, but we also have Open Beauty Facts for cosmetics and Open Pet Food Facts for pet food.
We actually launched those as April Fools' jokes a couple of years ago, but now people are asking to be able to scan anything. So we have four installations of Product Opener on four different servers, and we need to be able to bridge them all together. So in terms of architecture, you can imagine that it's going to require a lot of retooling. Open Products Facts is all about providing circular solutions to augment the lifespan of products: ensuring that they have a second, a third life, that you are able to repair them, to give them away. So, augmenting the life of objects with Open Products Facts, a data platform for the circular economy, and computing the carbon impact of products, and also Open Beauty Facts and Open Pet Food Facts. So actually the work is just starting: if you'd like to get involved, it's just about the right time. We haven't started actually retooling Product Opener for that. So in terms of helping, how can you contribute? I'm very well aware that you are probably already maintaining a lot of projects. The casual way is basically to scan and add products in your country; translation; spreading the word; designing; and of course, for those of you hacking, hacking on the code and fixing the code. The best way is just to try to install the Docker setup on your machine; it should be straightforward. Also, if you'd like to mentor, we will be part of the GSoC program this year, hopefully we will be, and we will also probably try to submit a project through the Perl Foundation. So if you'd like to mentor Perl projects on Open Food Facts, or actually take part yourself, it's not just for students anymore: as a professional, you can actually be part of this program. Be sure to get in touch. So how can you get in touch? Those emails; you can install the app using this QR code; and if you scan this QR code, you can actually have a link to leave your details and we will get back in touch if you want to become a volunteer.
Either a technical task or non-technical task. And that's it. So perhaps if you have any questions or no. Thank you.
Synergy: a chat bot framework
Thank you. Welcome to Ricardo from Fastmail. Thank you very much. Hey, okay. I did a timing run of this, but I had like zero sleep in 48 hours, so either it's going to run shorter or longer; maybe right on time, we'll see, but we'll get the tape in. Second. There we go. All right. So, imagine it. It's the future. The year 2018. And at Fastmail, all of our critical systems run through our chat bot. Right? Like, you want to deploy, you go to the chat bot. You want to set up a task for somebody else to do, you want to set a reminder, you go to the chat bot. And the chat bot is written for IRC, for an IRC service, because we're a cutting-edge company. And it's in charge of everything. Right? And then I got this email from Slack, and it said, hey, just so you know, in like three weeks we're turning off the IRC gateway. And I talked to the shareholders, who said they didn't want to close the company. So it turned out that we had to take this thing, this is Synergy, our bot, and go through a ragingly quick process to upgrade her to talk to Slack. So this talk is about that project. But it's also about the fact that when we did that, we totally rewrote, not every line of code, but all the lines of the interesting code, to make it not horrible to deal with. Because it was. This is the three of us who did this. Matt Horsfall, who's at the top middle there, is a frequent person at Perl things. He happened to be in town, he mostly works remote, and we said, great, let's drop everything else we're doing. We sat in a room for five days and rewired our chat bot. And it was great. It was written originally for Perl 5.16, which at the time was cutting edge. And it was written using POE, which was not cutting edge. Who here has ever used POE? Yes. Yeah? Okay. Sorry. This is me looking excited back when I was younger. Like, yeah, POE. No. This is POE code. Look, you don't need to know everything about this code, but there it is.
The thing you should notice is that it's pretty weird. Like, $_[ARG0], dollar underscore square bracket ARG0: what the hell even is that? Or $_[HEAP]. I use Perl so that I don't have to think about a heap, right? It's a mess. So what you need to know about POE to understand this talk is, like, nothing. Don't worry about it. But even in 2005, right, when I very first started writing the first line of this code, it felt weird to use. And it's not really POE's fault. The problem is that for a long time, any kind of concurrency felt weird, at least for me and at least in Perl. Anything you're going to do, you're like, why is my code now coming from outer space? And POE is just more weirdness that I didn't want to deal with. So my strategy for building the software was really simple: do as much as possible without POE. Don't write the POE code; that's where everything gets messed up. So only use it when you absolutely need to, like all this asynchronous talking to the network server. And you can make that statement generic by saying: concurrency is weird, and weirdness is hard to cope with in your program, so minimize the weirdness by writing less concurrency in your code. Minimize how much of your code has to do concurrency. So you imagine the program looks like this. You've got that magic IRC thing: that's where all the POE lives, that's where it's weird. And then a thing that gets messages and dispatches them to something that does something. And we tell ourselves that it works like this, right? The concurrency lives over here, and then there's the good code that we wrote. And the magic IRC thing does its magic, and it calls the dispatcher, and the dispatcher calls the handler, and the handler sends things back to the IRC thing, and you're good, right? That's it. And the problem is, that's not how it works, okay?
Like, some abstractions let you believe lies and they're good, and some let you believe lies and they hurt you. So you imagine it: subroutines form a stack, right? A subroutine calls a subroutine calls a subroutine, and it returns and it returns and it returns. And you can violate that, but, like, don't tell me about it. The handler down here has to return. So either the dispatcher is getting the return value from the handler and passing it back to IRC to send a reply, or the handler is doing some weird thing to send a reply before it returns, right? So what's actually happening? Let's say it's the first one. We're going to engage in a little Socratic method, go through the logical process. A message comes in to IRC, a network message, and it turns it into something that can go to the dispatcher. The dispatcher sends it to the handler. The handler sends it back to the IRC thing. The circle of life, right? Great. No. What actually happens is it comes in from the network and it goes to the dispatcher and it goes to the handler. The handler is like, I got this, but it's going to take a minute; I need to look up 70 million rows in the database. And meanwhile, everybody on IRC is still sending these other messages, and you're not talking to the network anymore. Your asynchronous thing is sitting there like, I would be so busy being asynchronous if you would just yield to me. And you don't, because you're avoiding putting concurrency everywhere you can. And pretty soon the whole thing falls apart, and you lose all your messages, and everybody's like, why aren't my deploys working? Because of IRC. So the other option has to be true, right? The handler is doing a thing. So a message comes into IRC. It goes to the dispatcher. It goes to the handler. The handler has to do something, because this thing's happening. So it sends it back to IRC, but now it's blue. Right, now it's a different message. It's not the "you've got a message" message.
It's "I want to send a message," and everything's good. This is no longer just IRC; it's all your async. Your stuff comes in, it goes over there, it keeps going, you're good. Now you need something to handle both kinds of messages: one for "you've got a message, go do something," and one for "you're going to send a reply." These boxes should be labeled differently. You're fine. For every kind of message that comes in, you've written your own simple, pretty much blocking, but okay handler. You don't even need to dispatch anymore. You just tie it onto the message: here's where I go, and you call me. Great. Your code got simpler. The problem is that making a ticket involves talking to the database, which in non-blocking terms means starting to talk to the database, doing the talking to the database, finishing talking to the database, dealing with an error. So you have to write all of these little pieces of code wonderfully. They're not concurrent, right? They just block if they need to, or they just get called once. They're not doing anything weird. And then your program looks like this. Ah! There's a name for this; this is roughly the dumbass version of the actor model. Like Zoolander code. But it can be good. I just came from the Erlang room. The Erlang room is cool. Actors are cool, but you don't write Perl code that way, which means your Perl code feels weird, and we actually want to write Perl code that feels like Perl code. So here's what we do. We make a message, and the message contains its own reply handler, right? You're like, I'm going to send you a message, and don't worry, you don't need to go write all these million things. When I send you the message, I put in a self-addressed stamped envelope: if this happens, send this envelope, and that's your little reply handler. And now your code still looks like Perl code. You're good. And you do this all over the place.
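The self-addressed stamped envelope idea can be sketched in a few lines of plain Perl: each message carries the callback that should receive its reply, so the handler never needs to know how routing works. The names (`make_message`, `handle`) are illustrative, not Synergy's.

```perl
use strict;
use warnings;

# A message that carries its own reply handler: the "self-addressed
# stamped envelope".
sub make_message {
    my ($text, $on_reply) = @_;
    return { text => $text, reply => $on_reply };
}

# The handler just does its work and drops the answer in the envelope,
# without knowing anything about channels or routing.
sub handle {
    my ($msg) = @_;
    $msg->{reply}->("you said: $msg->{text}");
}

my @outbox;
my $msg = make_message("make a ticket", sub { push @outbox, $_[0] });
handle($msg);

print $outbox[0], "\n";   # you said: make a ticket
```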
Like, when you're setting up the listener, you're like, yeah, okay, I'm going to bind to the socket, and if there's an error, here's what you do. And if you do connect, here's what you do. And by the way, after you've connected, once you start receiving packets, here's what you do. Right? And you're nesting all these envelopes, and it's great. You've got a pile of top-level envelopes and you're like, yeah: I'm going to listen, and then maybe bind, and then maybe connect, and then maybe accept. And over here, I'm going to do an lstat on the file. So now it's easy. I'm going to create a file over here and then do some stuff with it. I'm going to poll, and we've got all these nested things, and everything piles up, and everything is an envelope in an envelope in an envelope, and it doesn't look like Perl code anymore. There's a name for this pattern, by the way. They call it callback hell, because this is what it feels like. Thank you. All of your code is just callbacks. There are no named subroutines anymore. So, you wanted to write this code. Okay, this is all you wanted. You just wanted to make a ticket. You got a message from the network. I put the whole thing up here; I was going to go through it line by line, but whatever. You get a message and you parse it, and it's like, here's the ticket you should make. That's the plan. And then you say: if they're allowed to make the ticket, good; but if they're not, you reply no and you return, right? You're done. Crash early. And then you make the ticket, and then you reply: I made the ticket. That's the code you want to work. This is the perfect platonic expression of a chat bot: I got a message and I did a thing. And the problem is, these three things block. And this is where your whole program just starts falling apart, because you've got like 75 kinds of event handlers that all look like this, and they all block. So it's okay.
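Callback hell in miniature, as a runnable sketch: the "async" operation here is fake (it calls its callback immediately), but the shape is the point — every step is an anonymous sub nested inside the previous one, and nothing has a name anymore.

```perl
use strict;
use warnings;

# A fake async primitive: record the step, then invoke the callback.
# Real code would hand the callback to an event loop instead.
my @log;
sub fake_async { my ($step, $cb) = @_; push @log, $step; $cb->() }

fake_async(listen => sub {
    fake_async(bind => sub {
        fake_async(connect => sub {
            fake_async(accept => sub {
                push @log, 'handling packets';
            });
        });
    });
});

print join(' -> ', @log), "\n";
# listen -> bind -> connect -> accept -> handling packets
```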
You can fix this problem by using sequencing, by leveraging promises or futures. And all you have to do is make your code look like this. This is just another kind of callback hell, right? You're just lining all this stuff up, and you'll end up being like, I'm living in the future, it's amazing, I can write all my non-blocking code; but you're so sad inside, because it's all these anonymous subroutines that you can't debug, and they're real bad. So remember when I said this earlier? Concurrency is weird, so minimize it by minimizing the concurrent code. That was bullshit. Don't do that. You need to lean into it. The problem is this: when you minimize the concurrent code, you write crappy programs, because you write programs where all the weird shit's over here and everything else is coping with it. All of your code is just, I'm here to cope. Don't do that. What you want to do is get the language to hide that complexity for you. The language is like, don't worry: you write the code you want to write, and I'm going to make it work. And then you make the code concurrent at the slightest provocation. You're like, oh, this might block? Concurrent. That's what you do. And you can do that now because of async/await. And that's what I'm going to talk about for a little while. I promise I'll get back to the chatbot. So you take this ugly-ass code, right, where you're like, do this, and then call this other code, but then call this other code, and if it fails... you don't need to read this. I've read it once; that's enough for all of us. You write this instead. It's just like that beautiful, perfect platonic code, except I stuck some green stuff on here. This sub is now asynchronous; it can yield. And this line of code: I will yield here if I need to. That's all you're saying. I've identified that this code might block. I don't know. Let something else figure it out.
And how does this actually work? Well, when you do this, something — it's called Future::AsyncAwait — takes this and pulls the whole subroutine apart into different units, and it's like, I'll put these together the right way. Don't sweat it. I'm going to make it work. And kind of what it puts it together into is this. Kind of. The reality is that what it's really doing is gross and scary, and it involves mangling optrees and putting them together. But that's what all Perl code is anyway. All this time that you write Perl, it's just building some crazy-ass optree, and maybe there's one person in this room who thinks about optrees every day. Hi, Paul. Most of us don't have to do that, and you still don't have to. So the conclusion of this long digression about async/await is: you should embrace this weirdness, right? Make your code concurrent easily, all the time. Embrace the stuff so hard that all the weirdness becomes part of you and you don't think about it. But the weirdness is there, making you powerful and making your code better. Just use Future::AsyncAwait. Okay. I'll talk more about it later if somebody asks; I like talking about it, it's very good. So let's talk about Synergy. If there's an unopened bottle of water in this room, I would definitely drink it. Okay. So you can find Synergy here. The link will show up again later; you can ask me for it. You can install it. It's super cool. If you install it and it doesn't work, I'm sorry, and that's all you're getting out of me. I might answer a question. We don't support this. This is software written in the open, not a public project that we want everybody to use and adopt. If you come and find bugs, we might fix them, and we might say, that's a cool bug you found; here's how it works. There are basically three abstractions in Synergy that you need to know about. Channels, where messages come and go.
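Here's a hand-written sketch of what that transformation amounts to conceptually. This is NOT what Future::AsyncAwait literally does (the real module rewrites optrees); it just shows how a sub with two `await` points corresponds to code split at each boundary into a continuation. The sub names (`check_permission`, `create_ticket`) and the synchronous "futures" are invented for illustration.

```perl
use strict;
use warnings;

# With Future::AsyncAwait you would write, roughly:
#
#   async sub make_ticket {
#       my ($req) = @_;
#       my $ok = await check_permission($req);
#       return await create_ticket($req) if $ok;
#   }
#
# Each `await` splits the sub into a stage that resumes later.  Here
# the "futures" are plain callbacks, resumed synchronously.
sub check_permission { my ($req, $k) = @_; $k->(1) }                 # stage boundary 1
sub create_ticket    { my ($req, $k) = @_; $k->("ticket for $req") } # stage boundary 2

sub make_ticket {
    my ($req, $done) = @_;
    check_permission($req, sub {          # everything after the first await
        my ($ok) = @_;
        return $done->(undef) unless $ok;
        create_ticket($req, sub {         # everything after the second await
            $done->($_[0]);
        });
    });
}

my $result;
make_ticket("login bug", sub { $result = $_[0] });
print "$result\n";   # ticket for login bug
```

The point of the comparison: the commented version reads like straight-line Perl; the desugared version below it is the anonymous-sub soup you'd otherwise write by hand.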
And when I say messages — because in concurrent, object-oriented networking code, "message" can mean a lot of things — messages means chat messages, right? Like, "hello, how are you?" Those messages. And a reactor, which decides: should I react to this message? So that is the Synergy software diagram. There you go. That's it. You understand Synergy now. And I'm almost not joking; it really is about that simple, which is why it's nice to use. But let's look at the code and answer the question. Most of the time, when you work with Synergy, you connect Synergy's channels to your chat system, and then everything is about the reactors: what does the bot actually do? So that's what we should look at first. This is a reactor. It's a reactor that I use a lot when I don't understand why Synergy did something. I ask for the uptime, and Synergy says, I've been up for four seconds, and I say, aha, well, you just crashed. Here's how it works. It's a package. It's a class. Everything in Synergy is written with Moose. And this one does a role called Synergy::Role::Reactor::CommandPost. The Reactor role means it's a reactor, and CommandPost is so that later, at the bottom, we can say: here's a command. You can write reactors in lots of different ways. I've been spending lots of my free time converting all of the old-style reactors, which were called EasyListening, into the new style, which is CommandPost. You do whatever you want, I don't care, but use CommandPost. It just lets you write a bot really easily. And then the meat is this one thing: the command takes a sub, and that's what runs. So when someone says, hey, Synergy, what's your uptime?, this subroutine runs, and it figures out how long I've been up — the duration since the process started — and replies. So we got a message, this event, and we call reply on it. And yes, I am guilty: I have actually stuck into the message the ability to reply directly to it.
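A toy version of that command-to-sub idea, to show the shape: a reactor is little more than a map from command names to subs, and each sub gets the event and calls reply on it. This is a sketch, not Synergy's actual CommandPost API; all names here are illustrative.

```perl
use strict;
use warnings;

my $started_at = time - 4;   # pretend the process booted four seconds ago

# The "reactor": command names mapped to the subs that run for them.
my %commands = (
    uptime => sub {
        my ($event) = @_;
        my $dur = time - $started_at;
        $event->{reply}->("Online for ${dur}s");
    },
);

# The "hub": pick the command out of the text and run its sub.
sub dispatch {
    my ($event) = @_;
    my ($cmd) = split ' ', $event->{text};
    ($commands{$cmd} // sub { $_[0]{reply}->("unknown command") })->($event);
}

my $answer;
dispatch({ text => 'uptime', reply => sub { $answer = $_[0] } });
print "$answer\n";
```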
There is some small amount of callback hell. That's maybe the last instance of it you'll see. So this is a reactor. You don't really need to know almost anything about asynchronous code, other than: make sure you write async and await in the right places, and everything will work. So you could at this point install Synergy, connect it to something, and be happy. But we're going to keep talking. The one last thing I should talk about on this slide is $event. $event is the object that represents the message. I'm really sorry that I've called it both "message" and "event" — taking, you know, two useful names that mean the same thing and using them to mean the same thing, when I could have made them mean different things. I guess that's better. Here's what the event looks like. An event has text; that's whatever the user typed. It has a channel it came from, right? We said channels are how you connect to your chat network; that's the channel. It has a from address; if you're on IRC, that's like the channel again, sorry. It has the user it came from, if it came from a known user. And: was it said in a public channel or in DMs? Was it said at me? Right, like, did someone say "synergy, what time is it?", or did someone just say "what time is it?" Because you don't want the bot to respond to everything, and send a reply to a reply to a reply. That's it. So this is basically the stuff a normal reactor does. So now you know, right? Channels, reactors, and you've seen a specific reactor. Great. Now you know how to handle events: you get an event object, you call reply on it, and you do whatever you want in that sub. Where do they come from? They come from channels. I'm going to talk about how channels work and how you can make one. But the short answer is: don't. There's a Slack channel. You might remember from the top of this talk that we needed a Slack channel, and that's why we wrote this whole stupid thing. Synergy's not stupid. Synergy's great.
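The fields just described can be sketched as a tiny event class. The field names here are paraphrased from the talk, not Synergy's real attribute names, and the toy channel is just a hash with a send callback.

```perl
use strict;
use warnings;

package ToyEvent;   # a sketch of the event's shape, not Synergy's class

sub new {
    my ($class, %arg) = @_;
    return bless {
        text         => $arg{text},          # whatever the user typed
        from_channel => $arg{from_channel},  # channel object it came from
        from_address => $arg{from_address},  # network-level sender address
        from_user    => $arg{from_user},     # known user, if any
        is_public    => $arg{is_public},     # public channel or DMs?
        was_targeted => $arg{was_targeted},  # did they say "synergy, ..."?
    }, $class;
}

# Reply goes back out through the channel the event arrived on.
sub reply {
    my ($self, $text) = @_;
    $self->{from_channel}{send}->($self->{from_address}, $text);
}

package main;

my @sent;
my $chan  = { send => sub { push @sent, [@_] } };
my $event = ToyEvent->new(
    text         => 'what time is it?',
    from_channel => $chan,
    from_address => '#general',
    is_public    => 1,
    was_targeted => 1,
);
$event->reply('beats me');
print "replied to $sent[0][0]: $sent[0][1]\n";   # replied to #general: beats me
```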
There's a Discord channel, because I don't do my personal chatting on Slack. There's an IRC channel, although it doesn't work; I'm probably going to try to bug Paul to get some help on it. It works for a while and then falls over. There's a Twilio channel, so you can SMS with your bot. There's a console channel, which we'll talk about. And there's a test channel, because of course you can write automated tests for the thing. Channels are kind of a pain to write. This is the place where you can't just minimize away the complexity of those things you thought you could make not concurrent. You have to make them concurrent, but it's easy. At some point, though, connecting to a remote web service over web sockets, and handling different kinds of frames, and dispatching all that, and reconnecting — that's complicated. So there's an irreducible complexity here. The good news is you won't need to write one, but I'm going to show you very roughly what it would look like. You'd have something like this. So this is a stupid subroutine that, every five seconds, sends an event. What does it do? It makes an event object saying the user said "boop", and it tells the hub to handle that event. The hub is that box in the diagram with Synergy's face on it. It drops the event in there, it goes to all the reactors, and everything good happens. But to see how the channel really works, we're going to look at the console channel. The console channel is for working at the terminal. I'm sorry if you can't read this stuff. I did what I did. So here I'm going to run Pizzazz. Pizzazz is my local testing instance of Synergy. It just fires up Synergy with a bunch of reactors sitting in the console so I can test with it. I run it, I get my little "I've started up," and I say "uptime". That's the reactor; we've all seen how it works. And Synergy replies and says, I've been online for one second. So that's it, right? This is how I use Synergy when I'm developing: I stick the reactors into the console and I test there.
Because, if you've ever tried connecting a chat bot to Slack: you'd think that for a company that makes a chat product, they'd want to make it easy. But they do not. It is a real pain in the butt, and about every 18 months they change the way you connect a bot. Discord's much easier, and it's documented in the repository how to do it. Slack I haven't bothered with. But if you look at the top of the screenshot, you see console online, console channel online, console channel online. Why are there multiple console channels? That's a great question. I'm glad you asked. Here's another reactor. This is the announce reactor. Back when we were on IRC, we didn't have our work chat on our phones, right? That was before Slack at all. We just didn't have it on our phones. But you might be at lunch, and lunch is running long, and you want to say, I'm late getting back. And there was a Twilio channel, right? So you text the bot and you say, "announce I'm still eating". And then Synergy would receive this message on the Twilio channel, it would go to this reactor, and this reactor says: okay, I got the event; is it from the channel I want to send to, the to_channel_name? We'll come back to that. If it is, I say, what are you doing? You're telling me to announce something, but you're already in the announcement room. And otherwise, she'll look up the channel, the to_channel, and send a message there saying this. So when I would text the bot saying I'm still at lunch, the bot would post a message in IRC saying, Rik says he's still at lunch. And this all works because you can have multiple channels in your Synergy. This is one of the really keen things about writing asynchronous code: you can have lots and lots and lots of things in your process, and they all work. You can have lots of consoles that talk to each other. So here, in my testing environment, I've spun up several console channels.
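The announce flow above can be sketched in a few lines: the hub knows channels by name, and the reactor is configured with a to_channel_name string that it resolves at send time. Channel names, field names, and the `announce_reactor` sub are all illustrative, not Synergy's real code.

```perl
use strict;
use warnings;

# The "hub": channels, looked up by name.  Each toy channel just
# records what it was told to send.
my %hub_channels = (
    twilio => { sent => [] },
    irc    => { sent => [] },
);

sub announce_reactor {
    my ($event, $to_channel_name) = @_;

    # Don't announce into the room the request already came from.
    if ($event->{channel_name} eq $to_channel_name) {
        return "You're already in the announcement room!";
    }
    my $target = $hub_channels{$to_channel_name}
        or die "no channel named '$to_channel_name'";   # crash early
    push @{ $target->{sent} }, "$event->{from} says: $event->{text}";
    return "announced";
}

my $status = announce_reactor(
    { channel_name => 'twilio', from => 'Rik', text => 'still at lunch' },
    'irc',
);
print "$status: $hub_channels{irc}{sent}[0]\n";
```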
Now, only one of them is getting my input, because I can only type into one terminal at a time, unless I want to do something really weird. And I've set up the announce plugin. And I can say "announce..." — yeah, I was going to do a live demo of this, but I didn't, because I've got enough going on. And what you see is: on the input-output terminal, Synergy says, great, I announced it, thank you. And on the announcement one, you see the message come in. So this testing environment is simulating multiple channels. I also have a purple channel, which you won't see in this deck, representing Twilio. So you can say, this should page somebody's phone with an emergency, and you'll get the page showing up here like, yeah, I would have sent a text message, you're good. So it's all nice and simple. The one thing you might be wondering is: what's up with to_channel_name? In the world where that's IRC, to_channel_name here might just be a string. And it says: that's how you're going to go find the channel off the hub. Which channel am I sending to? This one. But where did it come from? How is this set up? How is it configured? Well, remember, all the channels and all the reactors and everything else are Moose objects. So there's an attribute on the object called to_channel_name, and it's a string. Now, if that's all we did, we'd be a little screwed, because at some point someone would try announcing, and we'd realize we had a typo in there, and it would crash at run time. So also, when the reactor starts up — when Synergy is really booting up and connecting — she'll say, do I actually have a channel called that? And if not, crash. Crash early, everybody. But that's it. All the reactors work this way: they're all configured with attributes on the objects, which is what you want. That's just one more turtle, right? But where did it come from? This is the bottom turtle. It comes from a config file.
So you've got a config file where you list all the plugins that you want — all the reactors, all the channels — and all their properties. And somewhere in here at the top you'll see the announce reactor, and it says: here's the address that I send to, and here's the channel on which I will send to that address. And then you'll see all these other reactors that are configured just the same way. The clocks reactor: which time zones do I care about? Melbourne and New York. There's a dc reactor that you can use to run dc calculator programs. I didn't write that. Okay. So now we've written a channel, we know how channels work, and we know that all the stuff comes from configuration. That's great. Now we're going to talk about Linear. Linear is not part of Synergy. If any of you don't know about it, Linear is a bug tracker. It's a work-tracking system we use for running our scrums and stuff. It's really, really good. I like it a lot, and I'll tell you all about it whenever you want. But what you do need to know is that Linear, like a lot of web services, does webhooks. So you can say: something happened to one of my issues, and a POST gets sent to wherever you want, saying a thing happened to one of your issues, and you can respond. This is great for, like — I track a calendar, right? And if somebody moves an event on the calendar, I get a POST telling me this thing's been rescheduled; consider whether your whole day has just been upended. Webhooks are great. And Linear uses them, and we want to react to them. One of the things we use them for is escalation. Escalation inside Fastmail basically means a customer made a ticket, and the support team, who are great, don't really know what's supposed to happen next. They escalate by taking the ticket and saying: escalate it. They put a flag on it, and it goes to the developers. And when we do that, we want to do something like this, right?
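The config file described above might look roughly like this. This is a guess at the shape based on the talk — the keys, class names, and values are illustrative, not Synergy's actual configuration schema.

```json
{
  "channels": {
    "console": { "class": "Synergy::Channel::Console" }
  },
  "reactors": {
    "announce": {
      "class": "Synergy::Reactor::Announce",
      "to_channel_name": "console",
      "to_address": "#announce"
    },
    "clocks": {
      "class": "Synergy::Reactor::Clox",
      "time_zones": [ "Australia/Melbourne", "America/New_York" ]
    }
  }
}
```

The important property is the one the talk calls out: every reactor's attributes (like `to_channel_name`) come from this one file, and channel-name references are validated at startup so a typo crashes early instead of at announce time.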
Make a message that just says: this issue got escalated by so-and-so, and here's the link. And we send it to the escalation address, right? Which is #escalation in the Fastmail Slack. And this is straightforward; I think if you've followed things so far, you follow this, except you might be wondering: where do you put this code? Right? It's got to go someplace, so that's a good question. You're not going to put it in a command, like uptime, because there's no command to say, hey, check it, you got a webhook. That's not what a webhook's about. And it's not in a channel. Remember when I tediously explained that channels are about chat messages, not just generic messages? So it's not in a channel. Where is this POST going to go? The answer is: it goes in a reactor. It doesn't need to be in a reactor; it's just where we happened to put it. It's not because it's a reactor; it's because it's got this role called HTTP endpoint. And you say: in addition to reacting to chat messages, this thing is a web handler. And you say: I want to take the path linear-notification. So when you connect this thing up, /linear-notification will now be a path that you handle. And how do you handle it? Well, you've got some async sub that is a Plack handler, because if you're writing web stuff in Perl, you probably want to do it with Plack. And that's kind of it. I mean, look, there's a whole bunch of code here that's figuring things out — getting the thing, authenticating it, figuring out who's who — but this is basically it. You say: this hunk of plugin (and anybody can write a plugin) requests a path from the web service and mounts a Plack application on there. And then at the end it says, like, yeah, and then return 200. So that's the HTTP endpoint. How does that work? Synergy runs a web server.
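The reason "mounts a Plack application" is such a small ask: a PSGI application — the contract Plack builds on — is just a coderef that takes the request environment and returns `[status, headers, body]`. Here's a sketch; the payload handling is invented, not the real linear-notification code, and no web server is needed to exercise it.

```perl
use strict;
use warnings;

# A PSGI application: coderef in, [status, headers, body] out.
# The webhook handling here is a placeholder sketch.
my $app = sub {
    my ($env) = @_;
    return [ 405, [ 'Content-Type' => 'text/plain' ], ["POST only\n"] ]
        unless $env->{REQUEST_METHOD} eq 'POST';

    # ... authenticate, parse the webhook payload, post to Slack ...
    return [ 200, [ 'Content-Type' => 'text/plain' ], ["ok\n"] ];
};

# Call it directly with a fake environment hash, no server required.
my $res = $app->({ REQUEST_METHOD => 'POST', PATH_INFO => '/linear-notification' });
print "status: $res->[0]\n";   # status: 200
```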
And you say: I want web service to be provided on this port, and all of the channels, all the reactors, every other thing that has an HTTP endpoint, mounts onto those paths. Conflicts are detected at start time, and it crashes. And then when a request comes in, Synergy dispatches to the right place, and because they're all asynchronous, they can all interact. And that's a really important point, right? This whole diagram — every reactor, every channel, every HTTP endpoint — they're all in one process. It's just one program that's running with everything loaded into it. And to share data, they share memory. There's no IPC, and this is a big win. Now, I don't want to say that IPC is bad, that IPC is the enemy, and I certainly don't want to say everybody should share memory to share information; these are big, broad claims. But we do have to talk about IPC sometimes. IPC solves problems, right? What does IPC mean, by the way? It's inter-process communication: it lets you have two processes talk to each other. But that's not the solution to a specific problem, right? That's not valuable per se. It's valuable because you have a problem that you could solve by having two processes talk to each other. And the question is: when does that problem arise, and when is IPC the right solution? Well, a good one is: if you have different parts of your system that scale differently, need different kinds of resources, need different access to things, maybe different processes are useful. You can scale up more workers — thank you — you can scale up more workers, you can scale down workers; that might be really useful. Maybe you have security constraints: this process needs access to certain constrained resources, needs to have these namespaces, needs to talk to the kernel, whatever; and this other part of the system doesn't. That's a good reason to have two processes.
And maybe you have work where multiple things need to be happening at once, and you have multiple processes to eliminate blocking, which would cause your code to be sequential when it's not meant to be sequential. This is, you know, where we often would have multiple programs running, or things forking. And it's fine. But remember that any time we add a solution to a new problem to our program, we're almost always adding more code. And when we're adding more code, we're deforming the program from that ideal platonic version where we're like, well, if I could just write it, it would look like these eight lines. And then we go add all the code that solves all the problems we don't want to think about. What we always want to be doing as programmers is picking the changes that deform our platonic program the least. A program is always a compromise between these things. Once upon a time, it was pretty clear — especially in languages like Perl, but kind of in a lot of programming — that if you had to eliminate blocking, the easiest, most effective thing to do was to have multiple processes, right? Fork is a great example. I need to be able to handle a lot of requests; I'm going to fork. Yeah, that makes sense. Forking's easy. It solves a lot of problems. And then later you have to introduce IPC, because that's how life goes. But, you know, that's what you're going to do. I don't think it's this clear-cut anymore. I think that at this point, when we want to eliminate blocking and have more communication between multiple concurrent operations, we all need to be reevaluating whether forking plus IPC is the answer to jump to in Perl anymore. I don't think it always is. I think it often is not the right answer anymore. And that's because of async/await. Async/await is really, really powerful, and it really moves the lever on which solutions you should be picking. It's not just a Perl thing, by the way.
Hopefully everybody here writes in other languages too; it's important to put your eggs in multiple baskets. You'll find this abstraction in a bunch of places. It's very good. Okay, one more thing. I've got a little time left. So, take a breath. Got quite a bit of time left, which is good. So, we've got channels and we've got reactors, and we understand those. And we've got these HTTP endpoints. And there's some other stuff in here; maybe we'll even talk about more of it. But at some point, I thought, you know, it would be really cool to stick a telnet server inside of Synergy. So we built a thing. It's not really telnet; telnet's actually a protocol, and it has all kinds of weird stuff in it, like control characters. Don't learn it. It's a netcat server. So there's a netcat server — one of these plain TCP stream things — built into Synergy, which is called the diagnostic uplink. So here I am, back at my terminal. I run my local development server with a diagnostic uplink available on localhost port 4321, because I like those numbers. And when you telnet in, you get greeted with this: Welcome to Synergy. All right, you have connected to the diagnostic uplink. Would you like help? Of course I would; I don't know how to do anything. So I say /help, and it's like, here you go. You've got some diagnostic commands. You've got notifier commands. Stuff for inspecting a running Synergy. Because when Synergy is acting up — when your critical chat bot is sitting there acting weird, and you don't know why it's doing that, and you don't know what's happening — you know, you can reboot it. And that's fine. Thank you. You hope that's going to be okay. You can look at the logs, and I make a lot of logs; that might help, but most people don't write logs, and that's not going to help. But another great answer is: just connect to the thing and ask it questions. So, you can say, tell me about your configuration.
I'm running a web service here; here's this file. You can say — I don't show it here — show me all the endpoints that your web service listens to, so I can see all of those. You can say, show me all of the notifiers currently connected to the event loop, and it's going to show you all the things going on in there. They get names as they're generated, so you can see things like: yeah, there are 47 open web requests all talking to GitLab. Well, that's probably a problem. Really useful. You can also get this guy. This is so good. Eval. So you can connect to the diagnostic uplink and instruct my Perl program to evaluate a string of Perl code in the context of the running bot. So here I am saying: hey, bot, tell me your name. I'm Synergy. Great. What's your reference address in memory? Here you go. These are stupid examples; you never need to know the refaddr of the bot. But you can do things like connect to the bot and instruct it to change its configuration as it runs. You can connect to the bot and add and remove reactors. You can do anything that you can do with eval, as long as you're happy typing it into one line, because I have not implemented multi-line input. It wouldn't be that hard; I'm super lazy. Okay. That's everything I planned to talk about. We have a couple of minutes left. I'm happy to take questions. Yes. [Audience member:] This might sound confrontational, but it is not. I'm an Elixir developer. The code you showed looks very much like how you would actually write it in Erlang or Elixir. Yeah. So why actually use Perl for use cases like this? No, it's a great question. [Crosstalk:] The question is... is that it? Maybe the async stuff, the tasks, could be Perl, but actually the... The framework, yeah. No, it's a great question. The question is: why do this in Perl when Erlang or Elixir is a much better language for it? And I'm not trying to put any tone into that at all. That's the question.
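The eval feature is, at heart, Perl's string `eval` run inside the live process, with the bot object in lexical scope. Here's a sketch — `diagnostic_eval` and the toy `$bot` hash are invented for illustration, not the uplink's real code.

```perl
use strict;
use warnings;

my $bot = { name => 'synergy', reactors => {} };

# Evaluate a string of Perl in the context of the running bot:
# $bot is a lexical in scope here, so the evaluated string sees it.
sub diagnostic_eval {
    my ($bot, $code) = @_;
    my $result = eval $code;
    return $@ ? "error: $@" : "result: $result";
}

print diagnostic_eval($bot, q{ $bot->{name} }), "\n";
# result: synergy

# Because it runs inside the live process, it can mutate state too:
diagnostic_eval($bot, q{ $bot->{name} = 'pizzazz' });
print "name is now $bot->{name}\n";   # name is now pizzazz
```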
I think it's a good question. The answer is a boring answer. Well, the original version was written in Perl, and all the little handlers were written in Perl, so what was easy to do was to keep this in Perl. I also really like Erlang, and I really like Elixir, and I think that they're really well suited for this. In fact, in a lot of ways that we didn't talk about: like, any one of those reactors crashing has to be handled by the hub saying, oh, an exception happened; don't worry, I'm going to catch it and recover. And if a channel crashed, you have to figure out reinserting the new instance of the channel into the hub, and what about its pending messages? That stuff's all solved, right, in BEAM languages. But we wrote it in Perl because we write Perl. And I think that if I had said, guess what, everybody, we're rewriting the bot in one week and we're doing it with OTP, we would not have written that bot, and nobody would have bought me a beer that night. Yes, in the back. [Audience member:] You said that async/await is much better than callback hell, and also that the event loop is kind of callback hell with upgrades. So can you expand a bit on how async/await is better than callback hell? You mentioned that there's a definite difference in debugging, but anything other than that? Yes, sure. So the question is: how is it the case that using async/await is practically better than callback hell? Larry Wall says that you can never eliminate the complexity in your program; you can only move it around, right? You can move around the lump under the carpet, but the dust is all still there. And my view is often that what you want to do is take the things that are complicated and obnoxious and pack them into an infinitely dense ball that lives at the center of your program, and everything else is beautiful and living on the outside. I've got one minute, so this is maybe my final concluding remark.
You want to put all the complexity deep, deep down in the middle, and have everything else be simpler and built on that. Callback hell makes the programmer writing the application think about the complexity. And async/await makes Paul think about the complexity. It makes one person cope with it. And I think that is why it's practically superior. Just curious, how many in this room have used Future::AsyncAwait? Yeah, who else has used async/await? Six, seven people? Yeah. It's very good. It's very good. It's got problems, but mostly they don't come up. And I use it every day, because mostly they don't come up. Okay, if you want to run Synergy, that's the URL. It's really good. Don't expect to get technical support. I'm going to change stuff whenever I feel like it. That's it. Thank you very much. Thank you.
The CPAN Security Working Group
It's all right. Am I early or on time? I'm on time. I'm punctual. That's brilliant. So, hello. My name is Salve Nilsen. I'm one of the fellows that hack around with Perl in Oslo, Norway. And last year, I bumped, with some other people, into thinking about security on CPAN. So, stuff happened and I'm going to tell you about that now. It's a little bit of an introduction for FOSDEM, similar talks have been given at other conferences already, and a little bit of an update. So, I hope you can bear with me. So, we were established at the Perl Toolchain Summit in Lyon last year. And the purpose here is to basically fill a void in caring about CPAN security. There are already people who care about security in the Perl community. Mostly they live on the Perl 5 Porters list. But when it comes to the CPAN ecosystem, a couple of us raised our hands and said, okay, we'll try and do something about that. These are the people that showed up at the Perl Toolchain Summit. And a bunch of these are also on the CPAN Security Working Group. So, what's in scope for this working group? There are a lot of people who are interested in the security of Perl. So, we try to do security outreach. That means information work. It's maybe not obvious that's needed, because of course everybody knows how to Google and figure out something. But we try to think a little bit about how to do things that are connected to the security of Perl. So, that includes making sure that important security issues are properly registered as a CVE. That if there is anything that shows up in the CVE index, it is responded to in a good way. And we're not solving the problems. We're helping the people who are involved. For a project, for example, that doesn't have a responsive author, we'll make a little bit of an effort to try to find a replacement, or solve it that way. This is basically what happened with Spreadsheet::ParseExcel and Spreadsheet::ParseXLSX.
And we are super happy somebody stepped up and actually resolved those issues. And we do some coordination with other CSIRTs through the CERT.org VINCE interface. And so, we are trying to build up a network so we can make sure to report things properly and share the information we have and help those people who need help. And there is some triaging and coordination going on there. And the goal here is to make sure that important vulnerability issues are not ignored. So, that's one of the major topics we're working on. We care also about having a good vulnerability index. There are, I think, one or two options right now. This one, CPAN::Audit, I think, has something going on there which is useful. But it needs to be up to date, and we want to help with that and maybe see if we can integrate it with other indexes out there. Furthermore, let's see what's going on here. That was not the point. Okay, the screen is saying hello. Okay, sorry for the technical problems here. It looks like my computer doesn't like the USB-C connection here for a moment. Sorry about that. Okay, let's pull it out and put it in again. That's always how it works. There, sweet. Either it managed to fix itself, or it's the old computer, I'm just saying. So, yes, vulnerability index. We also care about what's called supply chain provenance, which is basically where the stuff comes from and how it became the way it is. And in general, supply chain security. Things that we are working on there. Look here. It's already disappearing. This is a bit annoying. I'll try to continue. We want to make an effort to make sure that all the CPAN clients use HTTPS by default, for example. So we connect securely to the servers that we want to download from. We want to make use of something called The Update Framework, which is used by other packaging ecosystems for securing the whole process of publishing and sharing the modules out there.
We want to introduce repository signatures and author signatures at some point. We, moving on, we have, come on. It looks like I'm having more trouble than is necessary here. This is quite annoying. No. No. No. All right. So we are looking also at, oh, this is the wrong page. Interesting. We're also looking at tracking all the changes that happen to the software. Look here. Using SBOMs, software bills of materials. That's a huge topic, and there are demands on it from downstream, where people running software on critical infrastructure, for example, are now obliged by law to keep track of dependencies and what's going on. And this whole field also includes solving the problem of how to refer to the dependencies across package ecosystems. For that, there's something called Package URL, which is currently in use by a lot of systems and SBOM standards out there to refer between packages in different ecosystems. If all goes well, we'll actually have CPAN as part of the Package URL standard, sometime this weekend, I'm hoping. I talked with the author yesterday at the party, at the conference here in Brussels. And we want to improve the indices in general when it comes to interoperability with other package indexes. Let's see. Since we don't have slides here, this is really annoying. So I'm sorry this doesn't work as expected. Does anybody have a USB to HDMI connector? USB-C. No, no, I need a female HDMI. Ah, okay. Let's see if this helps. Crossing fingers. Because if it doesn't get better now, then it's not my computer. All right. There's something called transparency logs. There is some tooling called sigstore and sigsum that we want to take inspiration from to create transparency around what changes happen on CPAN. So if something is updated without anyone knowing, we want to detect stuff like that.
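For reference, Package URL identifiers are short strings of the form `pkg:type/namespace/name@version`. The npm and Debian examples below follow the published purl spec; the cpan form shown is the author-namespaced shape that was under discussion at the time of the talk, so treat its exact shape as an assumption:

```
pkg:cpan/DROLSKY/DateTime@1.65    # proposed: a CPAN release, namespaced by PAUSE author ID
pkg:npm/left-pad@1.3.0            # the same scheme used by the npm ecosystem
pkg:deb/debian/curl@7.50.3-1      # ...and by Debian packages
```

This is what lets an SBOM for, say, a Debian system point unambiguously at a CPAN distribution.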
We also would love to have a way to do patching of CPAN distributions when an upstream author is completely unresponsive and we have no way of resolving a crisis quickly. So, to publish a patch in a structured way so that, say, for example, a client can detect, oh, there's a patch that is not applied here that we do want to download, or something like that. We'll see how that works. It's a current dream we're having. We do care about compliance and privacy. So, having an idea of what kind of legislation is relevant for us. That's super important, and documenting that stuff is part of it. So we have a reading list already published. We also want to have good tooling for software composition analysis, like finding ways to detect if some of your dependencies have recently gotten a vulnerability or something. So it says, for example, during a test run, oops, there was something happening, one of the dependencies you need to update. There are lots of good ways to do that. There's already some tooling in place actually, but these are things we want to do. There's also the act of project management. So we're taking care of that part, and that means creating a good charter, having a pre-release disclosure agreement that tells us under what terms we can share information or not. And documenting general information around how things are put together as an organization and which place we play in the larger ecosystem. Funding is also an important part of this, because I have to be frank here for a moment. Working on security issues on behalf of others on a volunteer basis isn't always fun. Sometimes it can be horribly boring or frustrating, or just solving problems that I don't have. I imagine this is the same for everyone. So we're looking also for finding ways to actually fund some of the work that we want to do. And there's a whole lot of other stuff we want to do.
And the most important thing for us is that while Perl isn't the super big thing it was 20 years ago, it's still used everywhere in critical infrastructure and in important businesses where money is earned right now. So people call it legacy systems these days, but we have to remember legacy also means earning money. So we cannot just ignore it and say we'll rewrite stuff later or we'll just update. No, no, we need to update stuff now, and we need to figure out exactly what's running to make that happen. We need to enable a whole lot of things using the stuff I already mentioned. And there are also some cultural things worth mentioning. In the CPAN and Perl community, we don't always think actively about security. So we're hoping to be a little bit of a catalyst, over time, to change the culture also. And that means learning new stuff, not only doing DevOps, but thinking also about how to become DevSecOps, or SecDevOps, or whatever it's called, so that security becomes part of how we operate. And in my opinion, we're pretty good at having our own ecosystem where things have worked for a long time, and we know we can trust it, and it's been very predictable. But we're not that good at interoperating across the ecosystem boundaries. Like, say, for example, if you package something in Debian: from Debian's perspective it's, what do we have to do to make whatever these guys are doing work in our environment? When we could have used good standards for communicating dependencies in a machine-readable and common way that works across all kinds of ecosystems. That's a super interesting problem that people are working on right now to solve, and personally I hope we can be part of that. So why do we do this? There are new security demands coming from the EU and from an old executive order in the US.
These are specifically aimed at institutions and companies that write software for critical infrastructure, and that could be anything from power, internet access, street light management, water treatment plants, administrative systems, all kinds of places throughout society. If something breaks, it affects the normal operation of society in a negative manner. That means these two directives apply. The Cyber Resilience Act, which is still upcoming, is more about internet-enabled devices, which basically means anything from toys to phones and all the systems that connect to and update those. So that means everything. So we will be affected. These laws are coming this year and will be rolled in over the next few years. I think it's 18 months or something. So this is upcoming stuff. That means we have the legislative guns pointing at us, basically. We would also love to find ways to show that those of us who publish things on CPAN have our ducks in a row. We have things in order. People can trust the code we publish, and we do what's necessary to make that happen. So there's some awareness raising. We're discussing blog posts and all kinds of other ways to get more people involved in this. Who are we? Brenno, Graham, Inge, Jose, Andreas, Leon, Olaf, sitting there, Pete, René, Sam, Salve, that's me, Stig, sitting there. Tim. Merein isn't here today. And a whole lot of others. These are a couple of the people that were at the Perl Toolchain Summit. I'm there. It's a photo of me where I don't look horrible. That's good. So that's Stig and Inge and Leon and Merein and Brenno. And the reason I say all the names here is to make a point, actually. When somebody talks with you about supply chain security, there are people like this, in the group picture, that are actually working on the supply chain, the bits and pieces that make that up. On a volunteer basis. Meaning humans. It's not like a black box where suddenly stuff appears.
We have to actually think about these as almost like our open source colleagues. We work together with these people. So what I want to do here is to ask you to join us. Do you care about open source security? Do you have some extra cycles, some time to spend? Do you have a manager that is aware that there's a security commons out there that is shared, and that needs to be updated and kept alive and kept healthy? Or would you like to fix security yourself? Please contact us. We need help. We are a bunch of volunteers right now, but we do not have all the time needed. And at the moment we don't have the funding either. So there's that. So to find us: we are on IRC, and there's a link there. You can find all the necessary details on security.metacpan.org. You can also use the security contact address there, and the mailing list where we coordinate stuff is the CPAN security list. It's closed off, but with a little bit of dancing and singing, you can get in there. So, I don't know. We probably don't have time for questions and comments. Two minutes. Two questions. Yes. Three very short remarks. First, I'd love to see a module creating SBOMs, natively. Yes. Working on that. Okay. If you want to help, talk with me. Second, I'd like to have support for this for any of the big frameworks we have in Perl, Mojolicious or Dancer2. We won't do anything on that, but if you want to publish something, go ahead. I've been looking into that a little bit. Okay. Who in this room has a VINCE account? I have one. I like it very much, but please make yourself one. VINCE is a vulnerability sharing system that CERT.org runs. A couple of us have it already. So if you care about security enough to have an account there, you're welcome to join us. That's a very good criterion. But of course, please actually help. We have a lot of people that are bystanders, looking on.
There's something called the bystander effect, where lots of people look at an accident, waiting for someone else to make the first move. We cannot have that. We need people to actually make sure it happens. Having a VINCE account is maybe not enough, you have to put yourself forward and say, hey, we'll take the problem. Yes. There's a whole lot of stuff. More questions? One question. Well, you'd get a different answer from everyone, but for me, it's that we need more people who are actively working at the moment. We have a whole lot of stuff to do, and all of them are good things. I've tried to paint a picture of that today. And if something tickles your brain, then you're quite welcome to join us and make something happen there. If you know something we don't, then please tell us. We're in the process of learning. I'm getting an idea that this is the end, so I will say thank you. I hope this was useful for you. And please get in touch if you care about security on CPAN.
Corinna—Perl's new object-oriented system
Ah, good. So if you're on YouTube, you probably just missed the first five minutes of this. I said nothing. Don't worry about it. So I decided, rather than do what I had done previously, I'm just going to give an overview of all the major features of Corinna for the minimum viable product that we're putting together. So you can have a fairly complete idea in your mind of what's going to happen, because I actually haven't done that talk before, and you probably don't want to go and read a multi-section RFC and all the work we did to put that together. So, since Perl 5, the object-oriented syntax here was just bless and @ISA. There's a little bit more than that, but this is primarily the bulk of it. The model was mostly stolen from Python, and I also do Python programming, so I can see the similarities. Larry regrets stealing it from Python. I can understand why, even though I like Python, he's not wrong. But bless and @ISA, all they do is say: we have methods, and where are those methods? I'm taking the short version of this, because we're not going to spend a lot of time talking about the original version of object-oriented programming in Perl. Because it didn't give you much. Basically, if you wanted everything that you want out of a class-based OO system, then you've got to write your own constructors. You've got a DESTROY method in Perl, but destruction is non-deterministic, so that's kind of a frustration. It doesn't work as well as you'd like. If you want to maintain state, if you want encapsulation, all the sorts of things that you expect to have out of an out-of-the-box OO system, you don't have with bless and @ISA. And everyone had to redo it themselves every single time, and if you're a programmer, you know you don't want to do that. You want to abstract that away. So people have abstracted that away, a lot. It's going to depend upon your definition of what a class is, or support for a class is. Well over 80 modules. This is not an exhaustive list.
I just decided to order them alphabetically. Have fun picking out the one that you happen to like. If you're familiar with the Lisp Curse, or if you're not familiar with it, go put it in your favorite search engine. It will be the top hit, and it will explain how that mess came about and what we're trying to fix. So let me make that a bit larger, because I can't read that. Okay, so not everything that you see here is implemented, and not all of it's going to be implemented, but you do want to see Object::Pad, which Paul Evans put together. That's a test bed for many of the ideas of Corinna, so we can make sure that it actually does what we want it to do. And there are companies who are using this in production. It is so valuable to them. So some of the things you might see will change. It's work in progress, but I think I've tried to strip out anything really problematic. I'll call out the things which are work in progress, but this is pretty close to what we can expect. A simple class. It's very simple. It's not exciting. You create a new Person. Name is Ovid. You print the name: Ovid. Here you give them a title. You print the name. It automatically prepends it with the title. So there's Dr. McCoy. Very simple. This is not complex. On the left-hand side, that's how you're going to do that using bless in old-style Perl. Here's how you do it in Corinna. Note that almost all of this is very declarative in nature. You might quibble on one point. We'll come back to that later. But it's very short, very concise. You probably didn't notice this: that will mean your code's not going to work correctly, because you misspelled the name. It's not even going to give you a warning. It's just going to silently fail. The sort of bugs we love to have, silent failures in code. In Corinna, because that's a lexical variable, field $title, that's going to be a compile-time error if you misspell it. That's Moose, by the way.
Moose didn't gain us a lot. Not true. It does have isa. isa => Str for those various things. You could do a non-empty string for one of them, might be better. We could argue about that all day long. But basically, Moose is not more terse. And it also has a lot of startup overhead. It's not slow per se anymore, but it's not the fastest thing in the world. But it does make writing OO code better. In Corinna, same thing, much more terse, with the exception of the isa. So let's just walk through this so you can understand what's going on. To declare a class, you just say class Person. It used to be that to declare a class, you couldn't. You would say package Person, and then you would bless a reference into that package. And it wasn't really a class or a package. It was kind of this thing. Now they can be separate. We have a future where we can truly disambiguate these things. I might add, you can also do it this way, with the postfix syntax. I prefer this syntax. I will have it on the slides. I argued strongly, as the lead designer, and I thought I could get away with this, that we're going to require the postfix syntax. I lost that fight. Basically almost everyone disagreed with me. So I went ahead and said, okay, we'll go ahead and make this optional. A lot of my examples will use the postfix syntax; it's absolutely not required. So don't stress about it, because I know people gave me grief about it at first, a lot. field $name :param. That is an instance attribute, or instance field, or instance slot, depending upon the language you're coming from. That's just a piece of data tied to the instance after you construct it. Because it has :param, it is required in the constructor. You cannot not pass that, or else it will blow up. Same thing with field $title, except it has the = undef. That means it is optional in the constructor. You do not need to pass it in. Or you can use = 'Mrs.' or something.
You can give it a fake default title if you want to. Anything after the equals, it will just evaluate and run, and that will be assigned as the default value. And then we have our name method down here, where we just access those variables directly. This gives us a chance for a lot of performance benefits. It also tremendously encapsulates this data, something which has been traditionally very, very hard to do with older Perl, because you could always reach inside the object and do stuff. Many languages make it easy to reach inside the object and do stuff. When we eventually get around to implementing a meta-object protocol, you will be able to reach inside the object and do stuff. But we're going to make it harder. The intent is you will be allowed to do it, but when you're doing things you shouldn't do, you've got to put some more effort in there. It's going to be easier to show up on code reviews, or just with grep. Corinna, out of the box, provides constructors, destructors, state, composition, encapsulation, private methods, and so on. The private stuff might actually not make it into the MVP. We won't cover that. But basically, most of what you want out of a class-based OO system is there, in a very short declarative syntax. Just like that, very easy. But there's more than one way to do it. So, I mentioned this is mostly declarative. You see the method down there and you're going, I don't have any way I can change the name and title. Everything by default is pretty much immutable externally with Corinna. So I'm not mutating that. So why am I even calculating it every time? I could just make that a field with a :reader, equals: if defined $title, "$title $name", else $name. And that's computed once and only once, at object construction time. And fields are generally evaluated in the order that they are declared, which makes it much easier to reason about. In Moose, I think it's evaluated alphabetically. No, hash order. Hash order. Oh, sweet.
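Put together, the Person class walked through above looks roughly like this. A minimal sketch, assuming a recent Perl: the class feature shipped experimentally in Perl 5.38, so older Perls would need Object::Pad instead.

```perl
use v5.38;
use experimental 'class';

class Person {
    field $name  :param;            # :param with no default: required in the constructor
    field $title :param = undef;    # = undef makes it optional in the constructor

    method name {
        defined $title ? "$title $name" : $name;
    }
}

say Person->new( name => 'Ovid' )->name;                     # Ovid
say Person->new( name => 'McCoy', title => 'Dr.' )->name;    # Dr. McCoy
```

Misspelling `$title` inside `name` would be a compile-time "requires explicit package name" error, which is the point being made about lexical fields.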
Thank you, Steven, for just making me feel even worse about it. I've long wanted to submit a patch to see if I could fix that, but they've said no more patches. Which is fine, I totally get why. So, because they're constructed in the order that they're found, you now have the potential for deterministic destruction, because you can track that order and unwind them in last in, first out order. I don't know that that will be in the MVP either. Okay, there are only four keywords, by the way: class, field, method, and role. We actually had a lot more originally, and then Damian Conway came along and did a deep dive into the spec. And he pointed out a way we could reorganize everything by having just four keywords, class, field, method, and role, and then attributes to modify their behavior. It tremendously simplified the code, made the logic much easier to follow, made the structure much easier to follow. And now, I apologize, this is a much bigger slide, probably harder for some of you in the back to read. Class Character isa Person: that means we've inherited from Person. Corinna is single inheritance only. You'll notice there are a number of OO languages out there which allow no inheritance. Some of them allow only single inheritance; they almost invariably give you a way to work around that, such as interfaces or mix-ins or something else. Or you can do that with delegation, and delegation is much more powerful than people think, but this is not a talk about that. So I've now declared this class. And you'll notice I have an _defense for my reader. I don't have readers or writers for anything else. Reader means that you can call $target->_defense and read that value. There's something called trusted methods, where you want methods to be callable by other classes, but you don't want people outside to be able to call them. We have done a lot of bike-shedding on how to get there, and it's not gonna happen anytime soon.
So for now, I punted and thought this is a reasonable compromise. We use a familiar Perl syntax for saying _defense. That is, think of it as a trusted method or a private method. And as a result, you can call that, and people outside know not to. Notice the only methods we have public are is_dead, adjust_hit_points, and attack, because you want your interfaces to be as small as possible. Because later on, if you have to change your interfaces, you're stuck if you've exposed everything publicly. So, Corinna by default forces you to add the :reader and :writer attributes to fields, because you have to choose, you have to opt in to making your contract public. Whereas with Moose and Moo and others, the default is everything's public. And if you want it private, too bad. And we have this constrain function. I'll talk more about subroutines being imported. But basically, constrain is a function. Again, this is something I don't think we're gonna get to in the MVP. The intent is methods and subroutines are not the same thing. You should not be able to call a subroutine as a method. You should not be able to call a method as a subroutine. And you can disambiguate them even if they have the same name. But that's just something to think about for later work. So, we did our subclassing, there's a little Darth there. And we create a new Darth Vader object, a Captain Kirk object. And while not Kirk is dead, Vader beats him with his lightsaber, until Kirk is dead. It's very simple, it's easy. It works, yes, Vader will kill Kirk. I'm sorry, I went from Star Trek to Star Wars. But in this case, yeah, Vader, yeah, he wins. Very simple, very easy, and when you get down to it, there's nothing really complicated about the code. It's simpler, it's easier to write, it's well encapsulated. But I want to talk about constructors a little bit, so you understand some of the design work that we put in here.
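A sketch of the subclass being described, in the syntax from the talk. `:isa` and renamed `:param`s work with Perl 5.38's experimental class feature; `:reader` on fields is newer (or available via Object::Pad), and the field names, defaults, and damage rule here are invented for illustration.

```perl
class Character :isa(Person) {
    field $_defense   :param(defense) :reader = 10;  # leading underscore: trusted, not really public
    field $hit_points :param = 50;

    method is_dead { $hit_points <= 0 }

    method adjust_hit_points ($amount) { $hit_points += $amount }

    method attack ($target) {
        # invented damage rule, just to show the small public interface in use
        $target->adjust_hit_points(-10) unless $target->is_dead;
    }
}

my $vader = Character->new( name => 'Vader', title => 'Darth', defense => 20 );
my $kirk  = Character->new( name => 'Kirk',  title => 'Captain' );
$vader->attack($kirk) until $kirk->is_dead;
```

Only `is_dead`, `adjust_hit_points`, and `attack` form the public contract; everything else stays encapsulated unless deliberately exposed.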
A lot of it we argued about; I think it took like three years of arguing to finally get to something we could agree on. So, we have key-value pairs, named arguments to the constructor: name, title, and offense. And it is absolutely required that you do that. You can create an alternate constructor if you want, called new_unnamed, and have it delegate off, but we do this for readability. And there are also some other benefits. So right now, here's a constructor in Java: Character vader = new Character(...). And if you didn't know what those arguments were, it might not be clear what you're constructing. And in fact, if you've got optional data for your constructors, you have to create multiple constructors. I won't go into details, but you might have to create multiple, multiple constructors in this particular example, or use a HashMap and extract it manually. It's a pain. In Corinna, you don't have to do that. You have a declarative specification at the top of your code: here's how our instance data works. So, writing the manual constructor in Java for a car, that's actually very readable. It's very easy to read. Calling it is not. I just looked at the code. I wrote this code, and I don't remember it. I don't know what those numbers necessarily mean. So, that's why we try to avoid that. And in Perl, we have named arguments. Yes, you have to do a little bit more typing. This is for maintenance. You absolutely want to make it easier to maintain your code. And it's gonna kill you a few times, and you're not gonna be happy about this, but you'll get used to it, because it's gonna become natural, I hope. So here, that's not the Character class. That's the Person class. And we've passed in offense. Offense is not defined as one of its :param fields. So that's gonna die. And I've heard people argue, well, I should be able to pass in extra data. Maybe my subclass will use it, or there's some other way I can handle it.
Yes, there are other ways you can handle it, like every other language does: provide something which is actually going to properly capture that. But the real reason is, remember, title is optional. So if I misspelled title, it would think it's simply optional data. Now, because it's mandatory that you can't pass in anything which is not known to the constructor, that is going to be a fatal error. And it's a very hard to detect bug that you don't have to worry about anymore. If you want to pass in extra optional data, make a parameter called extra: extra :param = hashref. And then just allow them to muddle around with that. It's much cleaner. Moose allows you to pass in a hash ref instead of a list. We do not do that. We want one way of calling the constructor, because it's just simpler. This also preserves the ordering of those, in case that becomes necessary in the future. Also, with a hash ref, any duplicate name in the hash ref will collapse over a previous one, which is kind of annoying. There are ways you can work around that if you actually want this behavior for setting defaults. But we decided this was the safest way to go, to have one and only one way of calling the constructor. Thank you. So, I didn't talk fast enough, apparently. Here, field $name: in both of those, those are lexically scoped. There is no conflict anymore. Whereas with bless, if you had a name key in your hash ref, but your parent did too, you're going to blow up. Here, it's completely encapsulated until you expose it. Now when you expose it: I have :param on each, and I now have two param methods, and that's going to blow up. You can't override params. We might relax that later. You can override methods. Sorry, methods automatically generated by param, or, sorry, field, and other things. I got ahead of myself. Never mind. So I can do this: :param(char_name).
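The "unknown constructor keys are fatal" behavior can be shown directly. A sketch, again assuming Perl 5.38's experimental class feature; the misspelled `titel` key is deliberate, and `Widget` is an invented example of an explicit grab-bag parameter.

```perl
use v5.38;
use experimental 'class';

class Person2 {
    field $name  :param;
    field $title :param = undef;
    method name { defined $title ? "$title $name" : $name }
}

# A typo in an *optional* parameter can't silently become "extra data":
my $obj = eval { Person2->new( name => 'Ovid', titel => 'Dr.' ) };
say $obj ? 'constructed' : "died: $@";   # dies: unrecognised parameter

# If you genuinely want a grab-bag, declare it explicitly:
class Widget {
    field $extra :param = {};            # callers pass extra => { ... }
    method extra { $extra }
}
say Widget->new( extra => { colour => 'red' } )->extra->{colour};   # red
```

The typo that Moose-style constructors would quietly swallow becomes an immediate, loud failure.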
That means you now pass that to the constructor as char_name, and there's no longer a conflict with the parent class. Your parent and child classes should always be able to trust their internal implementation, always. So when they hit an external implementation, they're making a contract, and then they've got to negotiate and find out what works. Here's another example. Those are also going to blow up. That's the case where we're actually generating methods, but we cannot override those directly. You can create your own little stub method if you want to override it. Again, you can rename those in order to allow that to be safe. Class data: field $num_characters :common means this is class data. You can also slap :common on a method and call that a class method. ADJUST is called after the object is constructed, or actually it's called when it's hit, sorry, Paul. Is it called when it's hit, or after the object's constructed? It's called when it's hit, right? ADJUST is run as part of the constructor, yeah. Okay, DESTRUCT will run when the object is destroyed. So here I can track how many character instances I've created. It's very simple, works naturally in the language. And then I have another class, my World class. I can figure out the difficulty of my world. I've got my class method available. I can figure out how many characters there are, and I can tell them how difficult the world is. Again, it's stuff which is now built into the language, and you don't have to worry about that anymore. Is there anyone here who does not know what roles are? Okay, just in case: roles are kind of like mixins you'd find in Ruby, or interfaces with default implementations you'd find in other languages. And these allow you to take a small subset of behavior which doesn't necessarily belong to a class, a specific class, and move it off to its own role. And then you can compose it into the class. And then you will get that behavior.
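The instance-counting idea with :common, ADJUST, and DESTRUCT looks like this in the Corinna notation used in the talk. This is a design sketch rather than guaranteed-runnable code: ADJUST is in core Perl's experimental class feature, but :common class data and DESTRUCT were, at the time, spec-level features (Object::Pad implements much of this, with support varying by version).

```perl
class Counter {
    field $num_instances :common = 0;    # class data, shared across all instances

    ADJUST { $num_instances++ }          # runs when it's hit, as part of construction

    # DESTRUCT { $num_instances-- }      # per the talk; may not make the MVP

    method how_many :common { $num_instances }   # class method: $class available, no $self
}

Counter->new for 1 .. 3;
say Counter->how_many;
```

Because ADJUST runs for every construction, the count stays correct no matter which constructor path created the object.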
However, those methods are flattened into the class directly. There are no tricks with inheritance, no funky dispatch or anything like that — it's actually in the class. So, method as_hashref: this is what we call a forward declaration, because it doesn't have a body. Anything with a forward declaration is required to be implemented by whatever is consuming it. It can be implemented by the class directly, or, if the class consumes other roles, another role might implement it. And then to_json — here's another example where we want to get to the point where we can disambiguate. This is probably a terrible example, because you don't want to confuse those. But the reality is you should be able to call those separately and have them work correctly, even though you probably shouldn't name them the same. It gets you some safety in the code and avoids the odd case where you call a subroutine as a method — and believe me, I've hit that before. And $self is injected directly into the method; you don't have to declare it in your signature. Along with $self, you also get a $class variable, which is the class name of the invocant. If you have a colon common attribute, that means it's a shared class method, which means $self will not be available, but $class will. And again, those will fail at compile time if you get them spelled wrong. Which means if you declare something as a class method with colon common and you're trying to access $self in there, that should be a compile-time failure. You don't want to use this code, but here, field $cache — once again, my implementation should be able to trust its internals, so nothing else actually gets to see the $cache that I have declared in my role. You don't want to use this because it would only work if you could guarantee your objects are immutable, and you can't. So you actually probably don't want to cache those.
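The forward-declaration idea might look like this in the proposed Corinna syntax (a sketch only — this is not yet runnable in core Perl, and the Serializable and Character names are invented for illustration):

```
role Serializable {
    method as_hashref;            # forward declaration: no body, so every
                                  # consumer must provide an implementation
    method to_json { encode_json( $self->as_hashref ) }
}

class Character :does(Serializable) {
    field $name :param;
    method as_hashref { { name => $name } }   # satisfies the role's requirement
}
```

If a consuming class (or some other role it composes) never supplies as_hashref, composition fails, rather than deferring the error to the first call at runtime.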
But this is one way you can have of accessing data inside the role which you don't share with others. Using a role is pretty simple. There's my Serializable role — this one just does JSON. My Character class is-a Person, does Serializable. All I have to do is define an as_hashref method, and hopefully, when it's called up there, it will properly serialize into JSON. I did a lot of hand-waving there, but that's basically how it works. If you're familiar with roles, it's what you expect out of roles. So here are the various attributes we have. Class attributes: we have isa and does. Isa, again, is single inheritance; you can put one class in there. Okay, great, I've got plenty of time. Does, however, can take a comma-separated list of roles. If you're familiar with roles, there are ways you can exclude or alias methods. We don't actually provide that syntax here, because we argued too much about how to make it work, and we just punted on it. I apologize. Role attributes: a role simply does. Role Serializable does some other role, whatever — maybe it does a YAML role, a JSON role, and a TOML role, and can serialize all those different things if it's given the right data structure. Quite possibly it cannot, but that's how roles work. Roles can consume other roles. And we do want to make sure we preserve the commutative and associative behavior, so you can mix and match roles any way you want, in any order, in any combination, and it should work correctly — unlike with inheritance and mixins, where if you shuffle the order, you have no guarantee your code's going to work anymore. Field attributes — this one's a little bit more involved. Reader, or you can rename your reader. Writer automatically prepends the name with set_, because we're disambiguating between the reading and the writing; there are reasons for that, dealing with return types and not being able to overload things properly.
And also wanting to discourage people from writing mutable objects, while making it easy for them to do so if they wish. But it's available there. Param: whether or not it's available in the constructor. Weak: to create a weak reference. Colon common means it's class data. Method attributes: does it override a parent method? If you want a method to be abstract in your parent class, just declare it as method, method name — do not use a signature and do not provide a method body — and it's automatically an abstract method. It must be overridden in a child class, or with luck it will be a compile-time error. Common, so you can have a class method which does not inject the $self variable. Around, before and after are the standard method modifiers. To be honest, I wish we had gone with something like — sorry, folks — Python decorators, because they're so much easier to use. But that would require changing how attributes are parsed and handled, because right now the data inside the arguments to an attribute is just a simple string; it can't be parsed or run effectively. There's some discussion — I think Paul has been handling some of that — about how to maybe change that in the future. Some of the things we have already written with just the very beginnings of Corinna: we have Stella, an actor model for Perl. An actor model basically means that if you have a box of toys, they know how to play with each other — you don't have to play with them yourself. That's the simple explanation. What's that? Okay, thank you. I'm very curious to see that. We also have ELO: cooperative message-passing concurrency — event loops, actors, promises. That one looks like a lot of fun. That's also done by Stevan. You don't like that? Okay. These are some of the early prototypes we've been building with this. I used Corinna a lot. This is a rogue-like tutorial that Chris Prather has been putting together. You've seen Rogue before, most of you.
I elided some of those, but it's basically parts one through six; he hasn't done more than that. What amazed me is that I thought we would have to have much more of Corinna built for it to actually be useful. I was wrong. Even a very tiny, properly designed subset of a class-based system works very well and is very powerful. I was really surprised by that. It also might force you to use composition and delegation more often, and trust me, that's your friend. I won't go into it right now. And I'm sorry, that was very fast. It was an overview. It was probably one of my least exciting talks, but I wanted to have something I can refer people to and say: look, here's a short overview, if you want a video instead of reading the RFC or something like that. The actual RFC is on GitHub in the Corinna repository. I'll put this up on SlideShare. There are the seven stages which are referred to in that MVP of what we're trying to implement — unknown timeline as to when it's going to be done. It's already much more powerful than I thought; really surprised by that. There's lots more to be done. If you want to see this, the single best thing I think you can do is download it, compile it, start playing around with it, send bug reports to Paul, give feedback, write tests for it, write documentation for it. We need that, because conceptually it's very small, but under the hood there's a lot of stuff which has to happen to make it work. And anything you can do to help Paul take some of that work off of him means we will get it out there faster. Does anyone have any questions? No — yes, sorry. Please speak up, by the way; I'm a bit hard of hearing. (Question:) You mentioned the overrides attribute. What happens if you have a base method and a derived class method with the same name, without the overrides attribute?
Right now I think that should be — sorry: what happens if, in a subclass, you're overriding a method which already exists in the parent with a body? One thing is that a parent class generally should not know who or what is subclassing it. It shouldn't have to know that, if at all possible, because that winds up coupling it too tightly to the subclass. And as a result, if we try to put any sort of annotation on the parent class saying this is subclassable — we might want to allow a final attribute on something so you can't override it — but we had to get an MVP out there. So right now, if a method body is defined and you override it in a subclass, adding the override tag is good. And I would like there to be a warning if you override something and you don't have the override tag. Or if it's an abstract method and you don't override it, then it's fatal. Or maybe if you override and you don't have the override attribute, it should be fatal — but we can punt on that. Any other questions? (Question:) Can the role's required methods have a method body? If it's a required method in the role, it cannot have a method body. There are ways you could work around that. You could create a default method which has a separate name from the required method. And inside of your methods — no, you'd still have to have the other method required. (Audience suggestion: the yada-yada operator.) Oh, I forgot about that. So basically you make a method, and the body of the method is just dot-dot-dot, which is the yada-yada-yada operator, which was added — I don't know when. It's been around forever. And all it does is blow up when it's reached; it dies with an "Unimplemented" message. But it's very useful for that. Yeah, that might work. Any other questions? Or do we still have time? Two minutes.
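The yada-yada trick mentioned above fits in a couple of lines; this runs on any Perl since 5.12 (the subroutine name is made up for the example):

```perl
use strict;
use warnings;

# A stub: '...' is a syntactically complete body, but it throws
# an "Unimplemented" exception if anyone actually calls it.
sub as_hashref { ... }

my $err;
eval { as_hashref(); 1 } or $err = $@;
print $err;   # Unimplemented at ...
```

This is handy for sketching out an interface before writing the bodies: the code compiles, and any accidental call fails loudly instead of silently doing nothing.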
(Comment from the audience:) You were exporting stuff — lexically exporting subroutines. I've been using it and it's been working quite well with Corinna, and it doesn't seem to conflict. It removes the symbol afterwards, so it's bound but not callable. Yeah — in the builtin package there's an export_lexically, right? You put that inside your import, and you can export things lexically, and then they're in an entirely different scope. Nice. Okay, I very much like that. I'll show you. Actually, talk to Paul, because he's the one who's going to be doing some of that. What's that? Wait 20 minutes and I'll be talking about it. Okay. One last question? Okay. Thank you very much. Thank you.
Updates from the PSC
All right, last session: Paul Evans, working hard behind the scenes to make sure we have Perl 5.36, 5.38 and onwards. What's it going to be — Perl 7? Let's call it, I don't know, 100. Who knows? All right. Man, I'll be dead by then. All right. Hello, welcome. So this is updates from the Perl Steering Council. A bit of history first. We've had yearly releases of Perl for a very long time now. 5.32 — that was out in 2020, in the middle of the summer. And then every year or so, like clockwork, we've had new releases. This is a thing; people maybe don't realize this. Some recent changes we've had: in 5.32, we added the isa operator. That was kind of cool. In 5.34, we added try/catch syntax. These are some new things we've had. 5.36 was a lot of new stuff; we added loads of things. A brief list here. First big headline thing: stabilized signatures. So finally, that nice little signature syntax is now a stable part of the language. You can just use it; you don't have to fiddle with the @_ array anymore. It's very, very nice. We added this multi-variable foreach mechanism. Come in, come in. So if you want to iterate over multiple variables at once out of an array, for example, you can just pull multiple of them and it works. It's especially nice for iterating over hashes. So you have a hash here: you get each key and each value inside the body of the foreach loop. It's wonderful. I love it. What else have we got? We've got defer blocks. So you use feature 'defer', and now you've got this defer thing here, so you can put a piece of code in it. If you're familiar with Go, this is not like the Go one. If you're familiar with any other language that has defer, it's exactly like that. In Go, they decided that defer blocks would always push onto a stack, and then at the end of the function it would run the blocks. Whereas every other language said, no, that's kind of crazy.
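The lexically scoped defer behavior being described can be sketched like this (Perl 5.36's defer is experimental, hence the warning suppression; the @order bookkeeping is just for the demo):

```perl
use v5.36;
use feature 'defer';
no warnings 'experimental::defer';

my @order;
{
    push @order, 'start';
    defer { push @order, 'cleanup' };   # queued to run when THIS block is left
    push @order, 'work';
}
push @order, 'after';

say join ' -> ', @order;   # start -> work -> cleanup -> after
```

Because the defer is tied to the enclosing block rather than the whole function, the cleanup runs as soon as the block exits — by falling off the end, by return, or by an exception unwinding through it.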
We'll just do it lexically scoped. So you have a defer, and then you get to the end of the block and it runs. Every other language does it that way. Even C — some people are discussing adding defer to C — because if you don't have this crazy per-function stack, you can do it mostly statically in the compiler; it's just kind of shorthand. And every other language does it this way. I don't know why Go does it its own weird way. It's a bit weird. Anyway, so we have defer blocks. And you can put finally blocks on try/catch as well. It's basically the same as a defer, but people seem to expect that if you can do try/catch, you can do try/catch/finally. Okay, we added it, fine, whatever. Another thing we added in 5.36 is this builtin namespace. For years and years and years, if people wanted things like weaken and blessed and refaddr and so on, they'd have to get them out of Scalar::Util, which is another module you'd load off the file system. It's a bit annoying. These are now built in to the language. You don't have to use anything; it's just right there, always available. But if you want, you can import it as well. So for example, we have this nice indexed function that plays very nicely with the multi-variable foreach. This indexed — you give it a list of things, and it gives you back a list that's twice as big, where the first value is prefixed with zero, the second one is prefixed with one, and so on and so forth. So if you're iterating a list out of an array, at every element you can see the index of that item in the array. It's really, really nice. And it's built into the language. The use builtin here is really just telling the parser, for this scope, that I want this indexed word available — but it's built into the interpreter, always available. And — who was it? — people were talking about lexical imports earlier. These builtins are lexical.
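Here's a small sketch of builtin::indexed feeding the multi-variable foreach, both from 5.36 (both were experimental in that release, hence the warning suppressions; the word list is invented for the demo):

```perl
use v5.36;
no warnings 'experimental::builtin', 'experimental::for_list';
use builtin qw(indexed);

my @words = qw(zero one two);

# indexed() interleaves 0, 1, 2, ... with the elements, and the
# two-variable foreach peels them off a pair at a time.
my @pairs;
foreach my ($i, $word) (indexed @words) {
    push @pairs, "$i:$word";
}

say join ', ', @pairs;   # 0:zero, 1:one, 2:two
```

The same two-variable loop works directly over a hash (`foreach my ($k, $v) (%hash)`), which is the key/value iteration mentioned earlier.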
So basically what that means is: you've just written some code here, and you can see indexed, but it's not putting that word into your package. So if you're writing an object class, you don't get the word indexed visible as a method; it's not visible from outside. It's only visible within this scope. Really nice, really handy — an excellent way of working. So these builtins are very nice. Alongside the builtins, we finally, finally have actual Boolean values. C originally didn't have Booleans either, and then eventually, in C99, they realized: whoops, we should have Booleans. It's taken us until 5.36 to realize we should have Booleans, but we now have them. So we've got this builtin true and false. Look at that: my $t = true. Guess what that does? There won't be a prize; it's not that subtle. But specifically we have this is_bool test. So you can ask: here's a value — is it Boolean or not? So 1 and the string '1' — well, they're not Booleans — but this real true really is a Boolean. That's kind of handy. It's particularly handy because things like Data::Dumper know about it. So if we print this array here — we've got 2 plus 2, 2 concatenated with 2, and 2 == 2 — well, that gives you the fairly obviously expected 4 and the string 22. But it also gives you this !!1. That's not very nice, but the reason for it is that everyone uses Data::Dumper wrong. Data::Dumper's one goal is to output valid Perl code. It doesn't know that it's trying to output debugging values for humans to read; its sole purpose is to output valid Perl code. And it doesn't know that you might be loading its output back into an older Perl that doesn't know what the true keyword is, for example. So it's going to print !!1, because it has no other choice. This is really more a plea: please stop using Data::Dumper for human debugging. What you want to use is something like Data::Printer.
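Before moving on: the distinguished-Boolean behavior above can be seen in a few lines (5.36; the builtin functions were experimental in that release, hence the warning suppression):

```perl
use v5.36;
no warnings 'experimental::builtin';
use builtin qw(true false is_bool);

my $t = true;

say $t ? 'yes' : 'no';                    # true behaves as 1 in boolean context
say is_bool($t)     ? 'bool' : 'plain';   # a real Boolean
say is_bool(1)      ? 'bool' : 'plain';   # an ordinary number is NOT a Boolean
say is_bool(2 == 2) ? 'bool' : 'plain';   # comparison results are real Booleans
```

That last line is why serializers can do the right thing: the result of a comparison carries Boolean-ness, so JSON::PP can emit true rather than 1.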
Data::Printer is specifically designed for outputting pretty things to humans. This slide doesn't show it, but the output comes out in color: it colors the strings and the numbers and the keywords and the surrounding shapes all subtly differently. It looks really nice on the screen, it's lovely, and Data::Printer is so much nicer. If you're debugging stuff for humans, use Data::Printer. So thank you to Breno for implementing that. It's true, it's so good. That's not all. JSON::PP also knows about it — yeah, there we go, JSON::PP. You encode this very same array, you get 4, the string 22, and true. The JSON::XS version — the maintainer is still looking at it. Last time I looked, about three or four days ago, it hadn't been merged yet, but he's working on it. Hopefully that'll come soon. The YAML modules — is Tina around? Tina was around earlier. Yes, hello Tina, thank you. This slide is for you, Tina. Look at that: 4, the string 22, and true. They all do it. So thank you, Tina, for doing that one. So yeah: real Booleans. Use them, use them, they're nice. Moving on to 5.38, the newest one that's currently around. Somebody wrote this class thing — I don't know if you've heard of it. Have you heard of it, Ovid? Ovid obviously talked quite a lot about this class system earlier, so I'll just go through it briefly. Here's a small piece of code that you can write to implement an object class. Here we have these points, and they have some values. They're great; you can have another point. It kind of behaves in the obvious way you'd expect from looking at the code. There are several things about this that I want to point out, again covering similar ground to Ovid earlier. There's a lot of low-level stuff that this thing just does for you. You don't have to write sub new anymore. You don't have to write a bless — sorry, wait for that noise outside to finish.
You don't have to write a bless expression anymore. You don't have to call accessors to get at your instance fields; they're just accessible directly as lexical variables. They're nicely there straight away, and you can just use them. Specifically thinking of Java programmers and Python programmers in particular: this slide is for you. You write a class, you declare that it has some fields, x and y, here are the default values, and that's it. Nowhere did I have to unpack self.x = x, or args.x or whatever, and work out: did they pass a value in? Take the argument, otherwise take a default value. No — you don't have to do any of that. Here's a method. I've straight away got access to the local fields, and I've got the self. And notice that I didn't have to put $self in the signature here. I didn't even have to shift self, old-school style. I was writing some Python class code lately, and I kept forgetting to put def method(self, ...). Why would I put the self in the parameters to the method when I don't put the self in there when I'm calling the thing? As soon as you start getting used to using method, you forget about taking self as an argument; it goes out of your head. It's again nice and neat and lovely, and it just takes away things you don't have to think about. More things you don't have to think about anymore: as I said, signatures — we added these in 5.36. So here's an example of a subroutine with a signature, and here we're taking an optional parameter. This one's fine — I'm shaking all over the place. This y here: if you don't pass in a value for that y, you get this default. So here we have an x of 20, and for y, well, you just take the default of 10. That's all very well, but the way these work inside, if you specifically pass in an undef, well, you've passed in a value, right? But that's probably not what the author of this code really intended.
So it kind of breaks a bit if you pass in an undef. It gets a bit worse if you're just passing in variables, because now you'd have to check carefully: is that variable defined or not? And if it's not defined, then I'll just not pass it in. And it's messy to write code like that. So, new in 5.38: you can now use the defined-or assignment operator (//=) to declare your signature parameter default. Internally it behaves much like this, where you look at whether the value is defined, rather than just whether it was passed. So, as you'd expect, you pass in one value and you just get the default. If you specifically pass in an undef, Perl goes: ah, you've passed in an undef — that's the same as if you hadn't passed it in at all, I will take the default — which means that passing in two arguments is a lot neater. I have another talk where I go into more detail about what's in 5.38, and I point out that if you had, say, five parameters to your function and four of them were optional, you literally couldn't do it without this operator, because you can't just not pass the middle parameters and still pass the last one — you'd have to pass in an undef. So suddenly, with this operator, you can have those middle ones missing and still put in a value at the end. It makes possible the kind of thing you literally couldn't do before. Pretty much any time you're using default values in a parameter, to be honest, you probably want this defined-or form, because specifically passing in undef is almost never a thing you want to distinguish from not passing a thing at all. So that's quite nice. And these two things combine together quite nicely. For example, when you have a class and you have some default values on parameters, you can of course just use the defined-or operator there.
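The //= signature default described above looks like this in practice (Perl 5.38; the point subroutine and its values are invented for the example):

```perl
use v5.38;

# $y's default kicks in when the argument is missing OR explicitly undef.
sub point ( $x, $y //= 10 ) {
    return "($x,$y)";
}

say point(20);          # (20,10)
say point(20, undef);   # (20,10) -- undef takes the default too
say point(20, 30);      # (20,30)
```

With a plain `=` default, the second call would have produced "(20,)" with an undef-in-string warning; //= makes "pass undef" and "pass nothing" behave the same, which is what callers almost always want.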
So once again, it means that things like this — where you're constructing an object by just passing in whatever values you have in variables — if those happen to be undef, it wouldn't matter. It would say: okay, I'll just apply this default of zero. So that's all very nice and handy. Other new things in 5.38: we have these pluggable infix operators. For quite some time we've had pluggable keywords, which is how a lot of the weirder syntax modules — like Syntax::Keyword::Try and Future::AsyncAwait and Object::Pad — work: through this keyword mechanism they can tell the parser, I want to implement a whole new keyword, give me control for a bit. So, new in 5.38, we've added support for doing a similar kind of thing with infix operators. That means we can have even more CPAN modules to experiment with things that might become new syntax in Perl at some point. And we've got a few things to play around with. People always ask for an in operator, and I've explained in great detail why that's not as easy as it sounds, but there are a few examples there. And there are things like a zip and a mesh, and a few other modules. But for example, this one in particular, equ, is a nice behavior that at some point we might add into core Perl. It behaves very similarly to the normal string eq operator, but it knows that undef is different from the empty string. So this is really cool. And here it's literally this new infix operator. You use it very much like eq: given two strings that are either the same or different, it tells you about those, but it knows that undef is equal to undef, and that undef is not equal to the empty string. And in none of these cases will it print a warning. That's quite often the sort of thing you want. It's slightly nicer than using eq and defined tests all the time.
And this is exactly the kind of experiment that's really useful to be able to test on CPAN first and say: hey, do we like it? We'll use it in a few places, go along, and maybe decide eventually — yes, we'll put that into the language — or maybe not. It kind of depends. So, one thing you might have noticed from pretty much all of these examples so far is that every single one of them starts with use v5.36 at the top, or use v5.38. There's a reason for that: the use VERSION mechanism. It allows you to configure, effectively, the language from the very first line of your source code. So rather than you just deciding this is the version of Perl I want to use today, your file says: I want to be 5.36, or 5.38, or whatever. It's a thing we've always had in Perl, but people haven't necessarily used it as much as they should. And I keep trying to point out how good and how useful it is, and why you should do it all the time. Because, for example, it implies a feature bundle. So if you say use v5.36, you get all of the features that were enabled in 5.36. Rather than having to ask for all of these things individually, if you just write use v5.36, you get all of this good stuff — like say and signatures, and maybe some of the other ones are good as well, but those two by far are the ones I just tend to use all the time. Everything is just say and subroutine signatures. So those are all very nice. But it gets better. It's very similar to when you compile some C code: you tell the C compiler which version of C you want to be using. It means that just because you've installed a new version of GCC, if you don't tell GCC that you're compiling C99 code, well, you can still compile C89 code or whatever. Just because you've updated your compiler, you can still compile old programs. It's even better than that, because it doesn't just apply to a file — it applies anywhere. You can just put a use VERSION inside a block.
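The feature-bundle effect just described can be sketched in a few lines (the greet subroutine is invented for the example):

```perl
use v5.36;   # implies strict, warnings, AND the 5.36 feature bundle

# 'say' and subroutine signatures come from the bundle -- no separate
# 'use strict; use warnings; use feature ...;' boilerplate needed.
sub greet ($name) { return "hello, $name" }

say greet('FOSDEM');   # hello, FOSDEM
```

One line at the top replaces the traditional three-line preamble, and it also pins the file to a known language level, so a future Perl adding new keywords can't break it.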
And you can say: inside this block, I want to behave as if it was 5.36. But I'm not going to put use v5.36 for the whole file, just because I still happen to have some older code here — for example, this thing using prototypes. I don't want to turn on signatures here. So rather than fixing up my entire code base in one go to work on 5.36, I'll just do a small bit here today, and then maybe tomorrow I'll do a bit more, and so I can incrementally update to using the new stuff. It gets even better than that. Not only does it imply a feature bundle, but ever since 5.12, it turns on strict. So, any time you write use strict at the top of your file — you always do that, right? — you can instead just put use v5.12 and you've already got strict. Oh, and you've got all the features. Oh, but new in 5.36, we added warnings. So where at the top of your file you would write use strict; use warnings — by the way, you should always do that — you don't have to. You can just put use v5.36, and now you've got strict and warnings and all of those features. So it's really, really nice. It gives you your choice of the latest features, and it means that we can maintain compatibility of the language. We can add new stuff to Perl. You noticed 5.36 added a lot of keywords like try and defer and so on. If you don't write use v5.36, you don't get those. But that's fine: it means that if anywhere in your code you had something called try or defer, well, we haven't broken it. We can add new stuff to Perl without breaking your code; all you have to do is put use v5.36 or use v5.40. What that means is: we can update Perl without breaking your code, and you can update your Perl binary without breaking any code. Hands up if you've ever installed a new Perl and something has broken. Interesting. That means we failed. A few years ago? Yeah, yeah — the really early ones sometimes didn't go so well.
But more recently — so, for example, I think it was about last month or so, I updated a bunch of stuff on my email box and all of my email scripting stopped working. I looked into it, and I discovered that actually there's a little bug in procmail now; something in procmail had changed that meant a piece of Perl code I wrote over 15 years ago was not being invoked properly. And so all of this stopped working. But the script that I wrote 15 years ago for handling all my email works perfectly fine to this day. I haven't bothered touching it; I'd almost forgotten that I wrote it. It just works. And it's all because of this use VERSION mechanism. So when people say, oh, why do I have to put use VERSION or use feature or whatever to turn on new stuff — this is exactly why. It means we can update Perl, and you can update Perl, and not break your stuff. But it means you have to ask for new things. Speaking of asking for new things: I've been mentioning that a lot of these things are quite experimental. So, some terms here. Stable means it's long-term guaranteed. What that means is: if we put something in the language and we say it's stable, that means in a decade's time, in two decades — like, all of the stuff we're talking about now has been stable for the last 20 years, and it's all the stuff that, if you update your Perl, you don't have to think about, because it's there and stable and working. Experimental simply means a lack of that guarantee. All experimental means is: we don't guarantee that this will still work in 20 years' time. But it's no worse than random stuff you downloaded anyway. If you install stuff off GitHub or CPAN, or in other languages from npm or PyPI or whatever — if you just download it and the author says, oh, actually, next week I've changed my mind, it's going to work some other way — that's the same level of guarantee we're talking about here.
So don't be afraid of experimental. We're not saying, oh, it's crazy, it might break and blow up your code. That's not what we're saying. What we're saying is: if you use it now, we don't guarantee it'll still be around next year. But maybe it will. It's not about whether it works — we know it works. We have lots of tests. Things don't get merged at all unless they actually work. Things like the object system and try/catch and all of this lot: it works, we know it works, people use it in production. The question is: do we like it? And that means you — do you like it? If people come back and say, yes, we like this, this is great, then wonderful, we'll take the experimental tag off. But nobody comes back and says, hey, we've used this, we like this. So how do we know whether we should commit to it? There are things — literally this week — that we've been staring at to do with lexical subs, where if more people had been using them over the last eight or nine years since they were made non-experimental, we might have encountered a problem sooner and said: actually, yeah, that's a bit of a design flaw. Whoops, that's a shame. But hardly anyone was using them, so we didn't know, and now it's a little bit late to change them. So this is a request — this is the one takeaway from this talk; if you learn nothing else, learn this: please use experimental features. Not necessarily in your "I still want this to run in a decade" production code. But if you're writing some small little test thing that maybe is only going to last for today or a week, or you're just grabbing some data and mangling it and fiddling around with it on your laptop, and you're going to throw away the script after lunch anyway — please play around with these experimental features. We're not saying they don't work. What we're saying is they might not exist next year. But if you're writing some code that won't exist next year, who cares? So please try them out. So with that said, what are the current experiments?
Well, we've got try/catch. That's still a bit experimental because, ideally, when you catch an exception I would like you to get more information out of it than just the string of what the exception was, so we might expand a bit on that. defer is experimental; there are a few reasons for that, to do with what happens if you throw an exception while you're deferring, while you're unwinding another exception: you get this kind of double-exception collision thing going on, which is a bit weird. Multi-variable foreach is experimental just because it's new. Some of the builtin functions are currently experimental, but they probably don't need to be. class is obviously very experimental because we're changing a lot of stuff around; that will change and evolve over time. There's one particular experiment that I do want to draw attention to: when we un-experimented subroutine signatures overall, we did leave in one thing, and that's that if you use the default arguments array, @_, for some reason inside a signatured sub, that does currently print an experimental warning. The reason is that it's kind of annoying to implement, and if people stop doing this, then we can get rid of a whole bunch of the implementation and make all function calls faster in Perl. So please stop doing this, and then we can make your Perl faster. [Audience] Could that become a feature? We could, yeah, it could become a feature. Maybe, maybe. We'll see, it's complicated; talk to me at lunch. Anyway, we've only got 10 minutes left. Coming up in 5.40, the new release that we're expecting out sometime this summer, most builtin functions should become stable. At the moment, with things like reftype you get experimental warnings; when you do use v5.40, you won't get an experimental warning anymore, because, hey, it's fairly simple, fairly stable, seems to be fine. We're also going to get builtin bundles from use VERSION. You know how I said use v5.36 implies all of these things?
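The `@_`-in-a-signatured-sub warning the speaker asks people to avoid comes from code like the first sub below; the second shows the shape that lets the implementation be simplified. A sketch, assuming Perl v5.36+ (where, I believe, the warning category is `experimental::args_array_with_signatures`):

```perl
use v5.36;

# Peeking at @_ inside a sub that has a signature is what triggers
# the experimental warning the speaker mentions:
sub count_args_old ($first, @rest) {
    return scalar @_;       # <-- this use of @_ warns
}

# Express the same thing through the signature itself instead:
sub count_args ($first, @rest) {
    return 1 + @rest;       # no @_ needed
}
```

If nobody touches `@_` in signatured subs, the interpreter no longer has to populate it, which is the speedup being offered.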
Well, use v5.40 will add another one. That means when you write use v5.40, you get all of these builtins for free, so you can write use v5.40; say reftype ... and you just get the thing. And obviously we're going to put that in with the capital E as well, so you can just do perl -E 'say reftype ...'. Look at that, that's lovely. Everyone likes to do reftype in their one-liners. Yeah, I don't know; it's hard to come up with small examples for these, but it's nice that they're there. It's nice that you don't have to ask especially for them; you just get them. So yeah, use v5.40. Now I want to talk a bit about the process behind some of these things. We have this thing, the Proposed Perl Changes, the PPCs. It's a formal process where people can request changes in the language. We've already seen that in v5.36 we had the n-at-a-time foreach; that one was written by Nicholas Clark. We have defer, the booleans and, well, no, the slide says booleans; that should say builtins. That's a bug. I wrote those ones. Xenu wrote the command-line flag for slurping; it's just a small little thing. Rick wrote the builtin indexed one. These are all the people who wrote the documents; these aren't necessarily the people who implemented the code. Part of the whole PPC process is about saying: if you have an idea for Perl but you don't know how to implement it, that doesn't matter. Write us a document explaining the kind of thing you want, and if we accept it and we like it, we'll say yes and work out how to get it implemented. You don't have to implement it. In v5.38 we got rid of the old tick, the apostrophe, as a package separator; that was Nicolás Mendoza who wrote that one. Over here, you did module true. And I can't remember who implemented that; I did some of it, but someone else did. Who? chromatic? Yeah, chromatic wrote that one.
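Today you still import the builtin explicitly; under the announced bundle it would be implied. A sketch (the v5.40 behaviour is as described in the talk, not yet released at the time):

```perl
use v5.36;
use builtin 'reftype';              # explicit import needed pre-5.40
no warnings 'experimental::builtin';

say reftype([]);     # ARRAY
say reftype({});     # HASH
say reftype(\"x");   # SCALAR

# With the announced bundle this would shrink to:
#   use v5.40;
#   say reftype([]);
# or on the command line:  perl -E 'say reftype([])'
```

The point of the bundle is exactly this shrinkage: one `use v5.40` (or `-E`) and the builtins are simply there.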
Yeah, you just suddenly surprised us one day: oh, by the way, I've implemented this. Wow, OK, fine. So we have module true; that's quite nice. And the lexical exports. Sorry about that; we're going to have to change them. Yeah, chat to me later. There's only one little PPC that we're testing at the moment for 5.39, and that's the load_module builtin. It's going to be quite nice. It's just a nicer way of doing require when you have a package name in a string: rather than having to do all of the horribleness of turning it into a file name, you just call load_module. It's quite nice. There are a few other ones that we're in the middle of implementing. Things like English names for punctuation variables: rather than writing $@ like that, you could just ask for the eval error by name. It's quite nice. Template strings. I'm almost upset you didn't use one; you had a sprintf in your code earlier, Rick. I mean, come on. [Rick] Well, if you would finish the implementation... Yeah, it's hard. Sublexing is hard. You know this horrible thing, especially with objects: if you've ever tried to invoke an object accessor inside a quoted string, you'll know you can't do that, so you're always having to break out of the quoted string and stuff like that. So we've stolen template strings from a few other languages, and then you can just put expressions in your strings. It's lovely. It's nice. It's horrible to implement. If anyone knows how to implement it, let me know, because I've had about three attempts. Anyway, another one that we're in the middle of implementing is optional chaining. Python actually said a couple of weeks ago that they were considering this; they call them the None-aware operators. You want to do a method call or a hash lookup or whatever it is.
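The "horribleness" that `load_module` is meant to replace looks like this; the stable part is real Perl today, while the `load_module` call itself is the proposed interface, sketched only in a comment:

```perl
use v5.36;

# The dance load_module is designed to replace: turn Foo::Bar
# into Foo/Bar.pm by hand before calling require.
sub load_package ($pkg) {
    (my $file = $pkg) =~ s{::}{/}g;
    require "$file.pm";
    return $pkg;
}

load_package('List::Util');
say List::Util::sum(1 .. 10);   # 55

# Proposed builtin (the 5.39 experiment, per the talk):
#   load_module('List::Util');
```

`require` with a bareword resolves the `::`-to-path mapping for you, but once the package name lives in a string you are stuck doing it yourself, hence the proposal.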
But the thing on the other side might be, well, in Python's case None, and in Perl's case undef, and you want the whole expression to just return undef instead. So we have this wonderful idea of just putting a question mark on the operator name. With that, if the hash key exists, it'll call name on it; if the hash key doesn't exist, or it's undef, the whole expression is just undef. And that's often exactly the thing you want. It's nice and neat and tidy; I like it. And the metaprogramming API. All those crazy things that you do with no strict 'refs' and glob refs and all this other stuff that's horrible and messy: we're going to make that much, much nicer. You just get a meta-package, and you get the symbol out of it, and you get the value in it. It's all lovely. It's all inspired by things like Package::Stash and a bunch of other things on CPAN, but we want to make this an official part of core perl so that we can tie it into things like the object system as well; it just makes that much more powerful. A few other little upcoming ideas, at some point but probably not in 5.40: I'd like to have named parameters in signatures; it would be nice to be able to have these named things here. And I want to do more stuff on class. I've not really added anything extra to class for 5.40. So roles would be nice. The convenience accessors might be nice; it's possible that by 5.40 I'll get around to the easy one, reader, but even something like writer is going to be a little bit awkward. Still, even just having readers in 5.40 might be nice. I'll see if I can get around to it. And I've got three minutes left. Yep. The last thing I want to do at some point is renumber 5.whatever into 7, because I really want to be able to type use v7 and just have it work. And with that, I'm going to say that's the end. There's a link to the slides.
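Since optional chaining is still only a proposal, here is what you write today, with the proposed question-mark form sketched in a comment; the `?->` syntax is taken from the talk, not from any released Perl:

```perl
use v5.36;

my %order = ( customer => { name => 'Ada' } );
my %empty;

# Today: guard against the missing/undef step by hand
my $name  = defined $order{customer} ? $order{customer}{name} : undef;
my $name2 = defined $empty{customer} ? $empty{customer}{name} : undef;

say $name;                       # Ada
say $name2 // '(no customer)';   # (no customer)

# Proposed sketch: undef if the customer slot is missing or undef,
# instead of a "Can't use an undefined value" error:
#   my $name = $order{customer}?->{name};
```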
There's also a link down here to some slides and the video of my talk "What's new in 5.38", which goes into a lot more detail about the new things we added in 5.38. We will take some short questions, but bear in mind we're now the last talk in here, so afterwards I'm going to go for lunch; if people want bigger chats, we can chat over lunch or in the hallway or something. So with that, short questions. Yeah. Question. [Audience] With the class support, do you expect, or is it planned, to implement interfaces? So the question is about interfaces: do we plan to implement interfaces? In summary, no. Java's idea of an interface is all about defining up front what kinds of methods you can call on a thing. It's all to do with static typing; that's exactly all that it is. And Perl doesn't have static typing in that sense. If you have an object, you can always, at compile time, write the code to invoke any method you like; maybe at runtime the method may or may not exist, but that doesn't matter then. Whereas adding the concept of static typing to a dynamic language like Perl basically turns the entire language upside down. So the idea of a pure interface isn't really a thing that we want to add. But we definitely want to add roles, because roles are statements about an interface that can also carry implementation with them. It's all about gluing small bits of functionality together to make a larger class. So we definitely want roles, but pure abstract interfaces are not really a thing that fits in dynamic languages. Oh, good. [Audience comment] Java allows default implementations in interfaces now. Oh, does it? So basically they're roles. Yeah. Yeah. Okay. Ours are much nicer, though. So, we've only got one minute left. [Audience] For my ($x, $y) (@array): if $y happens to be undef, how do we know whether it's because the value is undef or because we hit the end of the array? It doesn't matter at that point.
It's just... oh, is this about the default-argument thing in signature parameters? No, the multi-variable foreach. Oh. [Audience] I want to know that my array is even-sized if I'm pulling out pairs. Yeah. So for the foreach, when you have multiple variables, if the size of the array doesn't exactly match, if it's not a whole multiple, you will just get undefs for those last missing positions. We did think about other bits of behaviour, but I think in the end we decided that it should just match list assignment, because if you just did my ($x, $y, $z) = @array and you get undefs in those last few values, you don't know whether that's because there were undefs in the original array or because you just ran out of values. So it's kind of the same thing. If we did consider implementing something where you could tell the difference, then you'd start to ask whether you should put it in those other features as well. A large part of trying to do language design is saying: we're not just going to do this one isolated feature; we have to consider how it plays with all these other things. And running out of the array is a thing that happens in a lot of places. So I think that's the end of questions now. We'll stop there, but if people want to chat more, I'm happy to chat over lunch. Thank you very much. Thank you.
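The undef-padding behaviour discussed in this question can be seen in a short sketch (the feature is experimental in v5.36, hence the explicit opt-in):

```perl
use v5.36;
use feature 'for_list';               # experimental in 5.36
no warnings 'experimental::for_list';

my @pairs = ('a', 1, 'b', 2, 'c');    # deliberately odd-sized

foreach my ($k, $v) (@pairs) {
    # On the last iteration $v is undef because the array ran out,
    # exactly like  my ($x, $y, $z) = @too_short;
    say "$k => ", $v // '(undef)';
}
```

If the distinction matters to you, check `@pairs % 2 == 0` before the loop, just as you would before an uneven list assignment.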
Open Source DocOps
Welcome. Our first speaker will be Lorna Jane Mitchell. I always say Lorna Jane as one word. I think everyone knows her, but in case you don't already know Lorna, she's going to talk about open source DocOps. Take it away. Thank you. Hi everybody. Thanks for coming. It's a busy room and you've had a busy day; I hope your brains are not too full for something more. My name is Lorna. I'm VP of developer experience at a company called Redocly. We make API tooling, including documentation tooling. I've worked on docs projects in a couple of previous roles. I describe myself as an engineer with a writing problem, and I'm very happy to be here with some like-minded individuals. I'm also passionate about open source. My background is in software development; I learned in the open-source community. I'm an open-source project maintainer and an open-standards contributor. And I want to bring to you today how open source and DocOps work together. So, this works better if I plug it in. There we go. This is the second talk of the day; I'm not sure I've still got sentences. Okay. What is DocOps? It's in the talk title; you believed in it enough to be here. Documentation operations is about allowing documentation to be created, but also maintained and published, collaboratively and efficiently. It's really about being able to make changes with confidence, and being able to make a lot of changes with lots of contributors. The way I think about DocOps is that, coming from some of the more traditional documentation practices, DocOps is a culture shift. Some of you are enough in the software space to have seen the DevOps culture shift, and we're bringing something very similar to our written word. Everything I'm going to say in this talk builds upon the concept of docs as code. If you are not treating your docs as code, you cannot benefit from the cool tools that the coders build for themselves and that we adopt into our toolchains.
This especially includes source control: Git is the key to many of the workflows that I'm going to talk about today. Text-based markup, so that we can manage multiple change sets simultaneously and bring them together without pain. I personally enjoy rebasing, but you shouldn't have to. Bringing in continuous integration and those practices, and also having a good local setup: if you have to push to see if you did it right, that's not a good documentation-creator experience. And having good tools all the way through the stack is what makes this a really effective workflow; it makes you very productive and lets the machines do the heavy lifting. For a long time I used to say that the software developers, the coders, build the tools that they want to use, but I don't think they should keep them for themselves. I think we should take them and bring them into our world of documentation. Open source: you're at FOSDEM, so in English I would say I am preaching to the choir. Open source means freedom, but it also means not having to build the same tool in every team that needs to publish a docs platform or check that the links work. It means being able to run that tool wherever you want to. Tools that fit into continuous integration systems are typically open source by default; we don't expect them to be gated behind licenses or sign-ins, we expect them just to run on our temporary compute platforms or on our local machines. Best of all, there's no vendor lock-in: we can choose this tool or that tool, and because we chose one, we're not stuck being unable to move to another. We're using standard formats and open-source tools. But just because we didn't have to build and rebuild the tool doesn't mean we don't have to contribute at all. We all need to be participants in the tools that we use: reporting bugs, fixing things, thanking our maintainers when we see them. It's all part of the story.
So I'd like to share with you some of the tools that I use on my docs projects, and I've tried to pick just a few categories of things that I think are vitally important. We'll start with the obvious: you need to be able to preview your docs change before you publish it. Everybody who contributes to the documentation or reviews any docs should have access to a tool like this. This is a screenshot of VS Code: I'm editing an OpenAPI file on this side, and this is the Redocly rendering on the right-hand side, and I typically work like this. I always have local tooling that updates immediately, so I can see instantly: oh, that didn't render like I expected; there's something wrong with this; I can clearly see that's broken, my table is missing a cell. I get that live-preview response, and this is part of the story. It doesn't have to be embedded in your IDE: you can run a local server that updates, or use a watch command to rebuild your static site, but you should have fast preview when you are working on documentation. You also need to be able to see the build errors locally, if there are any; I see too many setups where that's hidden away somewhere hard to find. The other place you need preview is in your pull request. When you open the pull request, the docs need to build exactly as they are going to ship, so spin up a per-pull-request preview. Don't merge the branch, put it on a staging server, and hope. Pull-request builds for previews also enable the reviewers: it gives them a nice view. I used to think that previewing docs was for people who weren't technical enough to read Markdown. Now I'm a VP; it's just people who are too busy. Put the web page in front of me and I can review it; if I have to go and read the raw markup in a pull request somewhere, it's a bit less likely to happen. Okay. Link checking. Who has link checking in their docs build today? Yeah.
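A per-pull-request preview can be wired up with a CI job along these lines. This is only a sketch: the build and deploy commands (`npm run docs:build`, `deploy-preview.sh`) are placeholders for whatever your own stack uses.

```yaml
name: docs-preview
on: pull_request

jobs:
  preview:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Build the docs exactly as they will ship to production
      - run: npm ci && npm run docs:build
      # Publish the result somewhere reviewers can click through,
      # one preview per pull request
      - name: Deploy PR preview
        run: ./deploy-preview.sh "pr-${{ github.event.number }}"
```

The essential property is that the preview job runs the same build as production, so reviewers see exactly what will ship.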
It's not very many, and it's the thing that rots most easily in your documentation. There are two problems. One is all the links between your own resources, which are just super easy to get wrong. The other is other people breaking their links, making you look like a fool. So I use a link checker to check both of those; it automatically clicks, as it were, on all the links. For a long time I was building the HTML and checking the links after render, which is cool and works. Now I'm working on more of a dynamic site, so I actually have a tool which checks at build time. I'm using mlc; there are lots of others, pick your favorite. It can read Markdown, so it can just check: this link makes no sense, your syntax is terrible, please do this better, all those things. Either approach works, but I think it's very important, and it's an easy thing to add. You can run that tool locally, and you can run it in CI. The downside of checking all your links is really other people. I mean, all the problems are other people, aren't they? Sometimes the internet goes wrong. I used to work on a documentation platform which relied on an upstream open-source project. Whenever that project launched a new version, all its links were broken for 12 hours. There comes a point where you don't want to know the explanation for that, but it meant that all of our builds failed for 12 hours because the links were broken. No, no: their links were broken. So I have a couple of different strategies for this. One is to only check the links in the files that have changed, because especially on a big documentation set, you don't want to have to deal with something that's gone wrong in a link from another section that might be owned by another team. So I do that, and then I run a weekly check-all-the-links job. If that job fails, it opens an issue.
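The two-track strategy described here, changed-files-only on pull requests plus a weekly full sweep, might look like this as a CI sketch. Tool and action names are illustrative; swap in your own link checker and issue-opening step:

```yaml
name: link-check
on:
  pull_request:           # fast check on every change
  schedule:
    - cron: '0 6 * * 1'   # full sweep every Monday morning

jobs:
  links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # On pull requests you would restrict this to the changed files;
      # on the weekly run, check everything and open an issue on failure.
      - run: mlc docs/
```

Keeping other people's broken links out of the PR path means upstream outages degrade into a weekly issue rather than blocking every release.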
So if something's decayed, we'll catch it, maybe not always faster than a customer, but fast enough. These are some things to think about: whether somebody else's broken link or downtime should block your build or your release, because other people's links are outside of your control, and so that can be a hazard. Let's talk a bit about validation. If you're coders, you are accustomed to working with syntax-checking tools. Some programming languages will error at build time before you even run them; some are more interpreted, so they don't go wrong until you run them. We don't historically do that with our documentation, but the tools are there, especially when you are doing docs as code. Docs as code has all the advantages of working in code, and it has all the disadvantages of working in code. It can be far from obvious that something is wrong; the errors can be super subtle. You have a full stop where the comma should be, or the wrong sort of bracket. Even though I work with this stuff all the time, it can be very difficult for humans, and super simple for machines. So we can build on those tools and let the machines do the work. The other thing I like about having the validation automated: I can run the checks locally. I never do; I always push and then wonder why it failed. The other thing that's nice is that when you push your pull request and you are missing a comma, or you have the wrong sort of bracket, perhaps this is personal to me, but it feels kinder coming from a machine than having someone else criticize my use of a bracket. And I don't have to wait for a person to come and review it; I immediately get that very impartial, factual feedback that my bracket is, in fact, wrong. That's what I like about using validation like this. I was going to say the bots are not judging me. What a horrible thought; are they?
For validation tooling you have a few options, and it depends a bit which flavor of markup you are using. I'm working mostly with Markdown these days, although let's just say it's not because it's my favorite; let's keep the markup-language war for later. I'm using markdownlint. With Markdown I find it very good and very, very configurable. As with all the linting tools, and the same with OpenAPI, which I work with a lot as well (probably some of you have API reference docs), the default settings are too loud, especially if you were not already using a linter at all. markdownlint is really configurable, and it has really excellent documentation on what all the options do. It is remarkable how few documentation tools have genuinely good documentation; this one does. For reStructuredText I've mostly been using it with Sphinx, and Sphinx has really great validation; I think it builds on docutils, so you can use that by itself. All of these also come with command-line tools and IDE plugins, and you can put them in your continuous integration. So GitHub Actions, Jenkins, whatever it is that you use in your setup: set that up for your prose content exactly as you do for your code. If you're using OpenAPI, you should also be at least validating that. I've already given my OpenAPI talk today, so I will attempt not to rant about API linting and standards, but put those tools in, set your standard, and make sure that you are consistently checking it. Again, it goes in your tooling. Disclaimer: I make Redocly CLI; that's my day job. Other excellent competing open-source tools also exist, and I'm probably not the right person to take a recommendation from; I'm very biased. So, we talked about validation; very closely related to validation is formatting. Software development does a lot of automatic reformatting of code, and that is to give a very consistent presentation.
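Turning the linter's "volume" down is a matter of config. A minimal `.markdownlint.json` sketch might relax or disable the rules that are too loud for an existing docs set; the rule IDs below are real markdownlint rules, but which ones you relax is entirely your call:

```json
{
  "default": true,
  "MD013": { "line_length": 120 },
  "MD033": false
}
```

Here `MD013` (line length) is relaxed to 120 characters and `MD033` (no inline HTML) is switched off, while every other rule stays at its default.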
We always use the same white space in the same way, the same indentation, the same wrapping rules. It makes the code visually very consistent, so when you work with the same code base all the time, it gets easier to read. We can do that for our markup too: Markdown, reStructuredText, AsciiDoc, whatever. By allowing tools to adjust our newlines, our white space, the indentation, the wrapping, things like whether you need a blank line before your bullet list or after your heading (lots of renderers don't care), by getting all of that consistent, you make it easier to read the raw text, and easier to look at it and spot problems, because the layout is so consistent. I've only recently started doing this. I write a lot of docs that live in the same repository as the code, and we just turned on the engineers' Prettier tool for our Markdown. It's actually really nice; initially I was like, of course you can, I don't mind, and now I'm turning it on everywhere. So yeah, I really recommend it. I also really enjoy prose linting. I don't see enough of this. I'm using a tool called Vale, and I'll be honest, I don't know very many other tools in this space. Lots of people nodding; good. I'm also happy to be contradicted: tweet me what I should have said. You can give Vale a dictionary, so it will do all of your spell-checking for you, and it can also do quite a lot of grammar checking. This is brilliant for me: I work with almost entirely non-native speakers, so having a little bit of help for me and for them to get the words out correctly is brilliant. I am a native speaker; it doesn't always help. So Vale helps me a lot. Also, you might be able to tell from my accent: I'm British. My company has standardised on American English, and at this point my spelling can only be described as mid-Atlantic.
So we have Vale just to catch those common slips; we have a Britishisms rule enabled, and it's because of me, typing all these British-spelled words into our American docs. It catches repeated words. You can teach it product names. In my previous employment I worked with a company that published a bunch of open-source database products, and you have to get people's trademarked product names right: upper case, lower case, trademark. This has to be legally correct. So unless you want your lawyers to have to think about this a lot, you just teach it to Vale, and Vale explains it back to you really regularly. The other thing we did there was to put a bunch of the products' common misspellings in. We worked on Kafka; when I set up a search for "Kaka", loads of hits. We also banned the English word "flick", because we had a product called Flink, and we just don't need that word in our docs; it's almost certainly a misspelled product name. Those are the sorts of things Vale can help with. I know we have a Vale talk next. Yes? A little cheer. So I'm not going to say more about that. Vale's amazing; stay and listen to the talk. Okay. Let's talk a little bit about how all these amazing different tools, which solve different problems and have your back in lots of different ways, fit into that life cycle, that workflow. The key is that you use exactly the same tools with exactly the same config everywhere, between your laptop and your production platform. That's the goal: every contributor needs access to the same tools, set up the same way. If you haven't used the tools, or you don't yet feel confident (I know lots of people who have been using Git for years and still think it might bite, which is fair), there are lots of things to learn. Source control.
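Teaching Vale product names and banned misspellings is done with small YAML rule files. Here is a sketch of a `substitution` rule in the spirit of the Kafka/Flink examples; the file path and style name are illustrative, and the rule shape follows Vale's documented `extends: substitution` format:

```yaml
# styles/MyStyle/ProductNames.yml
extends: substitution
message: "Use '%s' instead of '%s'."
level: error
ignorecase: false
swap:
  Kaka: Kafka      # common misspelling of the product name
  flick: Flink     # banned word: almost always a misspelled 'Flink'
```

Point your `.vale.ini` at the styles directory (`StylesPath = styles`, `BasedOnStyles = MyStyle`) and the rule runs everywhere the same config is shared: locally, in the IDE, and in CI.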
I'm focused on Git, but I've been doing this long enough that I learned on something else, and I don't doubt that there will be more transitions in our future, because that's technology. I like a workflow called GitHub flow, where you have a main branch, you make a small change on a branch, it gets reviewed, it comes back in. If you see another spelling mistake, don't put it on this branch; put it somewhere else. It means that you can branch off lots and lots of shoots that are waiting to be reviewed and merged, and in this way you can multiplex lots of changes, even while one feature is waiting for review. Be confident: actively practice changing branches, because it will give you the momentum to switch a branch, make an edit, push it back. If you are writers, you are probably editors and reviewers as well; these are the skills that will multiply the things you're already good at, by getting the tools to help you. I've talked a bit about continuous integration: always hook in everything you find useful locally. Maybe you get an extra VS Code plugin; figure out how to put that into your continuous integration setup and apply that tool to every pull request. This way we can never forget to check the formatting or the links, because the check will just be there, and we won't say, "ah, that one's a bit risky, I think we should deploy it to staging and check". The preview will always be there, and the machines will always be on your side. It helps the reviewers to do a better job, and it maintains documentation quality. One of the most important places to have exactly the same tools and exactly the same config running is on your local machine. The smaller your feedback loop, the more quickly you can adapt, correct, and move forward. Having to open a pull request and wait for the build to see if it's okay is a big feedback loop; it's not ideal.
I have one project where I need to do that, because we have an amazing test-harness setup and it's much faster to run the tests there than here, so I open the pull request and let the build run, because that's quicker than waiting for it to run locally. But most docs tooling should take a few seconds at most, even on very large doc sites, and you must have it locally. If you use an IDE or similar, take the time to figure out how to plug these tools into that setting; lots and lots of them are supported in both places, and you can have the feedback in context. I use Vim, and all of those tools are plugged into Vim as well. So this is not modern, hand-wavy, cutting-edge stuff; this is standard practice. The other really important thing is that this is all written down. We're documentation specialists, everybody: write down how to set up the tools, how things are configured, where we publish to, where the sources are, how the remote sources come in, how things are set up, maybe some troubleshooting guides. Write that down. Onboarding should be easy, whether that's for a new hire or for the day you get a new laptop. Set yourselves up for getting it right, because again, we're looking for confidence and efficiency, and this sort of thing is part of the culture change. There's a saying in software about "move fast and break things". DocOps is about moving fast and not breaking anything. Maybe it doesn't matter as much in documentation, because it's easier to iterate than it is in code, or especially in API interfaces. But the goal here is that we have professionals who are really good at what they do, and the tools can make that faster, easier, simpler, more accurate; they can catch us on things that we might slip up on. So bring the tools, but also the DocOps mindset, into your projects, and see where it can take you. I am pretty much out of time. Here is a list of useful resources.
My slides are linked from my session page, and I will say thank you for your time. I think we have time for maybe two questions. Would anyone like to ask a question? Yes. That's a really good question: do I have tips for helping with the translation of documentation within this process? I haven't worked on a lot of projects that have this. On the ones that I have, Git is the key, because you know which files have changed and which things have changed. I have mostly seen setups where the translation is a mirror, and whether it's a week or a month or however often you pay your translation people, you can snapshot the pages that have changed and get those re-translated. So I think source control helps a lot with that. One more question. [Audience] Do you also have a strong opinion regarding documentation in Confluence or Notion or something like that? I would like to hear it. I will repeat the question for the stream: do I have a strong opinion about having documentation in Confluence or Notion or something like that? I have two strong opinions; one not too strong, because we are being recorded, and the other one maybe we can talk about in the bar. Using a tool like that hurts collaboration, because you can't all make multiple changes at once and bring them back together: one person is editing, and if you were both editing, it's very tricky to do that. The other reason is the lack of standards. On a very personal level, I have some accessibility needs: if you switch your documentation platform to Confluence or Notion, I can't do my job anymore. So docs as code is the way, because it lets everyone choose the tools that work for them. Thank you. All right. Thank you very much.
Easily going beyond Markdown with Material for MkDocs
No, it works. Okay. There we go. So, thank you very much, and enjoy. Yep, thank you. But before I get started, is this readable in the back or do we need to blow it up? A little bit bigger. How about this? Okay, good. All right. So, welcome to my talk on Material for MkDocs. Let me quickly introduce myself and my co-speaker and co-author. So, Martin isn't here today, but I'll get to that. I'm Kenneth Hoste. I'm an HPC system administrator at Ghent University. HPC is high-performance computing, supercomputing. Some people may not know this, but it's lots of servers, lots of noise, lots of money, lots of annoying users as well. There's a lot going on there. I've been the lead developer of EasyBuild for the last decade. EasyBuild is a tool for installing scientific software on supercomputers. It can get very fun, I can tell you. I'm involved in way too many FOSS projects. I patch things and try to fix things left and right. And I've been attending FOSDEM since 2013. If you think FOSDEM is total chaos, you should try organizing a dev room, doing a talk and planning live demos during your talk, which is what I'm going to do. So, I actually had to run out of the HPC dev room, which I'm co-organizing. I've been a big fan of Material for MkDocs since I discovered it, and I think more people should know about it, so that's why I'm here. The other person on the talk as an author is Martin, Martin Donath. He's the lead developer of Material for MkDocs. I reached out to him to ask him: please submit a talk on Material for MkDocs to FOSDEM. He said, I can't make it this year, so I said, okay, I'll just do it myself. And I had a call with him to discuss what should be in the talk, so he's been involved. All right, so why do I want to give this talk? Well, Material for MkDocs is great. More people should know about it, more people should be using it. It's very easy to install and use. You get very good results with pretty minimal effort, and I'll actually show this hands-on.
Tons of great features. It's actively developed. It's open source, of course. And there's a very interesting story about how it's funded as well, which I'll cover too. And I was very shocked that there has never been a talk at FOSDEM on MkDocs or on Material for MkDocs. That's just wrong to me, so I'm here to fix that. My personal journey: I actually haven't been using it for very long. It's pretty recent, basically since 2021. I had to create a tutorial website, or I wanted to create a tutorial website, for EasyBuild, the tool I'm mostly working on. The existing EasyBuild documentation was in Sphinx. I wasn't terribly happy with that. It felt slow. It was using RST. The syntax didn't make sense to me. It was very difficult to work with. We were not getting a lot of contributions to that documentation, so I was looking for other options. The tutorial was a totally new project, a totally new website, and I started looking around. I found Material for MkDocs, and I was sold after like five minutes. That tutorial was built with Material for MkDocs, and shortly after we started porting the EasyBuild documentation, and also our HPC documentation in Ghent, to Material for MkDocs, because it just made a whole lot more sense and was a lot easier to use. And new projects that I've started since then have always used this tool for documentation and tutorials. So to start with, what is MkDocs? How many people here are familiar with MkDocs? Who has used MkDocs? About half of the room. Good. MkDocs is a static site generator. It's not a very complex tool, I think. It has a very strong focus on building documentation for software, so technical documentation, code, all these kinds of things. The documentation sources themselves are written in Markdown, which is one of the things that sold me on MkDocs. Markdown is everywhere.
If you're doing pull requests on GitHub or GitLab, issues, formatting there, wikis, it's all Markdown, so the documentation that you're writing should also be Markdown, just to make the jump a bit smaller. To configure MkDocs, so how the site should look and all the bells and whistles it has, that's all done in a single YAML file. Maybe you don't like YAML, but at least it's a single file that you need to look into and figure out how to configure. MkDocs itself is implemented in Python, but other than when you install it, you don't really notice that. That's probably a good thing. It is very easy to install, use, customize, and extend. So how do you get started with MkDocs? This is a bit of a long list. You install it: pip install mkdocs, basically. You start by creating a landing page, so an index.md in a docs folder, typically; you can change that if you want to. You create a minimal configuration file, and then you launch MkDocs. You run mkdocs build, which will take the Markdown that you put in the index.md and generate an index.html from it. You can open it in your browser and you're good to start with your documentation site. You can do mkdocs build --strict as well. If you have any mistakes in your documentation, like you're linking to a page that doesn't exist, for example, it will warn you about that. And that's very useful in CI. If you're making changes to your documentation, you can run this in GitHub Actions, for example, and it will warn you that something is wrong and you shouldn't be merging those changes. There's also a way to live preview the documentation while you're editing it, through mkdocs serve. I'll show you that as well. And then you can go ahead and write your documentation. So showing all of that on a slide is very boring, so let's do it hands-on. And let's see how quickly this goes wrong. All right, so I'm essentially starting here from an empty folder.
There's an empty docs directory, just so I don't forget to put stuff in there. The first thing we have to do is install MkDocs. It's not there yet. So this is just pip install mkdocs. If you're a little bit familiar with Python, you know you have to be careful if you do pip install, because it may end up installing things somewhere you don't expect. So what I want to do here is create a Python virtual environment, if I remember how to do that. All right, so now I'm in the virtual environment, and in here I can just do pip install mkdocs. And if the Wi-Fi works, that should be working. So now I have MkDocs available, whatever version is there. Okay, so that's the pip install part, that's step one. Now I can create a very minimal mkdocs.yaml. And all you really need to put in there as a very minimal thing is the name of the site. So let's just put something here. And in the docs folder, we want to create an index.md in Markdown. So let's say hello FOSDEM, this is a demo. Okay, that's all we need. We do mkdocs build. This should be very quick. It generates a site directory with a whole bunch of stuff in there, including index.html. We can open this in our browser, and it looks like this. Hooray, it works. We even get a search function here. Of course, now there's not a lot to search yet. You can search for FOSDEM and it will bring me to that page. Okay, so the search functionality is already built in and ready to go. Now, once you start creating a couple more pages, let's say getting-started.md, like this: if you save this, you have to do mkdocs build again, and you have to refresh the site. And then here you see there's a getting started page as well. Now Firefox gets a bit confused because this is all static HTML, so it asks: what do you want to do? I want to open the page. What's more interesting is if you do mkdocs serve. So now you're getting a small web server running locally. You can click this. You see the same website.
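The whole hands-on sequence above fits in a handful of commands. This is a sketch with the demo values from the talk; MkDocs accepts both mkdocs.yml and mkdocs.yaml as the config file name:

```shell
python3 -m venv venv && source venv/bin/activate
pip install mkdocs

mkdir -p docs
printf 'site_name: FOSDEM demo\n' > mkdocs.yml
printf '# Hello FOSDEM\n\nThis is a demo.\n' > docs/index.md

mkdocs build           # writes the static site to site/, including index.html
mkdocs build --strict  # additionally fail on warnings like broken links (good for CI)
mkdocs serve           # live preview with auto-reload at http://127.0.0.1:8000
```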
But when you start changing stuff, for example in an existing page, let's say: magic happens. As soon as I've saved this and I switch back to the site, this should now refresh. Oh right, okay. See, demos always go wrong. Try again. You save it, and if you switch back, it pops up there. So you're automatically getting a live preview while you're editing the documentation. To me this is absolutely brilliant. Okay, now what I don't like about this is the theme. Like, what the hell is going on? The lines are white, and getting started is here. Where's my hello page? So weird stuff is going on. That's where I think Material for MkDocs kicks in. So Material for MkDocs is a theme for MkDocs. It makes things a whole lot better, nicer to look at, straight out of the box. Very easy to use. And it comes with a whole bunch of plugins and extensions, so extra features that MkDocs cannot do by default. I see this as MkDocs with batteries included. This is actually how MkDocs should be out of the box. Again, easy to install, use and configure. All you need to do, in your Python virtual environment, so I'll have to kill the serve here, is pip install mkdocs-material. So you just install an additional Python package, which will bring in a whole bunch of extensions, and there's a whole lot of stuff going on here. I'll serve this again. Now, if I look at the website, nothing has changed yet, because I have to change the theme that's being used as well. So in my mkdocs.yaml, I say theme, name, material. And as soon as I hit save on this and I switch back, I think it needs a refresh. Why is it not working? Oh, something went wrong. Ah, demos. Okay, let's try restarting this. Okay, I'm not sure what went wrong with the live preview; usually that works. So this already looks a lot better. At least now I'm seeing my pages and the search. The search here is amazing. It's blazingly fast.
Even if you have pretty big documentation, it highlights the hits, and you can customize the search. You can rank pages up or down if you want them to be more prominent in your search results; there's a whole bunch of stuff you can do. All right, so to get started with Material for MkDocs: just pip install mkdocs-material, change your mkdocs.yaml to use material as the theme, and things start looking a lot better already. And now the fun really starts, because there's a whole bunch of plugins and extensions you can start using as well. Now, I'll use a quick cheat code here, because the mkdocs.yaml I'll end up with is going to be pretty big, since I want to show you all the bells and whistles. I'm not going to type all of that, so I have a hidden file here that I'm just going to move into the right place. We open this, and you can see there's a whole bunch of stuff going on now. I'll explain in the slides what's going on. So one of the first things you can do is start playing with the colors. You can change the accent colors. Here I use FOSDEM purple. That's very easy to change. You just say the palette primary color should be purple. The accent color, that's when you hover over stuff, should be blue. So it's very easy to play with the colors if you're interested. What's also very easy to do is introduce light and dark mode in your documentation. With a little bit more stuff in your mkdocs.yaml, you can say: I actually have two color schemes, a light mode and a dark mode. The dark mode is called slate, for whatever reason; the light mode is called default. Okay, now you know. And what actually happens when you do that: here, when I moved that big configuration file into place, it already did a re-render, and now I have dark mode here as well. And it's actually a dark mode with tuned colors, so I'm getting FOSDEM colors in my website now as well. So that's one small thing that's very easy to do.
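A minimal sketch of the color and light/dark configuration described above, following the palette keys from the Material for MkDocs documentation (the exact toggle icon names are plausible choices, not taken from the talk):

```yaml
theme:
  name: material
  palette:
    # Light mode
    - scheme: default
      primary: deep purple
      accent: blue
      toggle:
        icon: material/brightness-7
        name: Switch to dark mode
    # Dark mode ("slate")
    - scheme: slate
      primary: deep purple
      accent: blue
      toggle:
        icon: material/brightness-4
        name: Switch to light mode
```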
Now let's show off some of the additional features. Let me start a new page here. Let's call it material.md. And let's start by showing content tabs in Material for MkDocs. Save this, go back. It should pick up on that straight away. Okay, so content tabs are a way of getting tabs in a subsection of your documentation page. And the best way to show it is to really do the demo. So I'll copy-paste this Markdown code in here, and I think it needs empty lines in between or it will not be happy. Right, and now I have tabs in my documentation. That's very nice if you need to show different examples, with C++ and Python code for example; this is a very nice way of doing it, because people can just pick what they're interested in. You can also let people give a preference, like: always show me the Python stuff. It will remember the first time they picked something, and throughout the whole page it will then always show the Python example by default. So it does some caching of this as well. To enable this you need to enable two extensions. The first is SuperFences: SuperFences lets you embed content into each other, so you can have content tabs that then include other stuff; it basically goes recursively, so you want to enable that. And then you enable tabbed with alternate_style set to true. Why it has to be true, I don't know, but fine, it works if you do it like this. Code blocks are a very nice thing as well, also built into Material. So let's show that off here. We can do a code block with Python code, and that looks very nice. This uses Pygments to do the syntax highlighting. You tell it that it's Python here; it doesn't figure that out by itself. I could try rendering this as shell and it's probably going to look a bit funny. Okay, so it looks reasonably okay. All of this works out of the box; you don't have to install any additional stuff to make this work.
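As a sketch, the content tabs markup being demoed looks roughly like this:

````markdown
=== "Python"

    ```python
    print("Hello FOSDEM")
    ```

=== "C++"

    ```cpp
    #include <iostream>
    int main() { std::cout << "Hello FOSDEM\n"; }
    ```
````

with the two extensions mentioned enabled in mkdocs.yaml:

```yaml
markdown_extensions:
  - pymdownx.superfences
  - pymdownx.tabbed:
      alternate_style: true
```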
It knows about pretty much all the programming languages out there. If you want to try Fortran here, it will probably still work. Another very nice feature is what's called admonitions, which is a very strange word to me. I'm not a native English speaker, I'm not familiar with this term, but it's things like notes, warnings, tips. So all of these kinds of boxes you can include in your documentation are called admonitions in Material for MkDocs. A small demo of that is here. Let's do admonitions, notes and stuff. Again, it needs an empty line in between or it will not be happy. And you start getting notes. You can use custom titles here. All the admonitions have a particular type, which mostly defines the color and the icon you get here, and you can change the title here, so you don't get the default. The default would just be 'Tip', I think. So if I remove this, you'll see 'Tip' instead. I think there's a more normal name for this. Sorry? Callouts, yeah, okay. Fine. Over naming you can always discuss. I didn't pick the name, so don't blame me; blame Martin for that. No, no, it's fine. To me, it's a confusing name. Another thing I really like, and I know very well that not everybody is a big fan of it, is emojis. You can use emojis in your documentation. I think this is great. It makes things a bit less serious, a bit more lighthearted. You can have some fun in your documentation as well, because some people think it's very boring to read documentation. So for some people, this works. There's emojis, and there's icons as well, so there's an arrow in here. This arrow-right is not really an emoji, it's an icon. So this works pretty well too. Again, I want an empty line in between here. So: be careful, if you have too many Belgian beers, you may get sick in the morning. All right, so this really works well for me. And in the documentation for Material for MkDocs, there's a search engine here, so you can look for beer, and it will give you all the options that you can use.
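A sketch of the admonition and emoji syntax being demoed; the emoji extension paths follow recent Material for MkDocs documentation and may differ in older versions:

```markdown
!!! tip "Watch out for Belgian beers"

    Too many and you may get sick in the morning. :beer:
```

```yaml
markdown_extensions:
  - admonition
  - attr_list
  - pymdownx.emoji:
      emoji_index: !!python/name:material.extensions.emoji.twemoji
      emoji_generator: !!python/name:material.extensions.emoji.to_svg
```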
You can look for arrow, and if you click something in here, it will copy it to your clipboard, so you don't even have to type it over. Really well done. Over 10,000 icons, so you can probably find something that you can use in there. All right, another very cool feature, which I haven't used very much myself, is Mermaid. Mermaid is a JavaScript tool to create diagrams. So with a block of Mermaid code like this, you can start including graphs in your documentation. And these can be very complex. They render very quickly, and it not only supports diagrams like this, but you can do pie charts, UML diagrams, a Git branch workflow, that kind of stuff, so this is very rich in terms of what it can do. Again, you have to enable the corresponding stuff in your mkdocs.yaml, so you need SuperFences with some custom fences and yada yada; just copy-paste this, and you can start playing with diagrams in your docs. You hit save, and if you're quick enough, where is the site here? You start having diagrams in your documentation as well. If you need this kind of stuff, this to me beats putting pictures in there, because here you can copy-paste stuff from your diagrams as well, so this is better in many cases. All right, I think I'm doing quite good on time. The last big feature I want to highlight is the blog support. This is quite new in Material for MkDocs. It has been in the works for quite a while, but now it's finally there in the open source version. So this is something special: a dedicated plugin for integrating a blog in your docs. All you do is enable the blog plugin, and then you can start with a special structure here, so you create docs/blog/posts and start creating Markdown files in there. Let me show you what happens if you do that. So we want to exit here. You want to make sure that the blog part is set up.
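The "just copy-paste this" Mermaid configuration, as documented by Material for MkDocs, plus a small diagram block:

```yaml
markdown_extensions:
  - pymdownx.superfences:
      custom_fences:
        - name: mermaid
          class: mermaid
          format: !!python/name:pymdownx.superfences.fence_code_format
```

````markdown
```mermaid
graph LR
  A[Write Markdown] --> B[mkdocs build]
  B --> C[Static site]
```
````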
So this part is auto-created by MkDocs: as soon as you enable the blog plugin, it will create your landing page. There are no posts in here yet, of course, so this is empty. I can create a small Markdown file here; let me copy-paste. So here you can see this has a date. This is basically the publication date. So if you put this in the future, it will not show up until that date hits. I think so; I can try whether that works. So here: blog. This is the blog post that we just added, and it has a dedicated page as well. It's hard to tell here, but in the URL it will actually use the date that you've put in the post as well. So everything is nicely date-stamped and so on. I think if I set this to a future date, it's not going to show it. So let's try February 5th. And now the post is... ah, okay, it's still here. All right, fine. But there is another way: you can set draft to true, as in, I don't want to show this yet. And then, at least on the landing page, it should not be there. So as long as it's a draft, it will not show it. If you flip it to draft false, or just remove that, it will come back. Okay, so this is built into Material for MkDocs. This is quite amazing. All right, so lots of features, and there's lots of stuff I haven't shown; it can do a whole lot more. So please take a look at the documentation of Material for MkDocs itself. It's a very nice tool. Another aspect I want to talk about very briefly is the way this is funded. Funding is a very big issue for lots of open source projects. And Martin has come up with a way that works amazingly well, and it's actually pretty simple. Material for MkDocs is what's called sponsorware. So there's an open source version available to anyone. You just download it from GitHub, you can pip install it, and you can start playing.
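A sketch of the blog setup being demoed: enable the plugin, then give each post in `docs/blog/posts/` a `date` (and optionally `draft`) in its front matter:

```yaml
plugins:
  - search
  - blog
```

```markdown
---
date: 2024-02-03
draft: true
---

# Our first post

As long as draft is true, the post is hidden from the blog index;
flip it to false (or remove it) to publish.
```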
But there's actually a private version as well, which has a couple more features already implemented that are not available in the open source version yet. To get access to this private version, you have to become what's called an insider. You become an insider to the project by making some kind of monthly donation to the project. I think it starts as low as $15 per month, so it's quite affordable. You can also do a yearly donation if you're up for that. And then you get access to all these new features that are only available in the private version, the insiders version. But eventually these features also come back to the open source version. That happens when a certain funding goal is hit. So Martin sets goals, for example: I want to get $10,000 a month of income, and then all the features that I list here will become part of the public version. As soon as they hit that target, that happens. And this is nicely covered in the documentation. So on the documentation of Material for MkDocs, you can see they're now getting over $13,000 a month, which is quite a lot, right? So Martin is actually building a team, a development team around Material for MkDocs, thanks to this funding. So right now this is the funding level, and he says: as soon as we hit $16,000, we will move all these implemented features from the insiders version, from the private version of the tool, to the public version. And then they stay in the public version forever. Now this is interesting, because they had hit the $14,000 target already, but then some sponsors dropped out, and now they're back to a little bit below $14,000 again. But that's fine: once it's public, it stays public. What's amazing to me here is that the private version is just a private fork on GitHub. So you get access to that private fork, you get added to the fork essentially as a contributor, so you can access that code.
But this model somehow works. You could say: okay, if I get the private version, I could just give it to anyone, right? And then it stops working. But for some reason, that doesn't happen. It's like an honor system: if you sponsor the project, you get access to it. And literally at the bottom of the page it says: please don't distribute the source code that you get access to. And apparently that works. They keep getting new sponsors over and over again; they're hitting these goals every couple of months. So that's maybe an idea for other open source projects to take a look at as well. So yeah, Martin told me that this was a bit of a jump, a gamble, let's see what happens. And it's been working amazingly for them, so he's able to build a development team rather than having to work on this all by himself. Okay, there are a lot of features that I didn't cover, which I'm not going to get into here; check the documentation. One thing I do want to mention: it also makes it very easy to publish your documentation on GitHub Pages or GitLab Pages. MkDocs has a GitHub Pages deploy command, something like that, and if you integrate that in your GitHub Actions workflow, it will push the site to GitHub Pages and nicely integrate that in your GitHub account. Yeah, that's all I have, and hopefully there's time for a couple of questions. Thanks. Let's have a couple of questions and we'll see how fast they go. First question. Very quickly. I have two of them. Do you know if the icons, not so much the emoji, but the admonition icons, and also the charts, are they vector or raster? So you're talking about these, right? So the question is: are these vectors or are they bitmaps? I'm not sure. I think they're vectors, but I'm not entirely sure. You could check, I guess, if you zoom in. Where do I have that website open? Here. So if you zoom in, you can tell that these are probably vectors, right? Yeah, right. So they look pretty good.
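The GitHub Pages command alluded to above is, as far as I know, MkDocs' built-in `gh-deploy`, which builds the site and pushes it to a `gh-pages` branch:

```shell
# Build the site and push it to the gh-pages branch of the current repo
mkdocs gh-deploy --force
```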
The question is, maybe it's a stupid question, but is there any kind of translation of this kind of documentation to, let's say, PDF? Okay. Is there a way to export the documentation to PDF? I think the answer there is no, but they're very much aware that this is a missing feature, let's say, and it's something they want to work on. I'm not entirely sure that's correct, but I think that's what Martin told me. Compared to other tools, there's Docusaurus and there's Sphinx and there are other things, and some of those tools can do a little bit more. There's a plugin? Yeah. There's a plugin for PDF export. Look at that. So the plugin system in MkDocs is very nice. Yeah. Yes? That's great. Yeah. One of the nice things about Sphinx compared to MkDocs is that you can easily do code documentation, so you just reference something in the documentation and it pulls it in. You mean like generating API docs? Yeah. Yeah. There's a plugin for that for MkDocs. I didn't show it here; I actually don't have it on the slides. Okay. I had it somewhere, but I think it's mkdocs... oh boy... docstrings. Yeah. So it's mkdocstrings. That's the plugin you want to generate API documentation. I'm using this in the EasyBuild documentation, for example. Works fine. Yeah. So the question is: did you run into issues with complexity, because it's one tool on top of another tool, and knowing where to look when there's an issue, whether it's in Material or in the underlying MkDocs, and how to deal with that. So, repeating the question: did you run into issues with complexity, because it's a tool on top of another tool, and if something goes wrong, where's the problem? Not really, because usually if something goes wrong, you get a Python crash and you can tell whether it's in a particular plugin, or in Material, or in MkDocs itself. I haven't run into many issues like this, but if it happens, it's usually quite clear.
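For the API docs question, a minimal mkdocstrings sketch: enable the plugin, then drop an identifier marker into a Markdown page (`mypackage.mymodule` is a placeholder, not a real module):

```yaml
plugins:
  - search
  - mkdocstrings
```

```markdown
## API reference

::: mypackage.mymodule
```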
And if you don't know, you just report the issue to Material for MkDocs, and one of the maintainers is going to tell you: it's not an issue here, it's an issue there, you should report it there. And you say one thing on top of another, but it's not entirely that; it's more like plugins. So they do integrate with each other and there's some complexity there, but yeah, usually it's quite clear: if you get a Python crash, you can tell where it's coming from. Can I stop here? Yeah. Thank you very much. Okay. Thank you.
Embeddable code playgrounds for fun and profit
Okay. Well, cool. Yeah. So, the slides are displayed up there. I usually like standing and jumping around, but here I've got to type, so you'll forgive me for sitting down. So, we'll talk about embeddable code playgrounds, specifically in docs. Let me ask: how many of you prefer dull static docs compared to interactive docs? Because you probably have to maintain them. Oh, okay. Okay. I thought you were writing docs. Yeah. Well, understood. So, just so you know who I am: Peter Zaitsev. Anton is actually the author of the code we'll talk about, but unfortunately he couldn't get a visa to come here, so you're stuck with me. But if you have any super advanced questions, there are Anton's contacts, and you can send them to him; he's a very responsive guy. So, if you think about it, more interactive code playgrounds, interactive scenarios, generally work better for explaining topics and also allow you to engage the reader. Maybe not all the readers, but I think the best ones, the most curious ones, who actually want to understand how things work. We'll look at three items in this short presentation: the use cases, the approach we take in this open source project, and the implementation. First, let's look at tutorials. In tutorials, you often want to explain something by example. I think we can look at this very, let's say, simple case: we are actually using some real live SaaS out there, which provides a very simple database where we can push a simple JSON object. We can go ahead and run it. What happens in this case: it does the interaction as described above, then sends the object to the database. What we can also do is go ahead and modify that and run it. Then we can see this object was stored. Now, we want to demo the cloud API. In this case, to play with it, we can go ahead and also use the GET endpoint to play with it some more.
Let's say we add a second message, and then we can ask: is there some message number 45? We can see, well, it's not there. Again, if you really want to experiment and play with what works and how it works, this can be a very beautiful way to do it. Another cool place we found it being used is release notes. What we have in this case is this example: if you look at Go, they have just recently made a very important change in how loop variables behave in relation to goroutines. If you look at this case, it looks a little bit counter-intuitive: we have goroutines launched in a loop, and for some reason they are not showing different loop counters. Well, in Go 1.22, that was fixed. If you want to showcase a feature in the release notes and the documentation, and really let people explore it and poke at it, I think this is a wonderful tool. I'm not sure about you, but I do a lot of work with databases, and when I read about some feature, they say, hey, we implemented this new feature, I often want to poke at it: did you implement that option, or does it work in this way? This is a very easy way to play with it, rather than going through the whole installation process, and so on and so forth. Another example we can see is describing some of the options in documentation. Take curl in this case: everybody could use curl, and it has this wonderful JSON option, with this very cool, correct, but also very mouthful example, which we can also provide a runnable example for. Say, hey, here is a JSON object, we post that to the server, and this is what we get in return. This httpbin is actually another well-known open source project, which essentially allows you to post something and get back exactly what you posted, with all the headers and so on and so forth; very convenient for debugging.
What you can also do here, if you are curious, is say: well, interesting. So, curl has support for JSON. Does it validate the JSON, or just send whatever stuff we have? Well, let's check it out. We can see in this case that we are getting an error response back from the server, rather than some sort of curl output; that means it doesn't. Again, this can be very helpful to let the user explore what he's not very certain about, which may not be fully explained in this portion of the documentation. Or we can also showcase the example that exists in the docs of how to send the payload from a file. Pretty simple here. Okay. If you are looking at deep dives, there may be some interest in more functionality. Going to databases, the space where I spent a lot of my time: let's say we want to describe what an upsert is in SQL. All right, anybody heard what an upsert is? Right, well, that is something like: we want to insert the data, but if it's already there, we want to update it. Very common. Okay, so let's say we have this table out there, right? And we want to go ahead and use the MySQL insert-or-replace syntax. Then we want to, as I said, update one employee's salary and also add another one; well, we can go ahead and run it. Why use this as the example here? Because what you can see is that we are not showing everything in the example: we are working with some sort of seed data which was pre-created as part of a previous scenario, which is very common. Here is also another example of the same thing, but with Postgres, where we're using a different scenario. And you may ask: well, okay, this is how it works, but I know PostgreSQL also has the ON CONFLICT DO NOTHING syntax; what would happen if that's what we do? Oh, well, in this case, we can see that Emma's salary, which was the conflicting row, was not changed.
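The upsert behavior described above (update the existing employee, insert the new one; DO NOTHING leaves the conflicting row alone) can be reproduced in memory with Python's sqlite3, since SQLite borrowed Postgres's ON CONFLICT syntax; the table and names are made up for the sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
conn.execute("INSERT INTO employees VALUES (1, 'Emma', 100)")

# Upsert: row 1 exists, so it gets updated; row 2 doesn't, so it's inserted.
conn.execute("""
    INSERT INTO employees (id, name, salary) VALUES (1, 'Emma', 120), (2, 'Sam', 90)
    ON CONFLICT(id) DO UPDATE SET salary = excluded.salary
""")
print(conn.execute("SELECT id, salary FROM employees ORDER BY id").fetchall())

# ON CONFLICT DO NOTHING leaves the existing row untouched instead:
conn.execute("INSERT INTO employees VALUES (1, 'Emma', 999) ON CONFLICT DO NOTHING")
print(conn.execute("SELECT salary FROM employees WHERE id = 1").fetchone())
```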
So again, we can play with those things. Okay, that sets the landscape of what can be useful, but now let's look at the approach and how it works. If you think about tooling and doc creation, you will find that it is not easy to find good technical writers or documentation authors, and they can also be, let's say, protective of their time: they don't want to do a lot of useless busywork. So we want to make sure the writer experience matters, not just the reader experience, which we already defined as one of those interactive playgrounds. The approach we took in this project is to make it as seamless as possible. We don't want to say: hey, you are going to create your interactive code playground in completely different tooling, separate from the documentation, and then figure out whether it lives in the same version control, and so on and so forth. What we want is to take your documentation as it is and, as easily as possible, add the ability to edit and run. So you can see the run and edit controls we added here, and if I run, you can see the output of that documentation example. So how can we approach that, an integration which is easy on the writing side? It's actually quite simple. You write documentation in the same format you are used to, maybe a markup language, as in this example, or something else, and then you embed the codapi widget. The widget itself finds the previous code block and makes it interactive. There is no special thing required, and that pretty much works with any documentation tooling which already exists.
You can see that example here: the code which already existed just becomes interactive. Of course, hello world is always easy, so let's look at some more complicated examples. One which I think is very important is the template approach, which I briefly mentioned already. If I want to show something like this, a relatively complicated query, then for it to be meaningful I also need a pre-generated table, which I probably do not want to have in my documentation. This is done by providing a template. A template in this case is basically something which is run before the scenario itself. In this case I can write: hey, I created a table, without spelling out the specifics, because they're irrelevant here; I populated it with some data; and then I have the code, the code which came before. That is how a template looks. I can highlight where exactly in that context the interactive part of the documentation should run. Okay, here is another thing you will find quite helpful. If you are building some sort of tutorial, you often want to say: there are multiple steps, and I need the user to go through them one after another. That is the example here. We define a function in one code block and then use that function in another code block. We can, and I'll show you in a second, define a dependency between those code blocks. That means when you run this second section, the first section will always be run first. Let me, I don't know, break this code, for example, and then run the second one. It says: oh well, things got broken in the previous step.
How that works is that we refer to the first snippet as, say, cell number two, and then identify the second snippet as a cell which depends on cell number two. That means the content of the first cell will be run before the second cell is run, even if you, as the user, don't click run on it. And if you say, hey, I don't want to go through all five steps of the tutorial, I want to start with step number six because that is where the real meat happens, you can do it: you can just jump into the middle. Okay. So finally, how does this all work? There are actually a couple of ways. One is a browser playground with a sandboxed environment, which is pretty much Docker-based, and there we can use browser APIs, JavaScript and whatever. The second approach is WebAssembly. If you say, hey, we want no server component at all, it should run completely in the browser, we can do that, but WebAssembly can sometimes be heavy. If you want to showcase how, say, Postgres operates, pulling all of Postgres into WebAssembly and starting it may not be the best experience, especially on slower connections. That is where Docker can be very helpful: with Docker you can implement whatever you want, and the setup for this service is an open source project, so you can roll your own as well. There is a variety of existing playgrounds supported on the codapi website, which can get you started pretty quickly. Yes, here are some examples, and I will of course share the slides. This is a live tutorial online; you can see that a number of projects have already started to use this with pretty good success.
And you can see the showcase at codapi.org, where all the examples live. Here are the specific projects. There are two repositories: one is the JavaScript client side, and the other is the server side. It's split because you may want just the client side if you're using JavaScript or something else where you don't need a server component. And if you want to ask Anton more questions or give him feedback, antonz.org is his website. So that's all I had, and I would be happy to answer questions, or to get out of the way, because I think I'm the last thing standing between you and your beers. Yeah, we started with docs-as-code, and now we've gone to code-as-docs. Any questions? Do I understand correctly that the back-end is also part of this project or not? Yes, yes: in this case codapi is the Docker back-end, and codapi-js is the client; both of them are open source. What do you mean? Oh, you mean in terms of what people run, what kind of... no, not right now.
A universal data model for localizable messages
Hi, I'm Eemeli. I work for Mozilla. This is a talk I don't think I could give anywhere except to an audience like the translations devroom at FOSDEM, so I thought I would. In my work at Mozilla on localization systems, tools and standards, I've recently ended up spending quite a bit of time participating in the Unicode Consortium's project to define MessageFormat 2, an evolution of the ICU MessageFormat standard and a bunch of other things. I'm here not to talk about that specifically, but about a side product of that work: defining a data model for messages. In particular, messages that are not just a single segmented phrase you've extracted and can send off to translation, but more dynamic content as well. One of the interesting things we've ended up, not discovering, but kind of stating as the obvious, is that there's an upper bound to what makes up a localizable segmented phrase or message. It is limited by the keyword localizable, because it's dealing with humans: humans who need to understand it, but also translators who, still, are mostly humans, and who need to be able to take in the source message and produce an output from it that is understandable in their locale. And this ends up depending on a limited number of dimensions in which messages vary. Variants I've hidden there as the first one, and of course spoiled everything by saying so. It's the way message content can vary depending on inputs, like numbers and their pluralization categories: you have no apples, you have one apple, you have 15 apples; and gender-based determinants, grammatical gender, personal gender, all sorts of variation across locales and languages. But this is one dimension.
If you can express that, hey, we've got this variance happening, this message depends on these input values, that is a dimension we can express. Then of course, once we have a single pattern, a single sequence, it might include placeholders: it might include the number n for how many apples you have, or something entirely different. But then, finally, we've ended up, at least through the MessageFormat 2 work, determining that markup should be kept separate from placeholders. Markup here means something like HTML. It doesn't need to be HTML; it can be any syntax or any indicator saying that the content in this part of the message has these attributes, or something about it. Then, within a placeholder, we can have values like numbers that we need to deal with. They can be plain strings coming from an external source. We can also have annotations on them: we can say that this number here, yes, it's coming in as a number, but I want it formatted so it has at least two fraction digits, for instance. This needs to be accounted for in how the whole message ends up being formatted. Then, as I mentioned, we need to be able to deal with variables, because we are ultimately talking about the scope of dynamic messages. We need to be able to say explicitly that this message might be getting a number from the outside, or some string content, or anything else as input, and it needs to deal with those. But sometimes, within a message, we also want to do a little further processing on a variable value. We may want to select a certain part of it, capitalize it if we're talking about a string, do other sorts of transformations, or express the same value in multiple different ways within a message. So we need a little bit of tooling to deal with variables. And that's it.
Through working on MessageFormat 2 for the past four years or so, we've not come up with effectively anything else that is core to driving the qualities of a formattable message. That's ended up meaning that one of the things we've produced out of this whole project is this data model for what a message looks like when you don't consider its syntax, when you consider it as a data structure. I'm not going to go through all of it, but roughly speaking, a message has two different forms it can take. It can be either just a single pattern, a single sequence that we're dealing with, or it can have some variance: that's the select message over there, which has some selectors that, when formatting, guide us towards selecting one of the variants of the message. The declarations let us declare the input and local variables that exist for this message. And then the variants, with the catch-all key, end up defining how it really works when we have multiple message variants. When you get within a single pattern, again, as I alluded to, it can just contain literal content, a string; or it can have expressions, that is, placeholders; or it can have markup, which can be opening or closing. We also included standalone markup there, so you can have an element, for example an inline image, expressed within a message. Then we can have literals, variable references, and the annotations I mentioned. That's it. These two slides define the whole data model we've ended up dealing with. Okay, I left some tiny details out: for example, an expression needs to have at least one of an argument or an annotation in order to be valid, and other minor details like that. But that's it. This is, we think, a universal data model for representing messages.
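As a rough illustration of the shape just described (a simplified sketch following the talk, not the normative MessageFormat 2 data model spec; class and field names here are my own), the two message forms could be written as Python dataclasses:

```python
from dataclasses import dataclass, field
from typing import Optional, Union

@dataclass
class Literal:
    value: str

@dataclass
class VariableRef:
    name: str

@dataclass
class FunctionAnnotation:   # e.g. a "number" annotation with options
    name: str
    options: dict = field(default_factory=dict)

@dataclass
class Expression:           # a placeholder: an argument and/or an annotation
    arg: Union[Literal, VariableRef, None] = None
    annotation: Optional[FunctionAnnotation] = None

@dataclass
class Markup:               # "open" | "close" | "standalone" (e.g. inline image)
    kind: str
    name: str

# A pattern is a list of literal strings, expressions, and markup.

@dataclass
class PatternMessage:       # a message that is just a single pattern
    declarations: dict = field(default_factory=dict)
    pattern: list = field(default_factory=list)

@dataclass
class Variant:
    keys: list              # literal keys; "*" plays the catch-all role
    pattern: list = field(default_factory=list)

@dataclass
class SelectMessage:        # a message with selectors and variants
    declarations: dict = field(default_factory=dict)
    selectors: list = field(default_factory=list)
    variants: list = field(default_factory=list)

# "You have {n} apples", varying on the plural category of n:
msg = SelectMessage(
    selectors=[Expression(arg=VariableRef("n"),
                          annotation=FunctionAnnotation("number"))],
    variants=[
        Variant(keys=["one"], pattern=["You have ", Expression(VariableRef("n")), " apple"]),
        Variant(keys=["*"],   pattern=["You have ", Expression(VariableRef("n")), " apples"]),
    ],
)
```

The point of such a structure is exactly what the talk argues: any concrete syntax, once parsed into it, can be validated, transformed, or formatted without the consumer caring where the message came from.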
And I'm here basically saying: hey, I think this is kind of cool. This is not necessarily relevant just to the work on MessageFormat 2 the syntax. It's a data model that allows us to separate the concerns around syntax: whether your messages are stored in gettext files, ICU MessageFormat, Fluent, literally any format, you can take that syntax and parse it into this data model structure representing a message. And this, I think, leads us to a world where we can consider more of a Unix philosophy, a separation of concerns. I have, yes, cherry-picked explicitly the part of the Unix philosophy that says to do one thing and do it well, and not included, for instance, the part about communicating values from one process to another as strings, because that part hasn't worked out so well here: we need those parsers, and if we need to understand all of the structure in a message every time, we end up, for the most part, mixing up syntax concerns with everything else we're doing with messages. So, some of the things you can do with this data model, as ideas: if you can read and write between a syntax and this data model, and you can do this with multiple syntaxes, this is effectively an interface through which you can take messages from one syntax, turn them into the data model representation, and from there go to any other syntax, with caveats, but roughly. Another thing is that we can build tooling on top of this. You can build a linter or a validator on top of the data model representation of messages rather than on any syntax representation, which means you can use the same validation for all messages, independently of what syntax they come from.
And if you have these capabilities, consider that many established localization systems right now are very much monolithic. They have expectations: this is the exact syntax, in these sorts of files, used for messages or resources; this is exactly how you deal with them; this is what gets included in your output or your program; and this is exactly how it works. But since we're defining a data model that can read any of these syntaxes, it means you can build a different sort of formatting runtime on the same syntax. So you can start from where you are now and, if you want to change how you're doing localization, you don't need to change everything all at once: you can change just the formatting runtime to work with the same messages you've got, and move on from there. Or vice versa, actually: you could change how you store your messages and still use the same runtime, because this brings the ability to very effectively transform your messages from one syntax to another. And when you're dealing with localization, you of course need to deal with translation, which means you need to somehow present to your translators the messages they're working with. If a translation tool or framework goes through the MessageFormat 2 data model, you can build an interface for localizers where they don't need to know what syntax lies underneath the placeholders, the variables, the markup, or anything else; they can be presented the same thing for all syntaxes, which might make things a little easier for everyone. So those are the ideas I came up with for what the next steps could be, but mostly I'm here saying: hey, this is a cool thing, you should play around with it.
For us, the current and ongoing work is to extend this sort of definition to also include message resources, and the comments and metadata that are quite essential for communicating the context of a message to translation, which, as I hope some of you noticed, was completely left out earlier. That's intentional, so we can separate these considerations from each other. But that's it for me; thank you very much for listening. I'd be very happy to take any questions or comments. In another talk, I heard about MessageFormat 2 and function invocations. How do function invocations relate to the data model? The question is how function invocations relate to all of this, and yes, they are represented here, in the function annotations. Something like plural selection, for example, could use a function with the name plural, appearing as the selector expression in a select message, which is there. The next question was whether there is a set of built-in functions that are supported. MessageFormat 2 does come with a relatively small set of built-in functions. The data model itself does not presume this exact set, and the set of functions can be extended. For MessageFormat 2 in particular, we are looking at a bare minimum of effectively number, which does plural and ordinal selection but also acts as a formatter, and string, which is a sort of default for string formatting but also does the same kind of selection as the ICU MessageFormat select does. We are still discussing what other things to include in MessageFormat 2. Now, of course, when representing messages coming from some completely different syntax, it is entirely conceivable that it is not directly possible to express those messages using the functions MessageFormat 2 defines out of the box.
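To make the built-in number selector concrete, here is a toy sketch of plural-based variant selection, with a hard-coded English plural rule (real MessageFormat 2 implementations use CLDR plural rules; the function names and dictionary shape here are invented for illustration):

```python
# Toy plural selection in the spirit of the built-in `number` function.
# English-only rule; CLDR defines the real per-locale categories.
def plural_category(n: int) -> str:
    return "one" if n == 1 else "other"

def format_message(variants: dict, n: int) -> str:
    # Fall back to the "*" variant when no key matches,
    # mirroring the catch-all key in the data model.
    pattern = variants.get(plural_category(n), variants["*"])
    return pattern.format(n=n)

apples = {
    "one": "You have {n} apple",
    "*":   "You have {n} apples",
}
print(format_message(apples, 1))   # You have 1 apple
print(format_message(apples, 15))  # You have 15 apples
```

Note that the variants are whole messages, not inline fragments, matching the structure the talk says is easier for translators to work with.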
But the data model does allow you to state that a function like this must be used here, and you can define elsewhere how that function works, if that makes sense. And it's possible to make translations between these function meanings. Anything more? The reason to separate context from the minimum required, and here I'm jumping straight into the answer, is that context is absolutely required for the translation work, but it is not absolutely required for formatting a message. So we need to be able to represent it, but we do not absolutely need it to be part of the message itself when the message is being formatted, and this is why we deal with it slightly separately. They are very much related concerns, but with the data model we've tried to find the minimum required for representing a message, and when you trim down to the minimum, the context ends up as a thing we can define externally, so we've chosen to do that. And if you're interested in that, in particular the specifics of what we should include in the base set of metadata and context fields, here's an issue link where we're discussing this right now; I would be very happy to have your input. Anything more? Regarding translator tools: most translator tools now present a string and expect that the translator will write back a string. Do you imagine this will change, so that the translator sees the elements of the data model in a more graphical way and chooses translations through combo boxes or something like that? Or do you think it will stay a string representation for translators in the future? I have no idea, and anything is possible, and that's kind of cool. Predicting the future of the translator experience is, shall we say, a hard question.
One thing I do think is that this sort of data model makes it easier to build tools that can present to a translator the options and opportunities they might have in modifying a message, including content like placeholders and markup, which might just show up as raw syntax when presented in a string, where it can be a challenge to even realize that I could change how this bit is styled. If we can build interfaces that read the data model and understand from it that, hey, hang on, this could be tweaked this way, then richer interfaces could be built. However, we do of course need to keep in mind that the vast majority of cases are best represented as a pure string. So the majority of the work is not going to change, but the corner cases are where it gets interesting and challenging, and for those there might be opportunities to present messages in a more translator-friendly way. One part I kind of skimmed over, which was mentioned in Ujjwal's presentation yesterday on MessageFormat 2, is that here the selection of variants is not an inline component, as it is for example in ICU MessageFormat or Fluent; instead, all of the variants need to be complete messages, presented at the top level of the whole message. This is entirely intended to guide towards structures that are easier for translators to deal with, rather than having to figure out "you have" and then a selector of apples. Instead, you have a selector which gives you "you have one apple", "you have three apples", that sort of interface. But yeah, anything more? If not, thank you very much for your time, and that's it for me.
Long Term Effort to Keep Translations Up-To-Date
Okay, can I start now? Okay, sorry for the delay. I will try to present how the Indonesian teams do translation for several translation projects I have been involved with. This started from one project: I created statistics from one project, and then I tried to add other projects, and I greatly miscalculated the effort needed to do the statistics. You will see how difficult it is, maybe because I'm not skilled enough to create long-term statistics; maybe some of you can help me. So, this is the start. Why do we, the Indonesian team, do translation? Because in Indonesia we have around 276 million people, and among those people we have several major languages. Actually we have more than 1,000 languages. Every small community, which in Indonesia we call a desa or kampung, a village, has its own language. But do we really need to support or translate everything? No. At minimum we should consider translating into the big six languages. But I myself am only fluent in Indonesian and Javanese, so I cannot help with the other languages. So I thought: let's start with the one language used by most people here, so we can do something that can quickly be used by many people. Okay. In my talk I will compare LibreOffice, GNOME and Ubuntu. Why these three? Because I thought they had sufficient data to build long-term historical data. I thought so, but let me show you how difficult it was to extract. The other thing is that all three have a good, periodic release schedule. For instance, LibreOffice releases in February right now; I think last year they released in March and September, but starting this year, February and September. And GNOME is usually a little later than LibreOffice.
They also release twice each year. Ubuntu too; you can see it from the version numbering, something.04 and something.10, so Ubuntu releases every April and October. The platform for translating the current release has very detailed, very good statistics. But if you want to see a previous version, especially a version that is already out of support, it's not easy. So with that in mind, I planned to compare how many strings changed in each release. Is it easy to get that kind of data? No, it's difficult. Why is it important? Because we translators need to know how many strings, or how many words, so we can size the work: guess how many days or hours we need to spend on the translation. I think it would be helpful if every project could produce that kind of data extract to prepare their translation teams. And how far along a translation is, you can usually express as a percentage. Some projects use a percentage threshold to decide whether a language's translation is definitely included in the release; below a certain percent it is treated as a beta version or something like that. Again, for the current version this number is easy to find out, but for past releases, some platforms may still have the data, others not. The other thing I want to know is who did the translation. For instance, in my teams, in this project I have, say, 10 members, and in that project 5 members; who really does the translation? Not every platform can provide that kind of data.
But with GNOME, because the workflow uses whole files, download, claim for translation, then upload, it is quite obvious who is doing which translation. Okay, so let's see LibreOffice. Right now the main platform is Weblate. It has a powerful search facility, too many options; it's quite difficult to build the correct query to convey my need and extract the data I want. It also has the Weblate API. Maybe I'm old school, I had never used an API, so when I tried, there was so much data, so many options; how do I create the correct API call for this? So Weblate is good, but it needs time. I'm not sure who has experience building something on their API; I want to learn from him or her. The other thing is that Weblate runs a cron, a scheduled job, to push from the translation platform into the main git repository. So I can actually choose between data sources: do I want to take from Weblate directly or from the git repo? I tried the git repo too. For instance, I created a clone, switched to a certain release, and there is a directory for the Indonesian translation; from that directory I tried to list all commits, and from there I can see how many lines changed and who made the commits. But actually, when someone does a translation in Weblate, the data stops in Weblate, because the commit from Weblate to git is done by a special account, not the original translator. So there is data missing if we take it from git commits, though some other details are good. And actually, the data I can present here came from the Wayback Machine and from my wiki, because I maintain a wiki to keep track.
I track maybe 20 or 30 translation projects into Indonesian, so many that I don't want to keep bookmarks; my wiki is my bookmark for translation. So, this is the result, the latest status for LibreOffice. You see the UI is only 99% finished; only a few strings are left, because we are not sure how to translate them, they are missing context. But this one is very bad: help is at 70%, and even for the newest release, less than 70%. The UI over the last four years, with six months between releases, has been relatively close to 100%. Not so for help. For version 7.0 we had 3,000 untranslated strings, and right now 18,000 untranslated. We had a good result with version 6.2, when UI and help were both 100% translated. That could happen because we held a translation sprint of two or three days with maybe 30 people, going together, boom, zero untranslated. But of unknown quality, because those people started translating only for that occasion. So that time we tried for quantity, not quality. But after that, they were only involved in that one occasion, and the long-term translators did not have enough energy to keep up with help. I sometimes try to translate maybe 10 or 20 strings. Why is help so heavy? Because a UI string usually consists of one to five words, maybe 10 at the longest, but one help string consists of 30, 40, 50 words, very long. So you cannot compare just by string count, no; you need to look at how many words to really see how large an effort is needed. Okay. So, who did it? Actually, only two people.
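The point about word counts versus string counts can be turned into a tiny planning helper; the words-per-hour rate below is a hypothetical planning number, not something from the talk:

```python
# Rough effort estimate by word count rather than string count: a 40-word
# help string is far more work than a 2-word UI label.
# words_per_hour is a made-up planning rate, not a measured figure.
def estimate_hours(strings, words_per_hour=500):
    total_words = sum(len(s.split()) for s in strings)
    return total_words / words_per_hour

ui = ["Open file", "Save as", "Cancel"]   # short UI labels, 5 words total
help_strings = ["To insert a table, choose Table, then Insert Table, "
                "and set the number of rows and columns in the dialog."]

print(estimate_hours(ui))            # 5 words
print(estimate_hours(help_strings))  # one help string outweighs all three labels
```

With real data this is the kind of sizing that lets a team plan a sprint: total words divided by an agreed rate gives person-hours.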
These three, I think, were involved in updating the source, not doing the translation, but they show up in this data. Okay, that's LibreOffice. Then let's see GNOME. These are the latest numbers. You can see GNOME divides its many modules into several categories. What we prioritize for translation is this group, the user interface. Other groups, for instance GIMP and friends, I have never touched for many years, because not every project is easy to translate. GIMP is an image editor, and I'm not familiar with the terms used in that community, what kind of translations they usually use, and I don't have enough time to consult with what we call an SME, a subject matter expert, on image processing. So I didn't do that. But the UI is in good shape; help still needs some effort, not quite good yet. Here is the statistic: since version 3.6, the GNOME Indonesian translation has almost always been at 100%, and help kept getting better. In this period I had too much free time, so I did many, many translations, and after that, busy times. And this is how many commits, used to calculate who committed: actually only two people here. Two people maintaining that kind of percentage for how many years? I don't remember, maybe more than 10. I became the GNOME Indonesian translation coordinator in 2010. I have asked many people: would you replace me? And no one has come forward. There is nothing special about being translation coordinator; it's just that no one else wants to do it. Okay. Ubuntu is the most difficult one. You see here: 200,000 strings untranslated. So much. Why? Because many of them are GCC. Do you want GCC to be translated? Why? Do you want GDB to be translated? Or this or that library?
Because if you want to use their translation platform... what was it called again? Anyway, they do not create subcategories, so the numbers are quite misleading about how well Indonesian, or any language, is actually translated in Ubuntu. Okay. I think that covers the statistics. So why do I care about this? Because I hope that if another team can estimate the effort needed, maybe they can use it to create a funding proposal, or to plan a translation sprint: how many people are needed, for how long, and a target of how many strings can be finished. It's rather difficult to reuse data from one language for others, but at least we can have an approximation. Okay, I think that's my talk. Any questions? Good. Ah, sorry, yes. We are all from Asia, right? I'm not very familiar with Indonesian, but I know you did a lot of translation work upstream, like Ubuntu. If I go back to China and want to do some translation for upstream open source projects, how can I start? Sorry, how can you start? What language, Chinese? As far as I know, in Ubuntu they have traditional Chinese and the other one, so you just join one of those existing teams and start translating. I think it's quite easy; I'm not sure of the numbers, but there is already a team with many translators, and you can join them and continue. For a language that has not been started it's rather difficult, but for Chinese, yeah, Ubuntu is quite established, I think, for the Chinese translation.
For different projects, you need to check; they have different ways of doing translation, different platforms, different processes. For instance, my three examples, GNOME, LibreOffice and Ubuntu, use different platforms and different processes. For GNOME, you take the whole translation file, do the translation locally on your computer, and upload it back to the system. For the other two, Ubuntu and LibreOffice, you can do it two ways: translate online, or download the file, translate locally, and upload it back. Okay, other questions? No? Ah, yes. In the beginning you mentioned you had some problems using the Weblate API for your projects. How many projects, how many different languages, do you actually use within the Weblate process? Six different Indonesian local languages? No, no, only one, the main Indonesian. So it's only one project? Yeah, language code ID. And what exactly are the problems with the API? For instance, if I want to see how many strings are left untranslated for Help, for this version: the API request has to be expressed as a URL, then the main function, then the project name; they have a list like UI master, UI for a certain version, Help master, Help for a version. That's the project name. And then the next component, what do you call it, a module? From this one project name, there are 200 modules or something. And then slash language, ID. So I need to iterate over all those things and add everything up to create the summary. It's quite difficult for me. So it's more an organizational problem? Yeah, yeah.
So maybe, because I don't understand the API well enough, my approach is not efficient enough. That's why I need to consult and discuss with whoever is more familiar with that API. Okay, no more questions. Thank you.
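The iteration the speaker describes (project → components → language, then summing the remaining strings) could be sketched roughly like this. This is a hedged sketch, not the speaker's actual script: the endpoint shapes follow Weblate's public REST API as documented, but the server URL and any project slugs are made-up placeholders.

```python
# Sketch: sum untranslated strings across all components of a Weblate
# project. Endpoint shapes follow Weblate's documented REST API
# (paginated /api/projects/<slug>/components/ and per-translation
# /statistics/); the API base URL here is a placeholder.
import json
import urllib.request

API = "https://translations.example.org/api"  # placeholder server

def get_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def iter_components(project):
    """Yield component slugs, following Weblate's pagination links."""
    url = f"{API}/projects/{project}/components/"
    while url:
        page = get_json(url)
        for comp in page["results"]:
            yield comp["slug"]
        url = page.get("next")

def summarize(stats):
    """Add up per-component statistics dicts (total / translated)."""
    total = sum(s["total"] for s in stats)
    translated = sum(s["translated"] for s in stats)
    return {"total": total, "translated": translated,
            "remaining": total - translated}

def untranslated(project, lang="id"):
    stats = [get_json(f"{API}/translations/{project}/{slug}/{lang}/statistics/")
             for slug in iter_components(project)]
    return summarize(stats)
```

With ~200 components this is still a couple of hundred small requests, which is exactly the overhead the speaker is describing.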
The importance of Web Performance to Information Equity
All right. Hey, everybody. Welcome to the Web Performance dev room, which I facilitated this year with Mozilla and Wikimedia. We're going to get straight into it, and I'm going to introduce Bas Schouten. And, yeah, I'll pass this over. In Yemen, the cost of a gigabyte of data is approximately $16. In Chad, the cost of a gigabyte of data is around $20. In Yemen, the average income, or the median income, is about $250 a month. In Chad, the median income is about $60 a month. Hundreds of millions of people in the world live in areas where they spend a significant portion of their income on their data bundles, often lack stable charging facilities, and can only imagine what it would be like to have the kind of high-end, or even mid-range, devices that people like you, I, or anyone else in this room have. Often, when we think about performance, we're thinking of making our websites faster. We're looking at making them faster and more fun to use in order to improve our conversion rates and sell more products. I'm Bas Schouten. I am a principal engineer at Mozilla and the tech lead for Firefox Performance, and I'm going to talk about how performance is so much more than that. I'm sure most of you are familiar with countless statistics about the limited means that the poorest half of the world's population live with. There can be no doubt that those people deserve the same access to information as you, I, or anyone in this room. Understanding the importance of that information equity means understanding the importance of web performance. When we're talking about performance, we're usually talking about one of three things. The primary, most obvious one is speed: how fast and how smoothly the result of a user's interaction with your site or service actually renders on their device. Another aspect that feeds directly into that is data usage.
Obviously, before you've sent the data to the device, there's no way they can see what you're about to render. But something a little less obvious is that power usage is also an important aspect of performance. Not only are you going to help the environment by using less power, you're going to extend the lifetime of people's devices: making their batteries last longer, but also causing them to heat up less and spin their fans up less, keeping the devices more comfortable to use, decreasing wear and tear and increasing longevity. In the time we have together, I want to talk about how people living with more limited means are at a disadvantage on all three of these pillars of performance, and specifically web performance. We'll go over what the global landscape looks like, and the situation, particularly in the global south, that people are dealing with. I'm confident you'll be left more motivated to improve your sites and services, and as a result, you'll pay extra close attention to the speakers here for the rest of the day. For now, the first thing I want to talk about is raw speed. By raw speed, I mean the performance of the CPU of the device: once a device has all the instructions it needs to render something onto the screen, how quickly can it do that? What does that look like? Well, here I've compiled a list of the most popular smartphones in Africa versus Europe. Now, getting good public data for Africa is actually kind of hard, but that's not too important right now. The most important thing is that this list of phones for Africa is probably as unfamiliar to you as it was to me. This Hisense... what now? I've never heard of these things, right?
And the important thing here is the trend. Almost every single device on the Africa list is at least 2x to 3x slower than the devices we see here. And do not let the names fool you: that itel S23, a cute naming trick, is nothing like the same-named model from Samsung. What that means is that if my LCP takes 500 milliseconds of CPU time here, delivering that same LCP in another part of the world will take at least a second. Now, we know that LCP impacts conversion rates quite significantly: a one-second improvement to LCP means a 27% improvement to conversion rates. And that is not just about how likely you are to sell your products; it's also about how likely people are to access the information you are presenting to them. And of course, this is not limited to the global south. The performance improvement the Samsung S23 offers over the itel S23 comes at a hefty price tag, with the Samsung S23 costing over 650 euros and the itel S23 under 150. It doesn't take a genius to know which class of society is more likely to own one device over the other. But obviously, the raw performance of devices is not the only difference between here, where most of us live, and the global south. And other aspects of those differences have a much more direct impact on people living with more limited means. The most important one, which I'm going to talk about next, is data usage. Let's take a look at the global landscape when it comes to data usage. I've pulled a list off Wikipedia of the countries with the slowest mobile data connections. One of the first things you'll notice is that the mobile speed in none of these countries exceeds 20 megabits per second, or about 2.4 megabytes per second. And for some of them, the landlines don't even exceed 1 megabyte per second.
But an important thing to note is that this Wikipedia list is built from speedtest.net results. We can assume people aren't likely to run speedtest.net when they're not in a good connection situation. And even the slowest here, 3 megabits per second, is almost the maximum speed of 3G: 3G tops out at about 500 kilobytes per second. Let's look at that a little. Displayed here is a 4G coverage map for one of the major carriers in Nigeria, the most populous country in Africa, with approximately 225 million people. What you can see is that outside the major population centers, there's not much, and a lot of the really sparsely populated areas don't even have 2G. So you can assume that for a lot of these people, the fastest connection they could possibly have access to is about 500 kilobytes per second. Now compare that to an FCC 4G coverage map of the United States: unless you're visiting a national park, you're probably going to have 4G, and if not, you're still going to get 3G. That's a very different situation. But of course, the speed of mobile data transfer is not the only component that differs between western countries and the global south. A more pronounced impact your users will experience is through cost. Visualized here is the cost per gigabyte of mobile data. In Chad, a gigabyte of data costs over $20. In the United States, it's less than $10, and in most European countries it's even lower. But even if we ignore the outliers, it's important to realize that the global average is roughly $4 per gigabyte. Compare that to a global median income of about $300 a month; half of the world's population has to make do with less than that. So let's think about that for a second, and think back to the introduction of the talk.
A gigabyte of mobile data in Chad costs about $20; a monthly income in Chad is about $60. That means that if your site ships one megabyte to the median user in Chad, it costs them about 0.03% of their income. If your website weighs about three megabytes per visit and a user visits once a day, it costs that user about 1% of their income. To make that a bit more concrete, I went to bbc.com, opened about five articles, and read them. During that time, bbc.com consumed about 17 megabytes. If a median user in Chad chooses to read five articles a day on bbc.com, every day, that consumes about 15% of their income. Add to that the consideration that 95% of sub-Saharan Africa accesses the internet solely through mobile devices, and you can see what an immense impact the data consumption of your websites can have on the disposable income of the people living there. And of course, reducing data usage is not just about making your site faster; after all, on 3G, those three megabytes take at least six seconds to retrieve. It's saving you and your users money, as we already talked about, and it's going to lower the carbon impact of your sites and services. And talking about carbon impact, let's talk a little more about power. When we think about optimizing power, we shouldn't just think about reducing power usage by making our websites render faster. Obviously, if your website consumes less CPU, it's going to use less power. But more impactful for power is: what are we doing when the user isn't actively interacting with our website? We should be avoiding animations, videos, and animated ads when a user is just reading our site. And of course, we should minimize the amount of JavaScript associated with simple interactions. This comes with a myriad of benefits, even though the two watt-hours a visit to your site might consume might not sound like much.
If your site has a million visitors a day, those two watt-hours become 2,000 kilowatt-hours a day. For comparison, the per-capita power consumption in Africa is about 150 kilowatt-hours a year. And when we delve a little deeper, there are a lot of other advantages. You're going to decrease the amount of heat users' devices produce, particularly on slower devices. You're going to reduce fan spin-up and wear and tear, making devices more comfortable to use. But most importantly, you're going to extend the lifetime of their batteries. And this is again an area where the global south is disproportionately affected. There are about 1.1 billion people living in sub-Saharan Africa. By estimates of the International Energy Agency, about 40% of those 1.1 billion people live without access to electricity. I want you to stop and think about that: there are more people living in sub-Saharan Africa without access to electricity than there are living in the United States and Canada combined. For many others living there, access to electricity is limited and power outages are frequent. Of course, people with no access to power are also less likely to have mobile phones. But the millions of users who do own mobile phones and have no or limited access to power, and all the countless users who are going to be coming online over the next decade, are often dependent on centralized charging facilities to charge their devices. Needless to say, for them, their phone running out of battery is a very different situation than for most of us in this room, where a phone running out of battery means you have to grab a charger, it's a nuisance, or maybe you grab your power bank, right?
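The back-of-the-envelope numbers in this part of the talk can be reproduced in a few lines. This is just a worked check of the speaker's arithmetic; all input figures (cost per gigabyte, median income, page weights, watt-hours) are the approximate values quoted in the talk, not measured data.

```python
# Reproducing the talk's back-of-the-envelope arithmetic.
# All figures are the approximate values quoted by the speaker.

COST_PER_GB_CHAD = 20.0    # USD per gigabyte of mobile data
MEDIAN_INCOME_CHAD = 60.0  # USD per month

def income_share(megabytes_per_day, days=30):
    """Fraction of a median Chadian monthly income spent on data."""
    cost = megabytes_per_day * days * COST_PER_GB_CHAD / 1024
    return cost / MEDIAN_INCOME_CHAD

# A single 1 MB page view: about 0.03% of a month's income.
one_page = (1 * COST_PER_GB_CHAD / 1024) / MEDIAN_INCOME_CHAD

# Five articles a day at ~17 MB: about one sixth of a month's income,
# in the same ballpark as the ~15% quoted in the talk.
bbc = income_share(17)

# Power: 2 Wh per visit times a million visits a day = 2,000 kWh/day.
kwh_per_day = 2 * 1_000_000 / 1000
```

Running the numbers this way also shows how sensitive the result is to page weight: trimming that 17 MB to 3 MB cuts the monthly cost share by more than a factor of five.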
So now that we have a better understanding of what the world looks like in terms of the internet, where does that leave us? The internet is going to play an increasing role in everybody's lives, from how you do your taxes to how you're billed by your service providers. The potential for the internet to do good there is immense, and that potential can particularly benefit people and organizations with limited means, by reducing costs for staffing, travel, and time for exactly those people and organizations that are least able to afford them. At the same time, it reiterates the responsibility we have as developers to ensure we have a responsible impact on the most vulnerable communities on the planet. Now, we're here at FOSDEM, which means that hopefully most of us are working on open source projects, and if you're anything like me, other projects, commercial or otherwise, using your code is a great source of pride. When we're designing our components and our code, we may not be thinking about these particular use cases; we may be thinking about users who are not affected by these disadvantages. But we have to think about what other projects may be using the code we're writing and what users they might be reaching, and those users may be in those vulnerable positions. Thinking about them means thinking about web performance. The great news is that the work we do to make our sites faster, use less data, and use less power isn't just good for those users. You're going to be making your websites faster for your own users, especially when they're riding a train through a tunnel or standing in an elevator. You're going to be keeping their devices cooler, making them more comfortable to use and longer-lived. You're going to be helping the environment.
The greenhouse gas emissions produced by the internet and the devices we consume it on are vastly more than all of global aviation combined. And of course, this works the other way around as well: when you make your websites faster for your users here, you're also helping people in more vulnerable positions. Since you're all here at this early hour, I'm certain many of you were thinking about web performance already, and thank you for that, and thank you for being here. I'm confident you're going to be even more excited to make all your websites faster, and there are a bunch of great speakers coming up the rest of this morning to help you do exactly that. Enjoy the rest of your day. Are there any questions? It's an interesting question. The answer... oh, yes. So the question is: what do we do, or what do I do, and I guess that means Mozilla, to make sure Firefox works well on devices with limited CPU. I think the short answer is: not enough. Like probably most of you and most developers out there, almost everyone working on Firefox is on fast devices, fast phones, many, many iPhones, where we don't even ship our own engine, right? And that is a hard thing to change; that's a hard mentality to change in the business of software development in general. We do explicitly test certain low-powered devices and their performance characteristics, but the global landscape is very diverse. The reality is that we tend to work a lot on making Firefox faster and less resource-hungry on fast CPUs, and then we hope that translates to a better experience on slow CPUs. One day, perhaps, we'll have optimizations that target very specifically the different types and compositions of CPUs, the heterogeneous architectures and so on that are more common in the global south, but we do not currently do that. Anybody else?
The next speaker has five minutes to set up.
Let's build a RUM system with open source tools
Hello, everyone. Today I'm going to be talking about building a real user monitoring system with open source tools. Before I dive in, a bit more info about me: my name is Tsvetan Stoychev. I work on mPulse at Akamai. mPulse is a real user monitoring system, a commercial one, and it serves, I think, thousands of Akamai customers. And my hobby project is Basic RUM, which will be the focal point of this presentation. Before I dive in, I'd also like to share a bit about some of my other personal activities. Every December, I make an attempt to publish at least one blog post on the Web Performance Calendar; that's the best place to see what happened in web performance during the year. And the other thing is, sometimes I do street art. That's my safety net plan: if ChatGPT takes over the world, I'll still have something creative to do. Yeah. So let's move on to the important part of the presentation, and take a look at how, in general, a real user monitoring system looks. We need something in the browser, ideally a JavaScript agent, that reads some data and sends it to a server. We store it somewhere, and later we analyze this data. And here we see the most trivial piece of JavaScript, the bare minimum that will do the job in the browser: it reads the current page URL, creates a one-by-one-pixel image, and appends it to the HTML, which forces the browser to make a request to our endpoint. And here is a really simple code snippet of what the server side looks like when we need to intercept this data and store it somewhere. So here is the route where the browser will hit.
We read the query parameters and the headers, we even put a timestamp in the structure, then we save it as JSON on the file system and return a transparent GIF to the browser. And at the next stage, when we want to analyze the data, we go through all the files and can create a summary of the page visits. For example, here we can see that category four was the most visited page, with 427 page visits. So that's the theory. In 2019, I started Basic RUM as a hobby, and these are the components I used to build the initial version. On the browser side, I used an open source library called boomerang.js, which collects a bunch of interesting metrics from the browser and sends them to a server. On the server side, I used nginx and some PHP application code. For storage I used MySQL; for analyzing the data I again used PHP, to read the data and serve it to a frontend; and on the frontend I used Plotly.js for visualizations. I ended up with something like this. It's actually really interesting: after five years, it's still running. If you want to give the first version of Basic RUM a try, you can visit demo.basicrum.com and play with the UI. Now, about boomerang.js. Boomerang.js was started in 2011 at Yahoo by Philip Tellis, who now happens to be a colleague of mine, and the library is currently maintained by the mPulse engineering team at Akamai. As I mentioned, the library collects a bunch of interesting metrics, including the Core Web Vitals: LCP, CLS, FID. It can also track session data, it can help users of the library create a timeline of all the clicks on a page, the life cycle of a visitor, and it supports the more modern JavaScript APIs for sending data to the server: fetch, XHR and sendBeacon. It can be found on GitHub at akamai/boomerang.
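The collection endpoint described above (store each beacon hit as a JSON file, answer with a transparent GIF) can be sketched in a few lines. The talk's version was PHP; this is a hedged stand-in in Python, with the filename scheme and port made up for illustration.

```python
# Minimal beacon collector in the spirit of the talk's PHP example:
# store each hit as a JSON file, reply with a 1x1 transparent GIF.
import json
import time
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# The classic 43-byte transparent 1x1 GIF.
TRANSPARENT_GIF = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00"
                   b"\x00\x00\x00!\xf9\x04\x01\x00\x00\x00\x00,"
                   b"\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02D\x01\x00;")

def make_record(query_string, headers):
    """Build the stored structure: query params, headers, timestamp."""
    return {"params": parse_qs(query_string),
            "headers": dict(headers),
            "ts": int(time.time())}

class BeaconHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        record = make_record(urlparse(self.path).query,
                             self.headers.items())
        # One file per hit -- exactly the overhead discussed later.
        with open(f"beacon-{uuid.uuid4()}.json", "w") as f:
            json.dump(record, f)
        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.end_headers()
        self.wfile.write(TRANSPARENT_GIF)

# To run the collector (blocks forever):
# HTTPServer(("", 8080), BeaconHandler).serve_forever()
```

The one-file-per-hit approach is deliberately naive; as the talk goes on to explain, it is precisely what made the cron-job batching necessary.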
On the back end side, that's again very theoretical, but what was actually happening: I was still saving every request that reached my server in a file, and then periodically running a cron job, which I've marked here as a bit too much overhead; you'll understand why later. The cron job read all these collected files, created one big batch, and inserted the data into MySQL. I also ended up with a database model that was very biased: my previous background was building Magento online shops, and if you've ever worked with Magento you'll probably recognize some patterns in all these foreign key relationships and the main table at the center of everything. I had to put a bunch of indexes here, and again, this created too much overhead, on the application level but also for me as a maintainer. Every time I wanted to introduce a new dimension, I had to create a new table and add more code for inserting the data; it was just too much maintenance for me. I also had to take care not to duplicate some of the data, and this is because of the nature of PHP: PHP is stateless, every request is independent of the others, so I couldn't keep things in memory. If I could have kept some references in memory, I probably could have optimized some things. And actually, a question to the audience: do you have an idea what this query would produce? What's the idea behind it? Bucketing? Yeah, it's bucketing for a histogram. I also had to write a lot of these data scientist-type queries, which introduced a bit of a learning curve for me, but the system really had such queries coded into it, and this one here produces a histogram of the time to first byte.
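The bucketing idea behind such a histogram query can be sketched like this. The talk showed it as SQL; this is the same trick expressed in Python, with a made-up bucket width of 200 ms.

```python
# Histogram bucketing: floor each value to its bucket's lower edge and
# count per bucket -- the same idea as SQL's
#   SELECT FLOOR(ttfb / 200) * 200 AS bucket, COUNT(*) ... GROUP BY bucket
from collections import Counter

def histogram(values_ms, width_ms=200):
    """Map each measurement to its bucket start, then count occurrences."""
    return Counter((v // width_ms) * width_ms for v in values_ms)

ttfb = [120, 180, 250, 260, 310, 1790, 1810]
# histogram(ttfb) -> {0: 2, 200: 3, 1600: 1, 1800: 1}
```

The bucket width is the tuning knob: too wide and the skew the speaker mentions disappears; too narrow and the histogram turns into noise.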
We can see the median is around 1.8 seconds; it's a bit skewed. With the help of Plotly, the JavaScript library for visualization, I could create panels like these for distributions of operating systems and mobile operating systems, and bar charts showing the relationship between time to first byte and start render time. Plotly is a really cool, rich library, and you can build a lot of panels with it. But I found myself having difficulties and probably not focusing on the right things. As I said, when you build a real user monitoring system, you need to change your mindset; your queries should be more in the data scientist style. And PHP and the ORM I was using, Doctrine, are not really meant for writing complex queries in this fashion. So I found myself writing my own query builder, using Doctrine when convenient and my own query builder when convenient, but again this was too much maintenance for a single maintainer of a project. I also wanted to introduce user management and a permission system, but with my limited time, working on the project occasionally during weekends, this was again too much and not the right focus; I just wanted to show some meaningful data. And I really love Plotly, but I ended up with large blobs of JavaScript here and there, and it was more and more Plotly; I wanted to see data, not write JavaScript. So I took a break, I believe half a year, and focused on my main job, but from time to time I did research, read articles about time series databases, and started exploring some of the available open source systems for visualization. And I ended up rebuilding the complete backend. I kept boomerang, but I rewrote the server side: I completely removed nginx and PHP, and I used Go.
I replaced MySQL with ClickHouse, and I replaced all the custom code, the PHP and the Plotly, with Grafana. Again, if you'd like to play with the current version of Basic RUM, this is what I ended up with; it's actually a slightly rebranded version of Grafana with the specific Basic RUM dashboards and settings. If you'd like to try it, just visit this address and use calendar/calendar as the username and password. So, where was Go really useful? Go is just a different paradigm, a different idea compared to PHP. With Go you can compile a single binary, and in that single binary everything I needed was packaged; it's just one process you run on the server and it has everything inside. This allowed me to get rid of nginx, because Go has a built-in HTTP server package. Yes, PHP also has packages for an HTTP server, but you need a lot of workarounds to make it work, because it's just not native in PHP. I could also leverage the existing ClickHouse package in Go for interacting with the ClickHouse database, and I took advantage of asynchronous inserts, which let me get rid of a lot of code I had in the PHP version of Basic RUM. Also, in Go it was very easy to create a backup mechanism for all the data flowing through the system, because in Go I could easily keep stuff in memory; I didn't have to write each request to a file and later batch and bundle it.
I was just keeping these data points and requests in memory, for example for 10 minutes, and then I could flush them to the hard drive and compress them; again, this was really easy, a few lines of code, and comes naturally in Go. And for cases where I needed encryption, there's a Let's Encrypt package for Go; it's a third-party package, but I could easily spin up a server, say I want to use Let's Encrypt, and get a secure connection to the server. It really reduced the effort on the operations side. I also took advantage of a GeoIP lookup library which uses the MaxMind database. Why did I need this? Because in a real user monitoring system you'd like to see from which city or country a visitor visited the website; this is really helpful when you want to create a report and figure out in which countries your website is slow. I also took advantage of a user agent parsing library, which helped me extract important information like the browser name, the operating system and the user agent family. And I started using my new favorite database, ClickHouse. Remember how I said I was doing a lot of work batching and bundling everything and inserting big batches into MySQL? ClickHouse comes with a really cool feature called asynchronous inserts: every time a request reaches my back end, I can immediately issue an insert to ClickHouse, and ClickHouse batches it internally and decides when it needs to insert into the database. This helped me avoid some performance bottlenecks.
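The in-memory buffer-and-flush backup mechanism described above (hold beacons for a while, then write one compressed batch) has the same shape in any language. The talk's implementation is in Go; this is only a sketch of the idea, in Python, with made-up file naming and a default 10-minute interval matching the talk.

```python
# Buffer incoming beacons in memory and periodically flush them as one
# gzipped JSON-lines batch -- the backup mechanism described in the talk.
import gzip
import json
import threading
import time

class BeaconBuffer:
    def __init__(self, flush_every_s=600, path_prefix="backup"):
        self.flush_every_s = flush_every_s
        self.path_prefix = path_prefix  # made-up naming scheme
        self._items = []
        self._lock = threading.Lock()

    def add(self, beacon):
        """Called per request; just appends, no disk I/O."""
        with self._lock:
            self._items.append(beacon)

    def flush(self):
        """Write buffered beacons as one gzipped JSON-lines file."""
        with self._lock:
            items, self._items = self._items, []
        if not items:
            return None
        path = f"{self.path_prefix}-{int(time.time())}.jsonl.gz"
        with gzip.open(path, "wt") as f:
            for item in items:
                f.write(json.dumps(item) + "\n")
        return path

    def run_forever(self):
        """Background flusher, e.g. threading.Thread(target=...)."""
        while True:
            time.sleep(self.flush_every_s)
            self.flush()
```

This works in Go or Python precisely because the process is long-lived; it is the piece that a stateless per-request PHP model cannot express without an external cron job.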
Another thing I could do with ClickHouse: you see I had seven tables in the old MySQL setup, but in ClickHouse I ended up with two tables, and I actually could have used just one; I only needed the second table for showing the host names in the Grafana filters. In general, when we work with time series, the data is denormalized. I had tried to build a user monitoring system in the fashion of a webshop, which is really the wrong idea. With a time series database, the idea is that you can just throw your data into one large, fat table, and you don't really need to worry about duplicating data. For example, here we have this device type filter, and I don't have a foreign key to another table holding references to all the device types; I can just insert the same string over and over again, desktop, desktop, desktop, and the database is completely fine with it. It compresses the data internally, and I won't experience any performance bottlenecks when I filter by this field. And here is my other favorite feature in ClickHouse: the LowCardinality data type. This data type is really convenient for columns where the number of distinct values is less than about 10,000, because ClickHouse optimizes it internally, and the WHERE conditions and filters become much, much faster. If we have more than 10,000 distinct values, we probably need to go back to something like this and start introducing additional dimension tables. Also, here on the left is, I would say, insane; I don't even know how I created it, I'm still really surprised at myself, and we can't zoom in here, but this was a process that included querying my MySQL database, some application code, and a bunch of cron jobs, all trying to guess and find all the sessions that bounced and what the duration of the sessions was. It was just really complex. To calculate the bounce rate with my new ClickHouse setup, I could use a query like this. I got a bit of help with this query and I don't completely understand it, but it works, it's much simpler, and it makes Basic RUM much easier to maintain. With this query I could easily create this correlation between bounce rate and a metric, in our case time to first byte. I also want to say that open source is not only about how great the product you work with is; the community is also very important, and that's another reason I stick with ClickHouse. They have a really great Slack community, and every time I ask a question, I get a really good response within a few hours. For example, here I'm asking: hey, I wrote this query but I feel it's not optimal, I'm not an SQL expert. And another expert suggested a better way to write the query; it's shorter and much more performant. Also, this is probably the first and last database YouTube channel I will ever be subscribed to, but I'm actually subscribed to the ClickHouse YouTube channel, and they have really good videos: every month there's a release party video where the ClickHouse team shows the new features, and there are a lot of good tutorials. It's really welcoming for beginners; as I say, you get support from the community, and there are really good materials out there. So now let's look at the user interface, Grafana. Earlier I mentioned that in the first version of Basic RUM I was about to start implementing my own user management, login and authentication. With Grafana, this
comes out of the box, so it's much easier to add new users and give them different permissions, and again, this is just code I would never want to write again. In this repository I bundle the Basicrum version of Grafana, which has some customizations. Another benefit of Grafana is that it's very easy to model the data and decide what you want to see in the visualization panels. For example, here we can define filters and preview our data, and we can configure different things; here I'm showing how to configure different colors for different thresholds. There is also an SQL editor: when I write the SQL here, Grafana uses it to fetch the data from ClickHouse. And here are other panels I took advantage of. This is the world map; it was literally plug and play, I just configured a few things and told it where to read the country data. Grafana also has a third-party plugin for Plotly, for scenarios where I wanted to build more complex panels; with it I could build this one, which shows how the width of the screen size is distributed. Time series is the default view in Grafana, and I can also present the data in a table, which is very good when you want to explore your own data. Grafana also comes with different data sources, and of course it needs to know how to talk to ClickHouse. In Basicrum I'm using a data source developed by a company called Altinity, but there is also an official one developed by ClickHouse. And I just want to say that all these dashboards built into the Basicrum version of Grafana are under version control. It's not that I created a dashboard in a Grafana instance, exported it, and saved it somewhere; I have this repository where I have the
configuration for each of the panels that I'm maintaining. That makes it much easier when I need to change something or add a new panel, and I can go through the history and understand what actually changed if something has to be reverted. For example, here you see how I keep a row as templated SQL, and this is how it's presented when we look in Grafana. From all this source configuration for the dashboards, I build a Docker image: here we do a bit of branding work, removing or rewriting some things from the default Grafana image; here we install the plugins we need for our setup; and here we import all the configurations for the dashboards and the data sources. What I found over time, when I spoke to different people who asked me about real user monitoring systems, is that the conversation often ended when I explained: you need to run this component on this server, and that component on that server, and so on. The use case of the people I spoke to usually didn't require them to scale; they had pretty small websites or web shops. So I work on something a bit more monolithic, called Basicrum All in One. Probably, from an engineering point of view, it sounds like bad practice, but it can actually be a really practical thing. The idea is to run everything on one big box; I believe it could be hosted somewhere for 20 euro a month, and I tested that it can handle 1.5 million page views a month. Here we introduced Traefik, a proxy that sits in front of these components and helps me with SSL termination and routing requests, because some requests need to go to the data collection part and other requests need to go to Grafana, to the part
where we analyze the data. So this is really convenient and easy if you just want to give it a try. A few takeaways. A real user monitoring system is a fairly complex system, and if you want to develop one, you need to train yourself: learn about the data collection side and how the data is collected from the browser, learn how to visualize the data, and it's a bonus if you learn how time-series databases work. Again, choosing the right database for the right problem is key, and it's great when you can move a problem from the application layer to the database layer; it just saves a lot of time. And Grafana can save a lot of time and effort. Even if you still want to build your own front end, maybe start with Grafana to play with the data and display something; it literally will save a lot of time. I got a signal that I've run out of time, but you can catch me afterwards. All right, I can take one question. So, in this project we don't really keep any IP addresses, which I guess is what we consider user data. The backend doesn't store any personal data in this case; by default it uses the IP address only to identify the country and the city, and it doesn't store the IP address after that. On the data collection side, part of the Boomerang library's source code is private, but I know that for PCI compliance reasons it has special parts that try to avoid collecting things around the user. Sometimes the user may type, for example, a credit card number, and that could be collected by mistake, so this library also tries to avoid collecting critical user information. Do you mean the consent? So the library comes with a special loader snippet, and you can have your own callback, so you can call this loader
snippet only after a cookie consent, so it's possible.
Keyboard Interactions
So, hello everyone. My name is Patricia. I'm very excited and happy to be here. A quick disclaimer about me: I am a Chromium contributor, and today I'm going to present keyboard interactions and how I tried to improve them. Quickly about myself: I am from Vilnius, Lithuania. I moved to the Netherlands to study computer science, and during my studies I really got into open source, specifically through the Google Summer of Code program. In 2022 I worked on the definition of the INP metric, and I continued my work diving deeper into metrics and interactions, specifically the keyboard interactions that I will explain to you today. Around 2022 I also worked on the Perfetto tool, which is a wonderful tool for developers, but I won't get into details here, because in a few moments Alexander will explain everything you need to know. How I use it to this day: I trace websites with the DevTools Timeline checkmark, and it gives me all the necessary information about interactions, specifically event timings. And when we have event timings, we can get anything about the INP metric. So what is the INP metric that was already mentioned? It's a very popular metric today: Interaction to Next Paint, which assesses responsiveness by measuring key press, tap, and click interactions. For example, when you press a key on a virtual or physical keyboard, tap on a touch screen, or click a mouse to open any menu on a website, everything is measured so developers can see how fast their website responds to user input. This definition is, I think, wonderful and very innovative, and Google announced very recently that it's going to replace First Input Delay this March 12th, which is very exciting for the web performance group. However, after looking closer at the metric, we found that it's not entirely perfect; although it's a very wonderful metric, key press interactions specifically weren't working as we would like them to.
So my goal today is to explain how we improved key press interactions: first, what a key press interaction is and how we measure it, and then we'll dive into the more complex concept of non-standard interactions, such as emojis, and how we measure them for INP. I guess this slide brings you back to kindergarten, especially considering the FOSDEM context of very heavy tech topics, but to understand a key press interaction we really need to look at a simple button, because a key is just a button in a more complex context. When you press any button in this world, the button goes down, and when you release it, the button goes up. It's just that simple. Within a key press we have very similar behavior, with two fundamental events: keydown and keyup. In this example we have one interaction; in the input we see the character A. The entire interaction starts with the first event, keydown, meaning the user pressed down a key. We immediately generate an interaction ID for it, saying, okay, we start the interaction. After keydown there is a keypress event, which is dispatched if and only if there is a character value; in this case we do have a character value, meaning the key was mapped to a specific character. If you pressed something that didn't produce a character value, you wouldn't see a keypress. Then we have some events about the DOM: beforeinput, which means the DOM is about to be updated, and input, which is dispatched immediately when the DOM is updated. Lastly, we finish the interaction with keyup, to which we assign the very same interaction ID that was generated on keydown. Keydown and keyup are the most significant events in the interaction, as you've perhaps already seen. This sequence gives us the entire definition of a keyboard interaction within INP.
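The sequence above can be sketched as a toy model in plain JavaScript (not browser code; the function name and output shape are made up for illustration): keydown opens an interaction with a fresh ID, keyup closes it with the same ID, and under the original definition the events in between (keypress, beforeinput, input) carried interaction ID 0.

```javascript
// Toy model of the original interaction ID assignment described above.
// `assignInteractionIds` is a made-up name for this sketch.
function assignInteractionIds(eventTypes) {
  let nextId = 0;
  let currentId = 0;
  return eventTypes.map((type) => {
    if (type === "keydown") {
      currentId = ++nextId; // interaction starts: generate a fresh ID
      return { type, interactionId: currentId };
    }
    if (type === "keyup") {
      const id = currentId; // interaction ends: reuse the keydown's ID
      currentId = 0;
      return { type, interactionId: id };
    }
    // Under the original definition, the events in between (keypress,
    // beforeinput, input) carried interactionId 0.
    return { type, interactionId: 0 };
  });
}

const seq = assignInteractionIds(
  ["keydown", "keypress", "beforeinput", "input", "keyup"]
);
// seq[0] and seq[4] share ID 1; the events in between get 0.
```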
It consists of three time spans, as in click and tap interactions: input delay, processing time, and presentation delay. Input delay is the duration between the user pressing down a key and the event handlers starting to execute. Processing time is the time it takes for the code in the keydown and keyup event handlers to execute. Finally, presentation delay is the duration from when the event handlers stop executing until we finally see something in a frame. This definition definitely makes sense; however, it had some problems. After closer investigation, we found it can be more than a little confusing. Firstly, the keypress event having an interaction ID of zero makes developers wonder whether keypress is related to the keyboard interaction at all. And to make things worse, it turned out that keypress can be as large as keydown and keyup together. Just a second. So we updated the definition of the keyboard interaction. The new definition is very similar: it still contains the three time spans of input delay, processing time, and presentation delay. However, the processing time now includes keypress, so it spans the three candidates of keydown, keypress, and keyup at the end. We really hope this removes some confusion for developers, as we now assign an interaction ID to the keypress event. And we believe that including keypress is a step toward a polished, improved, and more accurate INP metric, especially for keyboard interactions. With a simple key press, everything is quite well defined: we know where the interaction starts and where it finishes. However, more interesting things happen with non-standard keyboard interactions, because we cannot be sure that our users will always use just standard keyboard interactions.
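The three spans can be written down as simple differences. This is a hedged sketch with made-up timestamp field names; in a real page these values would come from the Event Timing API (a `PerformanceObserver` observing `"event"` entries exposes `startTime`, `processingStart`, `processingEnd`, and `duration`).

```javascript
// Split one keyboard interaction into the three spans named above.
// The timestamp field names here are assumptions for this sketch.
function interactionSpans(t) {
  return {
    inputDelay: t.processingStart - t.keydownTime,        // key goes down -> handlers start
    processingTime: t.processingEnd - t.processingStart,  // keydown/keyup handlers run
    presentationDelay: t.nextPaintTime - t.processingEnd, // handlers done -> next frame shown
  };
}

const spans = interactionSpans({
  keydownTime: 100, processingStart: 130, processingEnd: 150, nextPaintTime: 180,
});
// 30 ms input delay, 20 ms processing, 30 ms presentation delay
```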
I even came across this post on Google's Instagram that uses everything to express one idea, from emojis to basic symbols. To understand how fast a website responds to such input, we really need to dive into input method editors. So, does anyone know what an input method editor is? That's great. Most of you have probably used one in some form. It's a software component that enables users to input text that cannot be easily represented on a standard keyboard, typically because of the very large number of characters in the user's written language. It's very common in East Asian regions, for example with Korean, Chinese, and Japanese. Although I would love to speak Korean, Japanese, and Chinese, unfortunately I cannot, so today we will look at a more standardized example, a simple emoji, which has a very similar structure. We can already see that we need to process far more events for one emoji than for a simple key press, and everything actually depends on how many interactions were made while producing that emoji. In this case, the user started by typing H and then selected the emoji, as you can see in the example on the left. Since the complexity of such interactions is much higher, we originally only assigned an interaction ID to input events. But thinking about the difference between pressing down a key in an emoji context and in a non-emoji context, we find that we still have those two very important fundamental events, keydown and keyup. So with our updates we assign interaction IDs on keydown and keyup. We still start the interaction with keydown, generate a new interaction ID there, assign the same interaction ID to the input event, and finish that key press interaction in the emoji context with the keyup event. When the user just selects an emoji without typing, we simply assign an interaction ID to the input event.
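The updated assignment in the composition case can also be modeled as a toy sketch (names and shapes made up, not Chromium code): keydown opens an interaction, the composition's input event reuses that ID, keyup closes it, and an emoji picked without any key press tags only the input event.

```javascript
// Toy model of the updated ID assignment in an emoji/IME composition,
// per the description above; function and field names are assumptions.
function tagCompositionEvents(eventTypes) {
  let nextId = 0;
  let currentId = 0; // 0 = no interaction currently open
  return eventTypes.map((type) => {
    if (type === "keydown") {
      currentId = ++nextId; // a key press opens the interaction
      return { type, id: currentId };
    }
    if (type === "input") {
      // input reuses the open interaction's ID; an emoji picked
      // without any key press gets an ID of its own.
      return { type, id: currentId || ++nextId };
    }
    if (type === "keyup") {
      const id = currentId; // close the interaction with the same ID
      currentId = 0;
      return { type, id };
    }
    return { type, id: 0 }; // compositionstart/update/end carry no ID
  });
}

// Typing "h", then picking an emoji from the suggestion list:
const tagged = tagCompositionEvents(["keydown", "input", "keyup", "input"]);
// IDs: 1, 1, 1, 2
```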
And that gives us a better understanding of how non-standard interactions behave. However, the algorithm really requires some creativity and deeper understanding. Coming up with a solution for something that does not behave in a standard way was quite a challenge, and the solution might not be perfect, because it heavily relies on the order of the dispatched events. For example, when we hold three keys down at the very same time and release them at the very same time, we get three keyups at the very end with the exact same interaction IDs. This shouldn't happen in general; they should have different interaction IDs. So it might not be the perfect solution, but looking into input method editors is an important way to address web responsiveness in East Asian regions, where people really do use a lot of graphemes in Chinese, Korean, and Japanese. And who knows, maybe all our emojis are just an introduction to more complex interactions; maybe one day we will see 3D interactions, and emojis will be the simple ones. For this project I'm really grateful to my mentors, and I call them heroes. They really helped me through the entire process of understanding interactions, defining INP, and understanding how to improve web responsiveness for developers. Thank you a lot for listening. If you have any questions, let me know, and if you're interested you can read the blog post or just ask me anything you want. Thank you. And do you have? Yeah. So, do the interactions go from top to bottom? It really depends on the way we see them on the websites. There are websites that show all the dispatched events from top to bottom, and maybe it's not the most intuitive way for you to read from bottom to top, right? But that is the usual order you would get when you look into the events dispatched during keyboard interactions.
So yeah, this was from bottom to top, absolutely. Okay, thank you.
Web Performance at Mozilla and Wikimedia
Hello, my name is Peter. I work for the Wikimedia Foundation in the quality and test team. I have about three minutes, so I'm going to show you a couple of things. In the team, we want to make sure that we find regressions, right? And the cool thing is that we keep all our performance metrics in the open, so you can go to our Grafana instance, grafana.wikimedia.org, and see our metrics. Now I'm going to do a live demo. Oh, that didn't work out so well. So let's see. Okay. I'm going to show you four dashboards. We have our real user monitoring dashboard, with the data that our real users send back to Wikimedia. I suggest you go to that dashboard and look at our performance metrics; I think it's quite interesting, because not many big websites actually show their data. We also have another dashboard with all our synthetic tests, where you can use the dropdowns to see the pages that we test and their performance data. That is more like internal data, so maybe not so interesting for you. So I have two more dashboards that are more interesting. First, the user network dashboard. Let's see. Here you can see what kind of network our Chrome users have: we use the Network Information API and beacon the data back, so we can use the dropdown and see what kind of network our users are on. And if we scroll down, you can also see what kind of connection type they have. This is interesting because you can see what kind of connection different areas of the world have when they access Wikipedia. And the last thing I want to show you is the CPU benchmark. As Beth said, it's important because different users have different devices, right? For some of our users, we run a small piece of JavaScript, measure the time it takes to run, and beacon that data back, so we can see what kind of performance different devices have for users all around the world.
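A minimal sketch of this kind of CPU micro-benchmark, under stated assumptions: run a fixed, deterministic chunk of JavaScript work and report how long it took. The workload and function name are made up here; Wikimedia's actual benchmark script will differ.

```javascript
// Hypothetical CPU benchmark sketch: time a fixed busywork loop.
function cpuBenchmark(iterations = 1_000_000) {
  const start = Date.now();
  let checksum = 0;
  for (let i = 0; i < iterations; i++) {
    checksum += Math.sqrt(i); // fixed, deterministic busywork
  }
  const elapsedMs = Date.now() - start;
  // Returning the checksum keeps the loop from being optimized away.
  return { elapsedMs, checksum };
}

const result = cpuBenchmark();
```

In a browser you would typically use `performance.now()` for the timing and `navigator.sendBeacon()` to report the result back, as the talk describes.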
We use that data to compare different devices, and to tweak how we run our tests internally. If you go to that page, you can see what kind of benchmark a device gets on Wikipedia. Okay, that was all for me. Dave. Thanks, Peter. Hey, everybody. I'm Dave Hunt. I'm going to stand over here so I'm not blocking the screen. I'm the engineering manager for the performance tools team at Mozilla, and I'm going to show you a little bit about how we handle regressions for Firefox. We have tests that cover page load and benchmarks, and I'm going to use a real example of a recent regression. I'll go pretty fast through these because I only have a few minutes. First, the obligatory slide with a quote from a famous person. Galileo Galilei said: measure what is measurable, and make measurable what is not so. I think this is something we try to do in our team. So here, this is a performance alert. We have a bunch of tests running whenever somebody commits something to mozilla-central, our repository for Firefox. When we notice a change in the baseline, we generate an alert. One of our performance sheriffs will be monitoring and triaging these alerts; in this case, this one was triaged by Andra. It shows the magnitude of the regression and the tests that have alerted. I've filtered it down just for simplicity: this is Expedia, and we can see that some of our speed index tests have regressed. The sheriff will do some investigation. This is one of those tests shown in graph view: you can see the baseline, then a change. The sheriff has come in here and done some retriggers and backfills to narrow the regression range and identify a likely culprit. Then the sheriff files a bug in Bugzilla. Because we've identified the likely culprit, we'll also set a needinfo flag.
They will request further information from the author of that patch so that they're aware there looks to have been a regression: maybe we need to back it out, or maybe we need to fix it. I'm also highlighting here the links through to one of our other tools, the Firefox Profiler. We provide as much as we can to the engineer so they can confirm, yes, it looks like it's my patch, and also take a deeper dive into what might have caused it. Another tool we have is PerfCompare. If engineers think they have a fix, or something that might affect performance either positively or negatively, they can push it to our CI system, run the tests, and see a comparison. Here, again for that example, the Expedia contentful speed index: this is the before, in this case the regression, and a patch that should fix it. We can see that the distribution of the results, having run the tests multiple times, is smaller, which indicates that perhaps this is fixed. And it was. We also alert on improvements: this is the alert that came in a couple of days after the patch landed to fix it, or to back it out. I think the change was in how aggressively we garbage collect. We can also look at the graph view, see the period of time that we had that regression, and see that it's fixed and back to the baseline we had before. We also capture videos, another tool that is useful for the engineers to confirm that there really is a regression. In this case, this is the fix: this is the slower one, and the improved, faster one. I mentioned the Firefox Profiler; I encourage everybody who doesn't use it or hasn't used it to check it out, try it, and give us feedback. And finally, I just wanted to promote:
Floring Kes is talking in Janson at 1pm today; that's the main track, on Firefox profiling. You'll see an example of using the profiler for something other than web performance; it's a very versatile tool. That's it.
Fast JavaScript with Data-Oriented Design
Hello everyone. My name is Marcus. I would like to share some lessons that I learned while working on the Firefox Profiler. I work at Mozilla on the Firefox Performance Team, on Firefox itself and on the Profiler, and I also have my own profiler called samply, which you can use on macOS and Linux to profile your own applications. I'll give you a brief overview of the Profiler. This is what it looks like with a profile loaded: you have a timeline at the top, a call tree, and a sidebar; it's very small down there, so I'll zoom in a little. In the call tree you can see which function called which function, and how much time each function spent running; let's take this function here, dispatchEvent. The Firefox Profiler is a sampling profiler: it interrupts the thread at a given interval, usually every one millisecond, checks what's on the stack, and accumulates that into this call tree. One thing I want to call out here is the category breakdown in the sidebar. We have a bunch of categories: User is just regular code; Ion means JavaScript code that was jitted by the IonMonkey subengine of our JavaScript engine; and there are a bunch of other categories. You can select in the timeline, and as you draw your selection, the category breakdown updates in the sidebar. We can also zoom in and see something a little smaller; here we have more code in Ion, more in the User category. It also has a flame graph; let me zoom back out. You're probably familiar with flame graphs: they're a different representation of the same information; you have a call tree with nested functions, and the width of each box is the time spent in that function. And we also have a tooltip in the call tree, which again gives us a category breakdown.
I'm emphasizing the category breakdown so much because we're going to implement our own profiler in a minute, and we're going to focus on calculating this category breakdown. Here we see it's a bit sluggish as you move the mouse around, because it actually needs to iterate over all the samples in the timeline. For every sample it checks whether the stack is inside the function you're hovering; if so, it checks which category the CPU is spending its time in for that sample and accumulates that into a map of categories to counts, and it does that for all the samples. We can see at the root node that we have about 500,000 samples in this profile. What I didn't tell you is that this was actually the version from last July, and I fixed this performance problem. This is what's live now on profiler.firefox.com: hovering these boxes is now instant, and it's still doing the work, still going through all 500,000 samples every time you move your mouse. So I want to talk about how we can crunch through all that data in a very short time. Wrong direction. So yeah, even with lots of samples, we can have a fast UI. I made an example project just for this talk, called mini profiler. It's on GitHub, and it's also live on Netlify, so you can try it out in your browser if you want. This is what it looks like. It has a very reduced feature set, but it also has this timeline; you can select parts of the timeline and it calculates the category breakdown. Let's say here, we spent 30% in IonMonkey-jitted JavaScript code. At the same time, it also calculates the heaviest stack: the stack that we spent the most samples in. All right. So, mini profiler features; there are only two. You select a range, and it gives you a category breakdown and a heaviest stack. How does it calculate that? We have an input JSON which describes the profile contents. The JSON is a list of samples.
Every sample has a time, a weight, and a stack. Every stack is an array of frames, and every frame has a name and a category. I'll show you that in an example: a list of samples; every sample has a time property, a stack property, and a weight property; the stack is an array, and each stack frame has a name and a category. To calculate the category breakdown, we take the profile and a range of indexes of the samples that are in the selection. Then we iterate over this range. For each sample we get its stack and its weight, get the top frame from the stack, and get the category from the frame. Then we check: does our map have this category already? If so, get the current value; otherwise, default to zero. We add the weight of the current sample and put the sum back into the map, and this map is what gets used by this Svelte component. The heaviest stack is somewhat similar. We again iterate over all the samples in the selected range, and for each sample we again get the stack and the weight. Now we need to check if this stack has been used by multiple samples. And how do we find two samples with the same stack? Well, the stack is an array, and you can't easily check arrays for equality, so what I'm doing here is stringifying the stack into a JSON string and using that as the map key. Then there's a similar step to the category breakdown: we check whether we have an entry in the map for this stack; if so, take its current value, otherwise default to zero; add the weight; put that back into the map. And if this stack is the heaviest we've seen so far, we remember it, and at the end we return it. These are the two main algorithms in this mini profiler: category breakdown and heaviest stack. Both of them have to iterate over all the samples. So how fast is it? If I select here, it's reasonably fast. If I make the selection bigger, it starts getting a little janky.
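The two v1 algorithms as described can be sketched as follows (reconstructed from the talk, not the actual mini profiler source; "top frame" is assumed here to be the last element of the stack array):

```javascript
// v1: category breakdown from each sample's top frame.
function computeCategoryBreakdown(samples, start, end) {
  const map = new Map();
  for (let i = start; i < end; i++) {
    const { stack, weight } = samples[i];
    const category = stack[stack.length - 1].category; // top frame's category
    map.set(category, (map.get(category) ?? 0) + weight);
  }
  return map;
}

// v1: heaviest stack, keyed by JSON.stringify of the whole stack.
function computeHeaviestStack(samples, start, end) {
  const map = new Map();
  let heaviest = null;
  let heaviestWeight = 0;
  for (let i = start; i < end; i++) {
    const { stack, weight } = samples[i];
    const key = JSON.stringify(stack); // expensive: builds and hashes a big string
    const total = (map.get(key) ?? 0) + weight;
    map.set(key, total);
    if (total > heaviestWeight) {
      heaviestWeight = total;
      heaviest = stack;
    }
  }
  return heaviest;
}

const samples = [
  { time: 0, weight: 1, stack: [{ name: "A", category: "User" }, { name: "B", category: "Ion" }] },
  { time: 1, weight: 2, stack: [{ name: "A", category: "User" }] },
  { time: 2, weight: 1, stack: [{ name: "A", category: "User" }] },
];
const breakdown = computeCategoryBreakdown(samples, 0, samples.length);
const heaviest = computeHeaviestStack(samples, 0, samples.length);
```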
I'm computing some throughputs down here: about 100 nanoseconds per sample for the category breakdown algorithm, and 30,000-something nanoseconds per sample for computing the heaviest stack. As we saw, the heaviest stack algorithm is really inefficient: it uses JSON.stringify, it looks up a gigantic string in a map, it needs to hash the entire big string, and so on. This is obviously not the way to go, but it's a place to start so that we understand what's going on. The nanoseconds-per-sample numbers might not tell you much by themselves, but what you can think about is how they limit the profile size you can handle while still being responsive. Let's say you have 100,000 samples; in this example we had only 1,600-something. With 100,000 samples you get 10 milliseconds for computing the category breakdown and 3.6 seconds for computing the heaviest stack. 3.6 seconds per update is not acceptable, so we need to do something. Also, the JSON file is just massive, because it repeats all those stacks. So let's use a different JSON format. Here I made a v2 format. It still has samples, but instead of having the stack right in the sample, each sample just has an index, and this index points into a stack list. Each element in the stack list has a frame index, which points into the frame list. Each frame has a name and a category index, which points into the category list. I hope that's not too overwhelming: instead of nested objects, we just have some side-by-side lists and indexes into them. The stacks are a little special because of the parent stack index. If, for example, a sample refers to stack number two, then this is the frame at the top of the stack.
Then we go to the parent stack, find its frame, which is the next frame on the stack, then follow that stack's parent, and when the parent is null, we're at the end of the stack. I hope I haven't lost anyone yet. So let's go back to the compute-heaviest-stack algorithm. We were iterating over the samples, stringifying the stack arrays, and using the JSON string as the key. Now we don't need to do that anymore: we have an index, and if two samples have the same stack index, they have the same stack. So we just use the stack index as the key, we don't need the JSON-ification, and we don't look up big strings. This is a massive performance improvement: 300 times faster. The category breakdown is also affected by the new format. Now, instead of getting the stack and the frame directly from inside the sample, we get a stack index, look it up in the stack array to get a frame index, look that up to get a category index, and look that up to get a category name. That name is a string we put in the map to add up the weight, and this string is kind of unnecessary: if two samples have the same category index, we can use the index itself as the key. So I made an optimization to remove the name lookup, and now we accumulate weights per category index in this map. There has to be some post-processing afterwards to get the names back for the category breakdown display, but that's outside our algorithm. All right. So earlier I had selected the small profile in format version one; let's switch to the same profile in format version two and do the selection again. Now we can select the full width and it's still very responsive. So here are our throughputs. How fast is it now? 47.1 nanoseconds per sample for the category breakdown is what I measured, and 51 for the heaviest stack. Okay. So that's much better.
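The v2 format and the index-keyed heaviest-stack algorithm can be sketched like this (field names are assumptions; the real mini profiler format may differ):

```javascript
// v2 sketch: flat lists plus integer indexes, stacks deduplicated
// via parent links.
const profile = {
  categories: ["User", "Ion"],
  frames: [{ name: "A", category: 0 }, { name: "B", category: 1 }],
  stacks: [
    { frame: 0, parent: null }, // stack 0 = [A]
    { frame: 1, parent: 0 },    // stack 1 = [B on top of A]
  ],
  samples: [
    { time: 0, weight: 1, stack: 1 },
    { time: 1, weight: 2, stack: 0 },
  ],
};

// Recover a stack's frame names, top frame first, by walking parents.
function stackFrameNames(profile, stackIndex) {
  const names = [];
  for (let s = stackIndex; s !== null; s = profile.stacks[s].parent) {
    names.push(profile.frames[profile.stacks[s].frame].name);
  }
  return names;
}

// Heaviest stack, now keyed by the integer stack index: no stringify,
// no big-string hashing.
function computeHeaviestStackByIndex(profile, start, end) {
  const weights = new Map();
  let heaviest = -1;
  let heaviestWeight = 0;
  for (let i = start; i < end; i++) {
    const { stack, weight } = profile.samples[i];
    const total = (weights.get(stack) ?? 0) + weight;
    weights.set(stack, total);
    if (total > heaviestWeight) {
      heaviestWeight = total;
      heaviest = stack;
    }
  }
  return heaviest; // a stack index; resolve with stackFrameNames
}

const heaviestIndex = computeHeaviestStackByIndex(profile, 0, profile.samples.length);
```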
Let's see how far we can go; we want to see if there's more we can do here. So we use a profiler. I'm going to start the profiler. Oh, what I didn't show you is how to use the profiler; let me do that really quickly. If you use Firefox and go to profiler.firefox.com, you can click this big button, which gives you a toolbar button, and when you click that toolbar button, it starts recording. So let's record our current implementation: do a little bit of this, capture a profile, and see where the time is spent. Well, where is the time spent? One second. Let's try that again; let me refresh this page. Ah, I can tell you: it is so fast that it barely shows up in the profiler, because we're still using the small profile size. Let's do that again; capture a profile. For localhost here, there's barely any CPU usage; you would expect to see more yellow in here. So let's switch to a bigger profile. We still have just the 1,600 samples; let's switch to the medium profile. Here, it still works okay, but it gets a little janky towards the edges. So again, we start the profiler, select, play around a little so that we get lots of samples, and capture the profile. And there we go; this is what I was expecting. Now we have lots of yellow in here. I'm going to show just this thread, switch to JavaScript only, and switch to the flame graph. What we can see is that we're spending time in "compute category breakdown with string key map" and "compute heaviest stack with map", and in both we're spending some time in Map.prototype.set, both over here and over there. That makes sense; we're assigning things to a map. So, can we not use a map? Wrong direction here. We're seeing the time in Map.prototype.set. For the category breakdown computation, we're getting the category index out of the map and putting it back in. But we know these are integers.
They're integers into the category list. The category list doesn't have lots of elements. We can just use an array here instead. I'm going to use a Float64Array here. Because the weights are floats, using a typed array means I know that the maximum number of elements is already preallocated. It's initialized to zero. I don't need to check if there's something in it already. I know that it starts with zero. I can just add the weight. And that's it. We can do the same modification to computing the heaviest stack, the compute heaviest stack algorithm. It was also using a map. We can use a Float64Array because we know how many stacks there are. Here the key is the index into the stacks array. We use that key as our index into the typed array. And then it should work as before. And what we see down here, it is three times faster to skip the map and use a typed array instead. Let's try that out. Here I'm going to switch from the basic implementation to the integer keys for category breakdown. No, sorry, to the typed arrays instead of maps implementation. And now I'm going to select, and it's very smooth through the entire profile. And we have 500,000 samples now here. And we are still responsive. And let's see if we get an even bigger profile. This one here has two million samples. How responsive are we? It's okay. It gets a little janky towards the end here. It's mostly okay. So where are we now? Let's do a quick recap. We've addressed the obvious low-hanging fruit. We've done what the profile told us. We fixed the hotspots. We changed the format so that comparing stacks is cheap, we changed two maps into typed arrays. Got us a 3x perf boost. In the heaviest stack case, the amount of memory we're using might be a bit bigger now because we're allocating an array where we have an element for every single stack index, even if no sample references that stack index. So maybe some extra memory, but we have a performance boost.
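The Map-to-typed-array change just described can be sketched like this. The data shapes are illustrative (a stack table mapping to frames, a frame table mapping to category indices); the point is that the Float64Array is preallocated and zero-filled, so accumulation needs no "is it present yet?" check.

```javascript
// Accumulate per-category weights into a preallocated Float64Array instead
// of a Map, using the category index itself as the array index.
function computeCategoryWeights(samples, stackTable, frameTable, categoryCount) {
  const weights = new Float64Array(categoryCount); // already zero-filled
  for (const { stackIndex, weight } of samples) {
    const frameIndex = stackTable[stackIndex].frame;
    const categoryIndex = frameTable[frameIndex].category;
    weights[categoryIndex] += weight; // no has()/get() dance needed
  }
  return weights; // category names can be re-attached in post-processing
}
```

The same trick works for the heaviest-stack accumulation, with the stack index as the array index and one Float64Array slot per stack table entry.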
And so we have the throughput here. Yeah. So for the medium profile, our throughput is like 16 nanoseconds. Or let's see, sometimes it goes up and down a little bit. Yeah, let's say 16 nanoseconds for the category breakdown, 40 nanoseconds for the heaviest stack. I was seeing some other numbers when I was trying this at home. So it's pretty impressive. Modern computers are pretty fast, but maybe we can do even better. So let's try better. Let's go back to the category breakdown algorithm. We are taking these two values out of every sample. The sample is an object. It has three properties. We're ignoring the time property. We're getting these two properties out. So what does that mean at a byte level? So how are arrays of objects stored in memory? Well, it depends a little bit on which JS engine you're using, how you're allocating the object, if you happen to be on a fast path or not. But in SpiderMonkey, this is what you might expect. So we have a samples array, which is backed just by a list of pointers. Every pointer takes up 8 bytes on a 64-bit system, and it points to a JS object. So let's say here, the first entry in our samples array points to this JS object here. The JS object starts with a header, which in SpiderMonkey takes up 24 bytes on a 64-bit machine. Then if we're lucky, we have the fields inline just after the header. We might not be lucky, but let's say we're lucky. We might also have a bit of padding here at the end, because the inline slots might be only sized to four or eight, and we're using three properties here, so there might be a bit of extra memory used up by that. So this is just one representation that we could have. It varies a lot by engine. For example, Chrome has pointer compression, so these things here might be four bytes each, but then the time field might be an extra pointer, because in Chrome, sometimes the floating point values are a separate heap allocation. The padding could vary, the object header size could vary.
These fields here could be behind another pointer if they're stored out of line, and so on. But anyway, what it comes down to is we wanted to get these two fields here, 16 bytes in total, but what we ended up with is all of these other not-so-useful bytes clogging up our cache. So when the CPU wants to get those bytes, it gets them in 64-byte chunks. A cache line is 64 bytes. So if you're getting this value here, you're getting the other bytes that are in the vicinity, even if you don't need them. Well, here we do need the JS object header, because the JIT needs to check that the object is of the right shape, and so on. But we really just want those values here. So can we do anything about that? We want to improve our cache line utilization, and we want to reduce the indirection. Maybe we can. Let's do something radical. Let's turn everything on the side. So we have this array of objects. What we could do instead is to have an object of arrays, or struct of arrays, where we have just one key for the time column with a big array that has just the time values, one for the stack index, just the stack index values, the weight, just the weight values, and a length stored on the side. These arrays must all have the same length. So now everything's backwards. If we want to access the weight in the past, we had samples[i].weight. Now it looks a bit weird, because we have the sampleTable.weight column, and then we get the ith element of that. But let's do it. Let's see where it goes. And so what we end up with here is a new profile format again. Now we have a sample table, a stack table, a frame table. The categories are still a list, because it's just some strings. And same thing as before, the stack index goes into the stack table, the frame index goes into the frame table. We just need to access the properties differently. So what does it do for the computation of the heaviest stack?
Here we were getting the stack index and the weight property from an object. Now we just get them from separate columns. And already we're seeing a 2x performance improvement. For the category breakdown, similar story. Instead of getting the properties from objects, we get the column first, access the ith element, and get that. This here is even faster, like 3.5x faster. Let's see that in practice. So we're switching to format v3 now, struct of arrays. Let's get the medium sized profile. And now it just flies. It's just responsive all the way. 4.5 nanoseconds per sample, that's really not a lot of time. This is super fast now. Let's get an even bigger profile. Still super responsive. So when we think about how it is represented in memory again: we're accessing these columns now, and we're accessing them in order. And what happens is that our cache lines are now fully utilized. We don't have object headers clogging up our cache anymore. We just have the numbers that we wanted. But yeah, it's just super efficient now. We get all the stack indexes, we get all the weights. The time column is now pretty much irrelevant. It was clogging up our cache before, but now we're not accessing the time column at all. So it just doesn't bother us anymore. Okay, so let's recap quickly. We have a struct of arrays. Some people call it parallel arrays. It's commonly used in game engines, databases, and so on. It has a few drawbacks. It looks a bit backwards if you read it. Sometimes when you want to pass around an object, you need to manually materialize it because you don't just want to pass around an index. But it also means that the type system, at least in TypeScript, is now less of a help. We can introduce mistakes that it wouldn't catch. So for example, if we build up our arrays and we end up not putting our values in every one of the columns, we end up with mismatched lengths, and that is hard to catch at the type level.
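The struct-of-arrays layout and the column-based access just described might look like this sketch. Column names are illustrative; every column has the same length, and the access pattern is "get the column once, then index into it".

```javascript
// Struct of arrays ("parallel arrays"): one typed array per column, plus a
// length on the side. The time column exists but is never read below, so
// its bytes never enter the cache during this computation.
const sampleTable = {
  length: 3,
  time:       new Float64Array([0, 1, 2]),
  stackIndex: new Int32Array([2, 0, 2]),
  weight:     new Float64Array([1, 1, 3]),
};

// Column access: samples[i].weight becomes table.weight[i].
function totalWeightForStack(table, wantedStack) {
  const stacks = table.stackIndex;
  const weights = table.weight;
  let total = 0;
  for (let i = 0; i < table.length; i++) {
    if (stacks[i] === wantedStack) total += weights[i];
  }
  return total;
}
```

Because the loop reads two densely packed typed arrays in order, each 64-byte cache line delivers nothing but the numbers the algorithm actually needs.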
Also, when we pass around indexes, sometimes, yeah, you get a number, you don't really know, is this an index into the stack table, into the frame table? I don't know. The type system, at least in TypeScript, I don't think is well set up to catch these kinds of mistakes. But it's much more cache efficient. It's easier on the garbage collector. You need to traverse fewer objects. Some engines can skip the contents of arrays of numbers, so it should speed up that too. Less memory overhead from object headers and padding. And we can treat columns separately. Like sometimes we want to make a change to one column. Let's say we want to shift the entire profile by some time delta. We can change just the time column. The other columns stay untouched. We don't need to recreate any objects. And it also gives us a little more control over sizes and how compactly our numbers are stored. We can pick our typed array. We could pick an Int32Array. We could pick an Int16Array. If we know what the domain of our values is, we can store things more compactly and we get back in control of the representation. Okay. I want to make it even faster. So if we look back at our category breakdown, we're getting the stack index, we're getting the frame index, but it's all just to look up the category from the frame table. We're not really interested in the stack or the frame. We just want the category for a sample. So what if we just got the categories for each sample and used that? Instead of, here, stack, frame, category, just go category, boom. Well, it would be great if we had this column. Where does it come from? Well, we can compute it here with the get sample categories method. We iterate over all the samples. We do the stack, frame, category conversion here. We cache that in the sample categories column. We pass that to our existing function, but we only want to do this once, not on every call. So we need to cache it somewhere.
We can use memoization for that. So here's a memoized call. We get the profile. We only run this once. So if we call this multiple times with the same profile, let's say our profile is immutable, we have it cached from last time. And we can make the caching even more precise. If we memoize a function which takes just the columns that we need, then we get it. We wrap this into the existing get sample categories function, which takes the profile, but then it takes out the columns we want, passes those separately to the memoized function, and that makes the caching even tighter. If you touch a column that is not involved, you don't invalidate your cache. And did it work? Yes, it did. Oops, wrong direction again. Memoized sample categories. We're now down to three nanoseconds. So I'm basically done with the talk. Let's just look at the graph here at the end. This V1 graph is off the charts like this. It's way higher than this. But we made it faster with every change here. And this last step of caching the categories for each sample, it looks like it's not much, like 25% on these nanoseconds. But what it actually means is we can handle more data. We can handle a higher count of samples in, let's say, a 16 millisecond interval. And like 25% more data, that's massive. Okay, I want to say really quick, what is data-oriented design? It's a mindset and it's a collection of techniques. The main technique here is structure of arrays. The mindset is more about how you think about it. The shape of the data determines the algorithm and its performance. You need to know which things are small, which things are big. We might have seven elements in this array and 100,000 in that array. If you keep that in mind, you're better set up to write fast code. And if you also think about cache line utilization, you're even better set up. The rest is not that important. Thanks, everyone.
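The per-column memoization just described can be sketched like this. Names are illustrative: the memoized function takes only the columns it actually reads, so touching an unrelated column (like time) does not invalidate the cache.

```javascript
// Single-slot memoizer: re-runs fn only when an argument identity changes.
function memoize(fn) {
  let lastArgs = null;
  let lastResult;
  return (...args) => {
    if (lastArgs && args.length === lastArgs.length &&
        args.every((a, i) => a === lastArgs[i])) {
      return lastResult; // cache hit: same column objects as last time
    }
    lastArgs = args;
    lastResult = fn(...args);
    return lastResult;
  };
}

let calls = 0; // just to observe how often the expensive work runs
const getSampleCategories = memoize((sampleStacks, stackToFrame, frameToCategory) => {
  calls++;
  const out = new Int32Array(sampleStacks.length);
  for (let i = 0; i < sampleStacks.length; i++) {
    // The stack -> frame -> category hops happen once, not per query.
    out[i] = frameToCategory[stackToFrame[sampleStacks[i]]];
  }
  return out;
});

const sampleStacks = Int32Array.of(0, 1, 1);
const stackToFrame = Int32Array.of(0, 1);
const frameToCategory = Int32Array.of(3, 5);
```

Calling `getSampleCategories` twice with the same (immutable) columns does the stack-to-category conversion only once; the second call returns the cached derived column.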
You can find me in the Firefox Profiler channel. You can check the Firefox Profiler online. Happy profiling!
From Google AdSense to FOSS: Lightning-fast privacy-friendly banners
Good morning. I'm Tim. I work as a performance specialist at Akamai, but that's not what my talk is about. And everybody here in this room has two things in common, I assume. First, we love web performance. And two, how many of you are already thinking about food? Because I'm starving. And actually, if you don't know what to eat in the next days, when you're here in Belgium, there is this waffle burger at a Belgian restaurant called Quick. And if we are performance focused, Quick is also a nice way to get there. Now, next to my day job at Akamai, I also run the largest scale modeling website in the world. With 50,000 visitors a day and 6 million page views a month. It's a bit too big for the talk of Tvetan earlier today. And it's not only the largest scale modeling website in the world. It's also the fastest one. And this, thank you. Thank you. Thank you. And this, despite the fact that I run banner advertising, because normally banners means slow, slow, slow and annoying for your end users. And this talk is all about how I switched to an open source ad server solution in order to give my users better privacy. And then also, because I love performance, make sure that the performance is lightning fast. Who remembers this day? One. Yes. GDPR. Correct. This was when GDPR, almost six years ago, was introduced. And if we travel back in time to six years ago, my website back then used Google AdSense and a few other ad serving solutions. And what is great about these solutions, you can just add some JavaScript on your website, and you start earning money. That's it. Now, the problem is that when you then look at your waterfall, you see all these extra requests to third parties, third parties calling third parties, fonts are downloaded, CSS, JavaScript, cookies are set, tracking cookies, a lot of stuff happens. And this is a tool by my ex-colleague Simon Hearne, Request Map, that shows you: the blue circle at the bottom is the actual website.
And then you have all these spiders crawling off, additional requests going to additional things. And from a privacy perspective, this is not ideal. And this is all you need to do to create a nice banner of, in this case, a hamburger. Now, when I started, this was how my website looked, I was basically chillax. This was just how the web worked. This was the only way, there was no different way. This was just how the web worked. Now, in April 2018, one month before GDPR, I was, like, a little bit in panic. I was hoping that the ad providers, not only Google AdSense, but all the others, would come up with a privacy friendly version for Europe, and would therefore also make the websites faster. And in April, nothing was moving. So I looked for a plan B. And luckily, I was able to find a plan B, which was open source: Revive, an open source ad server. And why did I pick it? It was PHP based. My website was PHP. So it's good. It was already five years old. So it was not brand new, it was already proven. And it had fairly stable releases at regular times. Today, this open source project is maintained by the Aqua Platform, by Erik and his team. They also run, of course, a paid hosted version of the solution. But I use the free download version. So what can you do? Very quickly: everything you expect from an ad server. You can manage your campaigns. People can sign up to start placing ads on your website. Basically, everything which is needed to serve ads on your website is there. And this is the result. Remember before, that spider going everywhere; now everything is hosted on the same domain. So from my privacy perspective, I was back in chillax mode. Now, let's talk about performance. Just by implementing the open source solution on my own systems also gave me some performance gains by design. And the first is here: Revive itself does not require all these requests. So that's the first thing.
But as you can see, what is missing here are things like DNS lookups, TCP connections, TLS handshakes in order to talk to different systems. So that basically means that everything which is needed to serve that hamburger banner as soon as possible is not delayed, which is good. The other benefit is, we already talked about INP before and JavaScript performance. The library, Brotli compressed, is only 1.7 kilobytes. And typically the more JavaScript bytes you ship, the worse for things like INP, First Input Delay, Total Blocking Time. So it's a fairly small library. Other things: I work for a CDN, so I can run my website on the CDN. So I also use the image optimization services to make sure that I return modern formats like AVIF or WebP, et cetera. And then finally, last but not least, the fact that everything is under my control also means that I fully control priorities, things like fetchpriority high, fetchpriority low, preload, the order in the page. I fully control the order of things and I decide: do I want the banner to be served first or do I want the actual content to be served first. This of course assumes that your web server or your CDN listens to the priorities. Now, this was the basics. Just setting up Revive: great for performance, great for privacy. Now, good is not good enough. And in order to get these very, very good results, you still need to do a little bit more. So let me explain that. So we'll first look at LCP, or largest contentful paint. Just as an example, what is here the LCP element on this page should be fairly obvious. It's the largest image on the screen, which is that nice helicopter, which I'm currently building. Now, that's easy. Second one. What is the largest contentful paint element here? Sorry, it's early and I'm hungry. It's actually, as expected, the top one, because that's the biggest image.
Now, this is not what my users perceive as the LCP element, because they come for that small picture of the car. Now, what is the problem? This image is late discovered. It first needs JavaScript to run, then it needs to do a request to a PHP server to know which ad to serve. And only then will the image download. So it's late discovered, and it means it will come in potentially a few seconds later. So what is the best solution? Just send more bytes. So my website is driven by a lot of contributors. So when somebody uploads a smaller image, I basically nudge some other people like, hey, do you have a bigger image of this one? So my LCP gets better. Not only my LCP, people also like to watch nicer pictures as well. Now, that's plan A. That's the best. Now, I'm not always sure if it actually happens. So sometimes I do have pages where the images are too small compared to the banner. And what is my plan B? I call that a fast fallback banner. And it's exactly what it's doing. It's fast. And it's a fallback. So in order to make it fast, you need to make sure that it's early discovered. So it just becomes a standard image tag. So basically in my PHP code, I check: hey, when I generate this page, I already know the size of the image I will embed. Rather than using the JavaScript based version, which is slow, I fall back to a default image variant. The downside is that from an ad perspective, I can no longer track revenue. I can no longer know exactly which banner should be targeted, yes or no. So I lose some functionality. But typically on a website, you don't always sell all your potential banner locations. So you anyway have some; for example, I sponsor certain scale modeling events, or I have some coffee mugs of my website with internal banners. So I can basically perfectly display these non-revenue-generating banners, but keep performance.
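The fast-fallback decision described above could be sketched like this. The talk's site does this in PHP at page-generation time; this is an illustrative JavaScript version, and the markup and paths are hypothetical, not the real site's.

```javascript
// Sketch: if the page's largest content image would be smaller than the ad,
// emit a plain, early-discovered <img> fallback (no revenue tracking) instead
// of the late-discovered JavaScript ad tag. Paths are illustrative.
function bannerMarkup(largestContentImageArea, adWidth, adHeight) {
  if (largestContentImageArea < adWidth * adHeight) {
    // Fallback: a static banner the browser's preload scanner finds immediately,
    // so the real content image stays the LCP element.
    return `<img src="/banners/fallback-${adWidth}x${adHeight}.png" ` +
           `width="${adWidth}" height="${adHeight}" alt="sponsor">`;
  }
  // Normal path: the ad server's JavaScript picks and tracks the banner.
  return `<script src="/adserver/asyncjs.php"></script>`;
}
```

The trade-off is exactly the one from the talk: the static path is fast but untracked, so it is best reserved for non-revenue inventory like event sponsorships or house banners.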
And here then what you see is that request number four is the LCP element, which is requested quite soon rather than somewhere at the end. That was LCP, making sure that it's green under all conditions. Next is CLS, cumulative layout shift. And this is something everybody knows, typically on newspaper websites: you're looking at a page, you're reading the content and then suddenly, bam, everything goes down because the banners start loading. Now the solution for this is quite simple. Just add a class, add a placeholder, so that the browser, while rendering and painting everything on the screen, makes room for them, already reserves the room for that. Nothing special. Now, unfortunately, this was not good enough. Why not? Because in ad systems, in all ad systems, you basically have the choice between user experience and making more money. The top one is the fixed zone. You basically say, hey, in this location, when it's a fixed zone, I only want to show banners which are this size, 300 by 250 pixels. Now, you can also have flexible zones. Here I can define: you know what, my design allows 300 pixels wide, but I can show bigger banners, smaller banners, a variety of things. From a money perspective, this is better. Why? The bigger the pool of ads you can potentially serve to your users, typically the more money you make. The top one is better for end user experience because you know, hey, my placeholder is always this size. Which one did I implement? Of course, the top one. Now, a new problem arrived. Page is rendered. You see the nice placeholders. And then suddenly this happens. Watch carefully. Everything moves to the top. Which sane browser would do this? Safari, Chrome, Firefox? All browsers behave the same. However, ad blockers are not all the same. Ad blockers assume that when you have advertising, they should try to remove everything. So what happens is they detect the ads on my website.
Although they're privacy friendly, although they're fast, they get removed. And you have this shift. So how do you solve that? Not by blocking the ad blockers. If my users want ad blockers, that's fine. That's okay. They are free to use that. The solution is to add an additional container around your ad. So this is the ins element. That's the ad. Make sure that the container has the placeholder sizing. And then when the ad blocker arrives and deletes the ad, the container is left. So no layout shifts. And this is really my mantra. CLS should really be reduced to zero. Every single pixel which moves is in my view a bad thing, an annoying thing. So we covered privacy. We covered performance. Now let's look at the revenue perspective. Because in the end, I need the money to fund the hundreds of dollars which are paid every month in server costs. And when I started, it was easier: I used AdSense. So, banners do not mean bad. You can implement them in a positive way. If you have full control with open source, you are perfectly able to do that. And it's also possible to make that lightning fast. Now, I didn't get any money for this. So, I'm really dreaming already about this burger later on today. There is just one small problem. It's Robin. Robin, put your hand up. Robin is the next speaker. Robin is my colleague. And we also call him Mr. Quick, so the Quick restaurant, but he normally works on the HTTP protocol and he hates it when I call him Mr. Quick with a K. He stands between us. So, Robin, please talk fast so we can all go have a great lunch. Are there any questions? Yes. Thank you. So, I've heard that your Scalemates is very popular on various continents. So, for the answer, do you need to get practical somewhere in the continent or they're all like... Yeah, okay. Yeah.
So, the question is, Scalemates, my website, is visited by people across the globe: here in Belgium, in Australia, Japan, Brazil, everywhere around the globe. And the question was if I need to have a replicated setup. So I use a CDN, and that gives you replication for static content, images, etc. So, that's a given. But I actually also replicate my servers across the globe, not all. I have, for example, servers in Australia, in Japan, to make sure that when a user does a database call or does a search, they get an instant response. Thank you for the question. We have a few more minutes, I think, for questions. Two minutes. Otherwise, two minutes. Any additional questions? Yes, they're true. Yes, but the... Yeah, great question. So, the question is, in Revive, which ad providers can I integrate? In theory, in Revive, you could also make a non-privacy-friendly version, because you can also say, hey, in case I don't have any direct inventory, let's say, for example, with a scale modeling company, you can also decide to fall back to, for example, AdSense or anything else. And the only thing you need to do is add their JavaScript and your advertiser code, so in theory, you could integrate any SSP. But then you're back into the same game, then you... you have a performance impact and a privacy impact. So, Revive allows you, potentially, to do everything. Does that answer your question? Thank you. There was one question in the back as well. Yeah. So, the question is, which frameworks or modules did I use to build the website, or just for the advertising? Everything built myself from scratch. Yes. Yes, I... Yeah. Yeah. Everything built from scratch. The only thing I used was jQuery, and I still use jQuery on some admin sites of the thing. But yeah, saying jQuery in 2024 is not cool, but I'm okay with that. Now, everything, yeah, PHP built from scratch. Thanks for the question. Any additional questions?
Robin can maybe already come up as the next speaker. There was one question in the back. Yes. You can already switch my laptop, Robin. Yes. Just to look at it here, you are negotiating with these advertisers directly or...? Yes, correct. Yeah. So, the question is... Sorry? They call some people or you can go to them? Yeah. So, the question is, how do I get in touch with these advertisers? Because before, any banner would just show up. So, Revive also has an API. So, on my website, I basically have a page where you can sign up. You can create an account and register as a business account. And then I have a simplified interface where you can just upload the banners and you can ask, hey, is it for all scales or for specific scales? Are you targeting all scale modelers or just the aircraft ones or the shipbuilders? So, I have a simplified interface and they just sign up themselves. Thank you for that question. Yes. So, the whole question, have you had to deal with bad ads and bad actors? Yes and no. So, the question was, do I have to deal with bad ads and bad actors? Because I also have a shop database. And I basically already have a database of domains which are from scale modeling companies. So, when somebody signs up with a Revell domain, which is a brand, I basically know that it's linked to one, and I can give them some confidence. If I'm unsure, they can already start creating their campaigns, but I still need to enable them before they're actually published. And people can't add JavaScript on the website. In Revive, you can add JavaScript banners, but I blocked that, because JavaScript is bad for performance. Does that answer your question? Thank you. Thank you very much and have a great lunch. Thank you.
Insights from the RUM Archive
Oh, last talk of this session is about the RUM archive, which is a data set of anonymized real user monitoring measurements. Now I know what some of you are thinking, Robin, if it's a data set, why does it have a palm tree in the logo? That doesn't make any sense. But think about it for a second. What happens if you go to a palm tree and you shake it? Something interesting might fall out, like a coconut. And the same thing happens with the RUM archive. If you shake it a little bit, something interesting might fall out. But for both, you need to be a little bit careful. Because if you're not, the coconut might fall straight on your face, leaving you scarred for life. So we need to be a little bit cautious in how we query the RUM archive, and we'll get to that later. The first thing I want to explain is what is actually in there. How do we get the coconuts in there? So currently, all the data is from the Akamai mPulse product, which basically means we have a lot of Akamai customers that have mPulse, and they let us put a piece of JavaScript on each of their pages. So every time a page is loaded, we send what is called a beacon, which contains all the performance measurements and a lot of other metadata for later analysis. Now, usually, our customers only see their own data, obviously. And here, we want to make this more publicly available. So we have to do a couple of things first. First of all, we filter the data. We only take the top 100 customers in terms of traffic. We anonymize the data. This includes stripping all of the URLs. So you won't know which measurement belongs to which site, which is a sad but necessary operation that we have to do. And then we further aggregate the data so that many similar measurements are actually combined into a single histogram for later analysis. This gives us two data sets. One, of course, for the page loads, and then one for third-party resources.
These will be things like Google Analytics that are loaded from external URLs by many different customers. And so we can also offer some insights on that. We have most of the performance metrics you would expect, including some others, like rage clicks. This is if people got very frustrated: they start clicking the same area of the screen trying to make it work. For the third-party resources, we can also show if they were loaded from the cache or not. Very interesting. But crucially, one of the things we try to make the difference in is that we collect data from all the different browsers on all the different platforms. And you can, of course, also query on those from the data set as well. Now, you might be thinking, Robin, sounds fine, but don't we already have this from other public data sets? And partially, yes, this is true. We are blessed with very good web and web performance data sets, but we still feel that there are some gaps in there, gaps that we hope that the RUM archive might help fill, especially when it comes to things like cross-browser and real-user monitoring data. So let's say you're interested now and you say, how do I actually get access to this data? The main way is through Google BigQuery, where most of the data is stored. BigQuery is a very powerful, very flexible platform. It's sadly not the cheapest. It does cost you a bit of money. And even if you're willing to pay, it can take a while until you get useful data out of this, which is something one of the Mozillians here today noticed a while ago. The reasoning was sound. They were trying to look for user agent Firefox on device mobile, expecting to get Firefox mobile data, obviously. It doesn't actually work, because in the RUM archive, Firefox is really just Firefox desktop. If you want mobile, you need Firefox mobile for Android and Firefox iOS for iOS. This is because we at the RUM archive put stock in consistency above all things.
Now, especially for newer users, going to BigQuery directly is sometimes a bit of a big hurdle. So we also have a cheaper way, which we call the RUM Insights. This is basically the team saying, OK, this is what we think most people will want to know about this data. We do the queries, and then we have some ready-made visualizations on the website for those as well. They also do the access for you. Sadly, though, even the RUM Insights don't really help much for the Firefox mobile use case. As you can see, Firefox in our data set is definitely present on the desktop side. On the mobile side, none of the variants actually hit the 1% cutoff that we put for generating these diagrams. This is one of the many insights we can get from this data set, of course. Because having a nice coconut is of course all nice and dandy. You can't really do much with that, right? What you really want is to get to the juicy inside of the coconut, in this case the coconut milk. Now, I can hear some of you thinking, you're thinking, Robin, there is no such thing as coconut milk. Okay? Coconuts cannot be milked. They do not have nipples. And you would be correct for the latter part, of course. But there are still ways to get milk out of this. You know, you could hit it with a machete. Or if you're a bit more sophisticated, you could hammer a screwdriver into these black spots there. You could still get something out of there. The point is, there are many different ways of getting the milky insights out of the data nut. But they don't all give you the same results. And a good example of this, I found when I first started querying the RUM archive, I just wanted to know, you know, roughly mobile versus desktop. What are we dealing with here? And when I plotted that out, I actually saw this weird periodic pattern. You have these bumps and valleys in there, which seem to suggest that people switch the type of device they use every three months, which of course makes no sense. Okay?
And anyone who's ever done this kind of analysis already knows what this is. This is of course just a bit of temporal interference. Because what I did not want to do was have a separate data point for each and every day; that would be way too expensive in BigQuery, right? So what I wanted was just one day per month. And naive as I was, I chose the first day of every month. Now, this is not always the same day of the week, of course. This can very easily be a Saturday or a Sunday or a holiday, where you would expect more people to use mobiles than desktops. The solution is also very simple: instead of the first day, we just use the first Tuesday of the month. Not the Monday, because that's often also still a holiday or a vacation day. But Tuesday should give us more consistent results. It's not fully foolproof though, as I found out. The first of July last year was a Saturday, so the first Tuesday of July was the fourth of July, the big US holiday. And that definitely does show up in these metrics. But this is not just something specific to the RUM archive; every temporal data set has this. I think it bears repeating because people keep making the same mistakes there, including me. Now, diving a little bit deeper and looking at the different OSes that we see: on the desktop side, it's probably somewhat as you might expect. But on the mobile side, we have a very outsized representation of iOS devices, at nearly 63%. And I say outsized, because if you look at the actual sales numbers, globally, iOS fluctuates between about 15 and 20%. Even in some of the richer countries, like let's say Australia, you'd expect more of a 50-50 split. There are several reasons why iOS is overrepresented in the RUM archive. One of the main ones is that Akamai, as a company, is mostly present in the richer Western countries, right?
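The first-Tuesday sampling trick described a moment ago is a one-liner with Python's standard datetime module:

```python
import datetime

def first_tuesday(year: int, month: int) -> datetime.date:
    """First Tuesday of the month: a more stable sampling day than the 1st."""
    first = datetime.date(year, month, 1)
    # weekday(): Monday == 0, Tuesday == 1
    return first + datetime.timedelta(days=(1 - first.weekday()) % 7)

# The caveat from the talk: July 2023 started on a Saturday,
# so the first Tuesday was the 4th of July, a big US holiday.
print(first_tuesday(2023, 7))  # 2023-07-04
```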
And our customers are mostly from industries like e-commerce, luxury goods, and travel, which also address richer end users, who are more likely to be on, say, iOS devices. So there is definitely an ingrained bias in the current RUM archive data set that you need to be aware of. But that doesn't mean the data isn't useful, in my opinion. We can still do much interesting stuff with it. For example, I think this serves nicely to highlight one of the big problems I feel we have in web performance right now, which is our maybe somewhat overreliance on the Core Web Vitals and the Google CrUX data set. You might not know this, but on iOS you actually have no browser that can give you Core Web Vitals metrics, not even Chrome. This is because on iOS, every browser is actually Safari in disguise. Apple forces you to use the underlying WebKit engine, which does not support the Core Web Vitals. And so the more iOS traffic you have, the bigger your blind spot for those users is going to be if you only use the Core Web Vitals and the CrUX data set. And you might say, Robin, that's only a problem for the customers represented in your RUM archive. And I would argue that the RUM archive currently maybe does not represent the global web, but I do think it's somewhat representative of, for example, the e-commerce industry, which is definitely one that we consistently target when we talk about web performance. So I do think this can lead to interesting insights on that front. There is a silver lining to all of this. As you probably know, the EU is trying to force Apple to properly allow other browsers on iOS. Apple is dealing with this in one of the most disgusting ways ever, in my opinion. So I'm not quite sure how much this is actually going to change in practice, but still, it is a step in the right direction. Okay. And even if this doesn't happen, we can still do some cross-browser comparisons by looking at other metrics that are readily available in all browsers.
And we actually started doing this in the RUM archive, because we have those metrics, of course. I had hoped to present them to you today, but we want to be sure that we are 100% correct in our interpretation before we release any type of summary on that. So not yet, but soon; we are working on this. I don't want to leave you hanging for today, though. I do still want to give you something to take home. And this is because there is a shining ray of light in the darkness. A couple of months ago, Firefox actually announced that they would now start implementing Largest Contentful Paint, the first Core Web Vital available in non-Chromium browsers. And this actually went live in stable Firefox about two weeks ago. And we already have some of that data in the RUM archive, which I looked at. And if you compare this, you will see that Firefox is actually faster than Chrome for LCP, sometimes a little bit, and at later percentiles significantly faster. Now, what I think this means is that Firefox has won the browser speed wars, and we should all immediately switch to Firefox and dump Chrome. No, it's much too early for that. We don't know if this actually means that Firefox is faster, or if they just use a slightly different algorithm, or they identify different elements, or it's just a different type of site that Firefox users visit. We don't know, right? So don't read too much into these results. I just wanted to have something to start the discussion, to get people enticed to actually look into what the core reasons for these results are. But so, useful things for the future. We're talking about Core Web Vitals, so you might ask, Robin, what about the upcoming INP? INP is actually already well supported in the mPulse product. You can see here, this is an INP screenshot from the previous speaker's website, Tim's Scalemates. You can see Tim has a ton of work to do.
He claims he has the fastest website in the world, but we can all see the proof that it is not true. Shame on you, Tim, shame on you. So INP is in mPulse, it's just not piped through to the RUM archive yet. We expect this to happen in the coming months, and then we can also start analyzing data for that. So up until now, I've mostly been talking about the milk in the coconut, right? But we all know there is something else in the coconut as well: the flesh, the meat of the coconut. We rarely eat this directly. We usually process it into other foods, such as, for example, these delicious coconut cookies. These are actually kind of a Flemish specialty, I think. We call these rotsjes. I think they are amazing, amazing cookies. Now, one thing you might see is that there are several individual cookies in this box, right? But they all look kind of the same. They're all quite similar. And sadly, that is also something that we see for the third-party resources that we have in the RUM archive. Because if you start looking into this, a lot of them are from Google, as you might think. Most of them are ads, or tracking, or analytics, right? Most of these are things that the typical end user would probably like not to see loaded on the pages they visit. So it's a little bit ironic that we have to go all the way down to number 98 to find the first sign of something that was created to try and mediate some of this, which is the very first cookie consent manager, the GDPR backlash, let's say. I say try to deal with all of this; I'm a bit skeptical that it actually works. But I mean, the fact that you have almost 100 entries before the first cookie consent manager, I think, is a nice one-slide summary of some of the things that are wrong with the web today. Now, this was a bit of a downer, so I also wanted to end on a better note.
So I went through the whole list, and almost at the end, at number 498, I found something that we were all hoping to see, which was, of course, the jQuery mousewheel plugin, with 13,000 downloads every single day. Half of that is from Tim's site, as we just heard. So jQuery is still going strong. Let's hear it for jQuery. As I said before, we also have some other stats on these third-party resources. For example, how often they're loaded from cache. And at the median, this is actually quite low: it's only about 2%. I definitely think that browser cache partitioning plays into this. It gets better at higher percentiles, but most of these third parties are not actually loaded from cache. This might not be a huge problem, though, because most of them are also quite small. Most of these are tracking pixels that are just a few hundred bytes in size, though there are definitely outliers. One of the bigger ones that I found was a Google Ads JavaScript that was 131 kilobytes compressed. That's massive. And it was loaded over 260,000 times in a single day. So a very big impact just from that one external resource. Now, we had a lot of different resources, a lot of different cookies. Another thing that we have a lot of is browser versions. Because browsers, a few years ago, started updating themselves fairly regularly. For example, Chrome releases a new version almost every month. And the question there was, how long does it take for most users to switch to the new version? The answer is actually quite good, because within two weeks to a month, over 75% of Chrome users are on the latest version, and most of the remaining ones are on the previous version. There is only a very short tail of older versions present in the dataset. This is very similar for Firefox, which also updates very aggressively. But here we do see one interesting data point, which is the blue one here, which starts in August.
And even in December it still had about 13% usage. It turns out this is something they call the Extended Support Release, which is a long-term support version, probably mostly used by companies, I would imagine. So you do have a bit of a longer tail there. But other than that, Firefox is also very cutting edge, I would say. This is of course contrasted with Safari. It's not an entirely fair comparison, because with Safari we don't have the minor version numbers like with the others. But here we still see the global trends, right? The latest version of Safari is 17, and even after two months it didn't even reach 50% of the Apple population. And even version 15, which was released over a year ago, is still at about 14% of all page loads. So clearly in Safari you do have a lot of older versions, up to a year old and even older, present in your dataset. You can't really rely on newer features being readily available there. A very fun one was Facebook. They have a ton of versions, often multiple per week. And their clients apparently also update to the new versions very, very quickly. Meaning that I often had only one data point per version, which messed with my graphing library. It tries to draw a line, finds only one point, and then decides to just draw nothing at all. Now interestingly, this is exactly what would happen if you would leave me alone with these cookies. You would know that there was supposed to be something there, but there's no physical evidence of it whatsoever left. So, a couple of other things. This is again from Tim's website. Tim has his own very extensive RUM setup, as you by now all know. But even for people with their own RUM, I think it's useful to have the RUM archive next to that so you can compare both of these. For example, this is for the navigation types dimension. The biggest part is normal navigations: you click a link, you go to the site. You can also have back-forward navigations.
People press the back button, which should be much faster because the page should still be loaded somewhere in the browser. And then you have things like reloads, so people actually hard-reloading the page. Now, for the back-forward navigations, you want to see as much of that as possible, because that is the fastest navigation you can get. You can see here that Tim has clearly optimized very well for this use case, because he has a lot more people doing back-forward navigations than the averages that we see in the RUM archive. So good work, Tim. The same goes for reloads. Reloads you actually want as few of as possible, because when people reload, it usually means something has gone terribly wrong and they're doing the "have you tried turning it off and on again" method to try and fix it. Tim is only at about 1% there, which is much lower than what you see in the aggregated data as well. So it can be useful, even if you have RUM, to compare and see where you might improve or where you are actually doing better than others. Or let's say you want to move into a new region or a new country that you don't have RUM for yet. You can try and get some idea of what the situation is there before you actually do. And so, to thank Tim for everything that he does for the web performance community, I actually brought him a little gift. It's a palm tree scale model, Tim. I don't really know what to do with these. Maybe you can have one of your tanks drive over them or something. I don't know. But thank you, Tim. Thank you for that. Another thing I really wanted to look at was single-page apps. I have to admit something: I am still on Twitter. I still call it Twitter as well. And if you're on Twitter, sometimes it seems like everything is React. Nothing else exists on the web anymore. All of it is React, all of it is single-page apps, which I really hope is not the case.
But when I looked at this, I was somewhat surprised, because more than 40% of all page loads in the RUM archive are actually single-page apps, which is much more than I would have thought. Now, for web performance people, this is actually good news. This means we have a lot of job security down the line. So that's good. It's a little bit weird. And another very interesting point here is the difference between hard and soft. So hard means the initial load of the single-page app, the spinner that we saw before; that's basically the hard load. The argument being: you download more, it takes a longer time to load the very first time, but after that everything is much faster. That's usually the selling point for an SPA. But if you take this at face value, you would say that for every hard load, there was only one soft load after that, where you would expect a lot more soft loads than hard loads. And if that is actually the case, then that whole argument for SPAs doesn't actually hold up at all. Now, that's not what I'm saying. We need more research. I need to look deeper into the data. There could be other explanations for this. But it's interesting to think about. I definitely did not expect these results, and I would love to compare them with other datasets as well. I'm running out of time. I had a little bit about HTTP/3 there as well, including some things where I got very angry. But let's skip that, because I really want to get to this final page. Because we all know that coconuts are amazing. They are exceptionally delicious. You can make a lot of different products from them. But you can make them even better if you combine the coconuts with something else. For example, delicious Belgian chocolate. I think you can get into very much a 1 plus 1 equals 3 situation with this. In case you haven't tried this Belgian coconut chocolate, it is to die for. Definitely try it out. What I'm trying to say is that currently the RUM archive only has Akamai mPulse data.
We are very much open to other RUM vendors, or even large sites with a big RUM presence, contributing data to the dataset as well, to hopefully help us remove some of the biases that we've seen and get a better picture of the actual global web in the RUM archive. Some of you might think, sounds interesting, Robin, but this is going to be a lot of work, isn't it? No, no, it's actually super easy, barely an inconvenience. Because if you look at the SQL query that we use to put mPulse into the RUM archive, that is only 1.6K lines of SQL. Only 1.6K lines! Very simple, a couple of hours tops, and the data is in. Well, I guess the message is clear. The RUM archive is now open for business. What I talked about today is really just the highlights, the tip of what we can do. We have literally only just started sipping the coconut milk there. So if you want to help out with that, please come. If you have any questions, if you want us to run some queries for you, if you want to help with the analysis, please let us know. And if by now you are just really, really hungry, I would say please come and try some of the excellent chocolates and cookies, because there's no way I'm taking them home with me today. Okay, so please. Thank you.
Linux on a Confidential VM in a cloud: where's the challenge?
Hello, everyone. Welcome to the virtualization dev room. My name is Vitaly. I normally work for Red Hat, and you can see me being active in the KVM community as well as taking care of Linux on all types of third-party hypervisors and public clouds. And today I wanted to talk about bringing general-purpose Linux distributions to a newly introduced VM type on public clouds, which is confidential virtual machines. So, if you haven't been living in a cave with no internet over the last couple of years, which I wouldn't blame you for, because the world is a crazy place to be in now, you may have noticed that some hyperscalers were announcing or releasing their confidential VM instance types or features. I'm not here to advertise any of them, but just for reference: Google was probably the first, with their plain AMD SEV option in 2020, and now they even have SEV-SNP in public preview as of last week or the week before. Microsoft Azure was the first to commercialize an SEV-SNP offering, which went GA in 2022, and you can see they now have an Intel TDX option available in public preview. And Amazon offers an SEV-SNP feature in GA. So it sounds confidential, so it must be good, right? Because we all like it when our data is confidential. But what does it actually give us? What are these technologies about? Both AMD SEV with all its variants and Intel TDX are CPU technologies. So the first thing they give you is memory encryption: your VM's memory cannot be read by your hypervisor or other guests. Second, which is important and which wasn't in the first implementations like plain SEV, is that your CPU state is encrypted, because normally the hypervisor can see, for example, your registers while your VM is executing, and if it can stop you at every cycle, it can certainly read your data.
And the last thing, which is also important, is that memory integrity guarantees are provided to you, because even when your memory is encrypted, a hypervisor which is malicious or compromised can still try to, for example, swap two memory pages. They will remain encrypted, but your guest will access the wrong one, right? And it can probably mount an attack using this technique. So this all sounds great, but when we talk about confidentiality, we normally say that confidentiality must be achieved at runtime, at rest, and in transit, right? Very generic. And all these things which I just described give you confidentiality at runtime. So what about the rest? Confidentiality of the data in transit is not really specific to CVMs, because we have been doing this for years, right? We know the internet is not a safe place, so we need to encrypt our data when we send it through public channels. But what about storage, right? How do we ensure that the storage of the VM is also confidential? Because even if you have something which is confidential in memory, you will eventually need to write it to disk, and you will need to do other things, like reading your operating system from the disk. So you need some guarantees there. The last thing I wanted to mention is that these confidential VM technologies don't give you any additional guarantees when you're already within the VM. So if you have an application which is attacked there, nothing's going to save you, right? The hypervisor cannot see your data, but everything which is within the VM can normally see the data. That's how it works, right? We want to put general-purpose operating systems there. So yes, let's discuss a little bit about protecting data at rest, because it seems that the hardware technologies don't give us this. So first, you want to protect at the guest level. If some cloud tells you, oh, but we are encrypting our disks, right?
Like, you don't need to worry. Yes, but then the cloud has the key, right? If it can encrypt and decrypt for me in a transparent way, then it's not confidential from this perspective. So you need to do it from the guest. And the thing is, you need to somehow protect the operating system itself, and not only the data you care about. First, you have some data which is really sensitive; think SSH host keys, right? If somebody can read those from your VM, they can impersonate you and pretend that they're you, you know? You don't want this. Second, you might say, oh, I'm running a general-purpose operating system there, it's open source, why would I need to protect it? You probably don't need to protect it from arbitrary reading from the host, but you still need to protect it from writing, because a malicious host can try to mount an attack by modifying something in the operating system. Think about swapping the sshd binary with something, you know? How would you notice? You won't. And the good thing is that we have technologies in Linux which have been mature for years, like LUKS, or things like dm-verity for integrity protection, which you can use. Because even when you store your encryption key or your integrity hash in memory, it is protected from the host; remember, your memory is encrypted, the host cannot read it. The thing is, the guest needs to somehow get this key when it starts, and where would it get it from? So, yes, let's take a look at how Linux normally boots and how we can implement, say, full disk encryption. You start booting from firmware; normally everything is UEFI now, and all these confidential instances are UEFI. So there is some firmware which comes from the cloud vendor, but that's another story. Why would you trust this firmware? You probably shouldn't, but anyway. So, then you will always have some unencrypted part, right?
Because the firmware cannot jump into the encrypted part without knowing the key, right? You want to do the decryption yourself; you don't want to offload this job to someone else. So you will always have something like a bootloader, kernel, and initramfs stored there in the clear. Yes, you may say that we can actually do decryption at the bootloader level, which is true, but then we are complicating the bootloader a lot, and the only one which does it is probably GRUB, and nobody likes that. I mean, it becomes a whole operating system with all the complexity and everything, and you don't really want that for your bootloader. You want it to be really small, if present at all; maybe you don't even want to have a bootloader for the confidential case. So then you will jump into this encrypted part: you will somehow get the key and then decrypt it. That's how it's going to work. So, yeah, how can you provide the key to the VM? You cannot do it manually. With GRUB, for example, you can type it on your console, but you cannot do that on a cloud, because you don't trust the console. The console is an emulated device there, right? If you type your password there, the cloud will know the password. So you're not going to do that, and you will need to provide it in an automated fashion. So, one suggestion is that if you want to have a virtual TPM device, you run a separate domain, like another virtual machine, which will have this TPM device. It's really hard to implement, and the TDX 1.5 specification, I think, added partitioning, which is somewhat similar to trust levels, and I think that's what clouds are going to use. Although, you don't know; some clouds may actually implement an emulated device on the host, just like you do with QEMU and swtpm, right? You can run it as a process on the host. And not all of these solutions will give you confidentiality.
For example, the one which runs on the host obviously won't. Then there are two types of TPMs, normally: stateful and stateless. A stateful TPM is a TPM which has its state, right? Think about it this way: it has a private key and it never changes. It's generated once when your VM is created, and then every time the TPM is loaded, you can use it for encrypting and decrypting something. A stateless TPM is just firmware which will generate a new key every time it boots. So, how can we use this? Let's first talk about the stateful TPM. All these hyperscalers give you some sort of stateful TPM. The question is where the state is stored, right? Because you can turn off your VM and turn it back on, so the state needs to be saved somewhere. And it's not part of your encrypted root volume or anything; it's somewhere else. So far, again, not an advertisement, but publicly only Azure claims that this state is kept securely, that there is some attestation going on under the hood when this TPM loads, which protects it from the underlying host. You can't say much about the other implementations, because no such claims were made. So, you know, you don't know whether you can use it to isolate from your host or not. What's good about a stateful TPM is that you can implement root volume pre-encryption, right? There is a device which has a private key, so it can decrypt something. So you can take your root volume, encrypt it, and upload it in an encrypted state. And that's something which, for example, Azure confidential disk encryption is doing. In theory, we don't need to pre-encrypt. We can probably do something like self-encryption. And there are such ideas floating in the air: you start with a general-purpose Linux distro, do some integrity checking, and on the first boot you encrypt your root volume and seal the key to the TPM.
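As a sketch of what that first-boot self-encryption idea could look like: the snippet below only assembles the commands (cryptsetup's in-place LUKS2 re-encryption plus systemd-cryptenroll's TPM2 binding) without running them. The device path and the PCR choice are assumptions, and a real implementation would have to run inside the initrd, with all the attestation caveats discussed here.

```python
# Sketch of "self-encrypt on first boot, seal the key to the vTPM".
# /dev/vda2 and PCR 7 are illustrative assumptions; the commands are
# only assembled here, never executed.

def first_boot_encrypt_cmds(device: str = "/dev/vda2") -> list[list[str]]:
    return [
        # In-place LUKS2 encryption of the existing root volume.
        ["cryptsetup", "reencrypt", "--encrypt",
         "--reduce-device-size", "32m", device],
        # Bind the volume key to the TPM, tied to Secure Boot state (PCR 7).
        ["systemd-cryptenroll", "--tpm2-device=auto",
         "--tpm2-pcrs=7", device],
    ]

for cmd in first_boot_encrypt_cmds():
    print(" ".join(cmd))
```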
But I haven't seen such an implementation yet. It's probably possible, but it's kind of hard, because you need to prove that the environment where you did the initial encryption was sane, that it was really a confidential VM doing the initial encryption. Otherwise, someone can try doing it in some other place and attack your VM. So, the stateless TPM. Currently, I only know about Azure TDX, which publicly offers this option. But what's good about a stateless TPM is that it's just a program; it's just part of the firmware. So you can take the initial launch measurement and attest it. It never changes, right? You don't need to attest the state of the vTPM; it's going to get generated every time, which is good. The thing is that, again, as I said, currently you will have to trust your cloud provider with the provided vTPM. And there is nothing like bring-your-own-firmware in public clouds. You can still use it for volume disk encryption if you want to use a TPM, but you will probably have to do some attestation and then inject some intermediary key. And also, there is nothing like this in the standard Linux tools. Binding root volume encryption to a TPM is something which is generally supported by systemd or Clevis or other solutions, but something which would do attestation to a remote server and then bring in the key is just non-existent. Second, what do you do with the vTPM if the cloud provider is not telling you that its state is isolated from the host? Or doesn't actually tell you how it's implemented? The thing is, you cannot use it, right? You probably cannot even use it for things like PCR measurements, because if it's an emulated device, it can certainly get messed with, you know, and then you will see different measurements. So the only thing you can do in this case is try ignoring this thing completely and rely on architectural attestation: registers which both SEV and TDX give you.
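What that missing tooling would roughly have to do can be sketched as a key-broker exchange. Everything below, the function names, the URL, the payloads, is hypothetical; it only illustrates the flow of evidence going out and the key coming back, not any existing API.

```python
# Conceptual sketch of initrd-side attestation: fetch hardware evidence,
# send it to a key broker server (KBS-style), and only receive the root
# volume key if the evidence checks out. All names here are hypothetical.

def get_hw_evidence() -> bytes:
    # Would read an SEV-SNP report or TDX quote from the guest device
    # (e.g. /dev/sev-guest); stubbed out here.
    return b"attestation-report"

def request_key(kbs_url: str, evidence: bytes) -> bytes:
    # Would POST the evidence to the key broker; the broker verifies the
    # launch measurement and, if happy, releases the disk key. Stubbed.
    assert evidence, "no evidence, no key"
    return b"root-volume-key"

def unlock_root(kbs_url: str = "https://kbs.example.com") -> bytes:
    evidence = get_hw_evidence()
    key = request_key(kbs_url, evidence)
    # Would now feed `key` to cryptsetup to open the root volume.
    return key

print(unlock_root())
```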
The thing is, again, that our standard Linux tools for volume encryption and so on don't know anything about this currently, right? So you will have to come up with a solution for attestation and for delivering the root volume key or password there. And it's not done yet. So, just a few words about this unencrypted part, which I told you will always be there. Even if you do what you call full disk encryption, it's not going to be full, because you need to load the kernel and so on. So, how can you prove that these things are good? Normally, we have two technologies which have been used. One is called Secure Boot, the other is called measured boot. Secure Boot without a space, measured boot with a space; nobody knows why. Anyway, Secure Boot proves that all loaded EFI binaries are signed by a trusted party, and measured boot basically measures every important fact about the boot, like the binaries and the certificates which signed those binaries, into special registers of the TPM device. And we need to check basically everything which is being loaded. As I told you, for a general-purpose Linux distro, you will normally end up with a kernel, an initramfs, and a kernel command line available in the clear, not encrypted. And to protect these things, a concept called the Unified Kernel Image (UKI) was introduced, which is a very simple thing. You just take all these artifacts, kernel, initramfs, command line, sign them together, and make it a UEFI binary which extracts itself and launches the kernel after that. The implications of this are, of course, that it's more secure, but it's less convenient to use. The initramfs becomes static, generated when we build the UKI. And normally, for a general-purpose Linux distro, we want our vendors to build the UKI. You want to just install an RPM and get a UKI. You don't want to build it yourself.
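For reference, building a UKI yourself with systemd's ukify tool looks roughly like this; the paths and signing keys are placeholders, and the command is only assembled here, not executed.

```python
# What "building a UKI yourself" roughly looks like with systemd's ukify.
# All file paths and key names below are placeholders.

def ukify_cmd(kernel: str, initrd: str, cmdline: str,
              out: str = "linux.efi") -> list[str]:
    return [
        "ukify", "build",
        f"--linux={kernel}",
        f"--initrd={initrd}",
        f"--cmdline={cmdline}",             # baked in: one size fits all
        "--secureboot-private-key=db.key",  # must verify against the firmware's db
        "--secureboot-certificate=db.crt",
        f"--output={out}",
    ]

print(" ".join(ukify_cmd("vmlinuz", "initrd.img", "ro rootflags=subvol=root")))
```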
Otherwise, you will have to get your keys provisioned in the firmware, and not all clouds allow that. They may have a vendor certificate there in UEFI by default and may not give you an option to put your own there. So you will get a static initramfs, which may or may not be a problem. Of course, you have fewer demands on the initramfs on public clouds; you don't normally need to do network boot or anything there. But it's still limited. There is a system extension feature in systemd which can be used, with limitations, to do initramfs extension. Emanuele is going to give a talk in an hour, after me, about extending UKIs, which is going to cover how this can be done. The other limitation is that the kernel command line becomes static, so it becomes one-size-fits-all. When we as a vendor like Fedora build a Fedora UKI, we need to hard-code the kernel command line. You cannot pass root=UUID anymore, so you need to rely on something like auto-discovery. And again, we just got an extension mechanism which is called signed extensions: you basically place a signed UEFI binary stub in the ESP and get your kernel command line extended. This is already publicly released in systemd, but the tools are still adopting it. I haven't seen a fully working solution yet, but we're actively working on it in Fedora. Last but not least is how you boot your UKI. It is a UEFI binary, so it must pass Secure Boot checks, so it must be signed. And you can boot it either directly from the firmware, or you can, for example, boot it from shim if you want to have shim for some reason, for example if the cloud provider does not allow you to have your vendor certificate in the Secure Boot db. But you will still have to manage your UEFI variables, because there is nothing like a boot menu there if you are booting directly from the firmware, right?
In Fedora, we now have a package called uki-direct which can manage it for you automatically. We do things like A/B booting: when you install a new UKI, it's going to be tried once. If it boots, it becomes the default; if it doesn't boot, you revert back after the reboot to the old UKI. Because otherwise, if it doesn't boot, you are completely screwed; you won't even be able to access your encrypted root volume. Yes, so if we speak about a stateless TPM, where we don't want to have to trust the cloud provider doing attestation of the vTPM state under the hood, then we will need an attestation server and client. And again, there are some offerings in the proprietary world, like Intel was advertising Project Amber, but there is nothing you can use today in the open source world. There are attempts to implement this in the Confidential Containers project; there is this thing called KBS, which is both a protocol and an implementation of a key broker server. But again, we will need something in the standard tools to do attestation, and we are yet to figure out how to tell this thing which server to attest to. Yes, so we talked a little bit about encryption. As I said, for the root volume you need to at least ensure that it wasn't tampered with, and for that you can probably use integrity checking. But then the problems are very similar, because now, instead of the password, you will have to somehow convey the right root hash to use for the checked part. Right? Yeah, so I'm a little bit out of time here. But yes, you will still need all the technologies which I described for encryption, and you will have to ensure the integrity of this non-encrypted, non-verified part, because the UKI is still going to be on the ESP, which is VFAT, so you cannot encrypt anything there. Right? Okay, so just a few words.
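For contrast with the attestation gap described above: in the non-confidential case, the standard way to tie the root volume key to the measured boot state is TPM2 enrollment. A minimal sketch, assuming `/dev/vda2` is a LUKS root volume (in the confidential-VM case, this is exactly the step that needs an external attestation server instead):

```shell
# Sketch: bind an existing LUKS volume's unlock key to TPM PCR state, so the
# volume only opens when the measured boot chain matches. Device is an example;
# PCR 7 covers SecureBoot state, PCR 11 the UKI measurements.
systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7+11 /dev/vda2
```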
Even if you have your VM started and checked and everything, you need to verify that you are actually connecting to the VM you expect, because think about the host starting your VM somewhere and then starting another one which is completely controlled by the host, where, you know, all the binaries are changed. How would you know which VM you are connecting to? So you probably need runtime attestation, and the clouds are offering you something, but there is also no open source, standard solution for that. Okay, I'll skip to the last and most important slide. Thank you very much for listening. We probably don't really have time for questions, but I can take as many as I can before dying in the hallway. Yeah, so thank you.
How Much Do You Know about Snapshot
Okay, hello everyone. My name is Titi Ma, I'm from Red Hat, and today my topic is about snapshots, especially their implementation in OpenShift Virtualization, OpenStack and libvirt. Actually, I'm a QE for QEMU, which is very close to libvirt, and the main products built on libvirt for us are OpenShift Virtualization and OpenStack, so I made some investigation here, and this is today's agenda. So first, what is a snapshot? A snapshot is a point-in-time representation or copy of the state of a system, software, a disk, a virtual machine or anything else, but today I'm mainly focused on disks and virtual machines. Snapshots play a vital role in virtualization, as they are used for data backup and recovery. We know that data is always important for any user, and compared to traditional data backup, a snapshot allows a quicker backup and restore. We can also take different snapshots at different points in time, which means we can restore to any historical version of our system state. So here are some general use cases for snapshots. In our daily work, we may hit system failures or data corruption; if we have a snapshot, we can use it for backup and disaster recovery. Snapshots are also useful for testing or development environments: we may destroy our system during our work, and if we have a snapshot, we can make use of it in this scenario too. Snapshots can also be used for system upgrades or software updates: if an upgrade fails, we can roll back to the older version of our system. They can also be used in training and education scenarios, where students may make mistakes during their learning; with a snapshot we can roll back to the initial state of the system. And they can also be used for customer issue replication, meaning we can save the customer environment as a snapshot.
And we can use this snapshot to do the debugging, which accelerates problem solving. A snapshot can also be used for security incident recovery: in today's networked world, malware is everywhere, so if our system is attacked, we can make use of a snapshot in this scenario as well. OK, from now on I will talk about snapshots on the three platforms. The first part is about snapshots in OpenShift Virtualization. Actually, OpenShift Virtualization is an add-on for the OpenShift Container Platform, and OpenShift provides robust capabilities for snapshots, as it extends the base OpenShift snapshot feature to include guest OS operation coordination and multi-disk management. From user space, there are two methods to create a snapshot: one is through the web console, and the other is through the oc command line with a YAML file, in which we define a VirtualMachineSnapshot as a custom resource. The snapshot in OpenShift Virtualization can be created when the guest is powered on or powered off; both are supported. When the guest is powered on, we are usually recommended to install the guest agent software in the guest. The guest agent here is used to freeze the file system of the guest, which gives time to flush in-memory data to disk before the disk snapshot is created, to guarantee data consistency. Okay, actually, the VM snapshot in OpenShift Virtualization makes use of volume snapshots. The VM snapshot here, defined in YAML, actually creates corresponding VolumeSnapshots for all the supported volumes of the VM. And the source of a VolumeSnapshot is usually a PVC, a persistent volume claim. We know that the real data of the PVC is stored in a PV, a persistent volume, and it can be classified into different storage classes based on different storage backends.
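The YAML method mentioned above looks roughly like this. This is a sketch based on the KubeVirt snapshot API as I understand it; the resource name and VM name are made up, and the API version may differ between releases, so check the cluster's installed CRDs before relying on it:

```shell
# Sketch: create a VM snapshot in OpenShift Virtualization by applying a
# VirtualMachineSnapshot custom resource. All names are illustrative.
oc apply -f - <<'EOF'
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
  name: my-vm-snap-1
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: my-vm
EOF
```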
It is the same for volume snapshots: the real data of a VolumeSnapshot is stored in a VolumeSnapshotContent object, and it can also be divided into different volume snapshot classes. Okay, let's look at the general data flow for a snapshot in OpenShift Virtualization. Say there is a user request to create a volume snapshot. The request will be sent to the snapshot controller; this controller is deployed in the control plane of OpenShift, and it is watching VolumeSnapshot objects. Once it detects such an object, it creates the corresponding VolumeSnapshotContent. There is another component named csi-snapshotter, which is a sidecar container in the CSI driver pod, and it is watching VolumeSnapshotContent objects. Once it detects one, it triggers the snapshot create operation. And based on the different storage backends, the commands issued are different: for RBD, it uses RBD snapshot-related commands; for NFS, it uses the tar command to do the snapshot; for hostpath local files, it uses the tar command as well; and for block devices, it uses dd-related commands for the snapshot operations. Yeah. Okay. About snapshots in OpenStack: as in OpenShift Virtualization, there are also VM snapshots and volume snapshots. Actually, the VM snapshot here is different from OpenShift Virtualization: it creates image snapshots, and the snapshot is saved as an image file in OpenStack, which means that if you'd like to restore from this snapshot, you need to relaunch a new instance from the snapshot file. Also, for data consistency, the guest agent is likewise recommended to be installed before the snapshot is created.
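To make the file-level mechanisms concrete, here is a toy sketch of what a tar-based "snapshot" and restore of a hostpath directory amounts to. All paths are invented, and the real CSI drivers of course add their own naming, metadata and locking on top:

```shell
# Toy demo: snapshot a directory with tar, damage it, then restore.
rm -rf /tmp/snapdemo && mkdir -p /tmp/snapdemo/data
echo "important" > /tmp/snapdemo/data/file.txt

# 1. "Snapshot": archive the directory's current state.
tar -C /tmp/snapdemo -czf /tmp/snapdemo/data.snap.tar.gz data

# 2. Simulate corruption / loss of the original data.
rm -rf /tmp/snapdemo/data

# 3. Restore from the snapshot archive.
tar -C /tmp/snapdemo -xzf /tmp/snapdemo/data.snap.tar.gz
cat /tmp/snapdemo/data/file.txt    # prints: important
```

The dd case for block devices is the same idea, copying raw device contents instead of archiving files.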
And for volume snapshots, OpenStack is similar to OpenShift Virtualization, except that the commands here are OpenStack-related, like openstack volume snapshot create, and for the restore it uses Cinder-related commands. Okay, let's look into the data flow here. For a volume snapshot in OpenStack, it's the same: there is a user request from user space, and the request is sent to the Cinder component. First it goes to the Cinder API, which does some basic checks, and then it is sent to the Cinder scheduler, which schedules the request to the different storage backends, just like in OpenShift Virtualization. And for the different storage backends, the commands issued are again different: for RBD it's the same, it uses RBD snapshot-related commands; for NFS it's different, it uses qemu-img, this QEMU tool, to do the snapshot; and for LVM it uses LV-related commands. And about the VM snapshot in OpenStack, it's different; it's also different from OpenShift Virtualization, as it does not make use of the volume snapshot. The code flow here is mainly implemented in Nova, and it can be divided into live snapshot and cold snapshot. For a live snapshot, the data flow is: first, it uses qemu-img to create a delta disk; then it uses the libvirt blockRebase API to rebase this delta disk onto the root disk file; then it uses qemu-img to convert this delta into the snapshot image; and after the snapshot file is created, it deletes the delta disk. For a cold snapshot, it just uses qemu-img convert directly to do the data transfer. Actually, when I first saw this workflow, I was confused: why not use libvirt snapshots directly? The workflow here is just some libvirt APIs and qemu-img-related commands, so why not use the libvirt snapshot?
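The live-snapshot sequence just described can be sketched with plain qemu-img commands. File names are invented, and in reality Nova drives step 2 through the libvirt virDomainBlockRebase API on the running domain, not from a shell:

```shell
# Sketch of Nova's live-snapshot flow, with hypothetical file names.
# 1. Create a delta (overlay) disk backed by the running root disk.
qemu-img create -f qcow2 -b root.qcow2 -F qcow2 delta.qcow2
# 2. (Via libvirt blockRebase) the active disk state is copied into the delta
#    while the VM keeps running.
# 3. Convert the delta into the final standalone snapshot image.
qemu-img convert -f qcow2 -O qcow2 delta.qcow2 snapshot.qcow2
# 4. Delete the delta disk once the snapshot file exists.
rm -f delta.qcow2
```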
Actually, the reason is that, per the current RHEL release notes, the libvirt snapshot is not recommended to be used. OK, let's look at why it is not recommended. What is the current status of the VM snapshot in libvirt? The libvirt snapshot currently uses internal snapshots. So what is an internal snapshot? Internal snapshot means that the snapshot data is saved inside the same base image file itself; we can imagine that the snapshot and the base image are merged into one file, which makes it hard to maintain. Actually, development of this feature has stopped at the QEMU level, and it is planned to be disabled in the future. Another thing I'd like to highlight is that the VM snapshot in libvirt is truly different from the VM snapshot in OpenShift Virtualization and OpenStack. OpenShift Virtualization and OpenStack use a guest agent to help guarantee data consistency, but a libvirt snapshot includes the complete system info: it includes the complete memory data and also the disk state in the snapshot file, so it can guarantee data consistency here. Also, with libvirt snapshots we can do disk-only snapshots. And given this disadvantage of the internal snapshot, libvirt upstream is working on external snapshots now. As for the current status, we can create external snapshots now, but restore and delete are still under development; there is an issue tracking it, and it is planned to be released in libvirt 10. So eventually, when this feature is fully supported, from the data consistency perspective this could be a perfect option for snapshots. But there are still some limitations for the VM snapshot in libvirt, as it does not support all storage backends that well, and the image format of the snapshot file in libvirt must be qcow2.
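In virsh terms, the two flavors discussed above look roughly like this; the domain and file names are examples:

```shell
# Internal snapshot: snapshot state is stored inside the existing qcow2 image.
virsh snapshot-create-as demo-vm snap-internal

# External, disk-only snapshot: a new overlay file is created per disk.
virsh snapshot-create-as demo-vm snap-external \
    --disk-only \
    --diskspec vda,snapshot=external,file=/var/lib/libvirt/images/snap-external.qcow2
```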
And qcow2 is not suitable for some backends, like the RBD backend: from the official documentation we learned that qcow2 is not recommended over RBD, as there are performance issues there. So, let me give a brief summary. At a high level, we can divide snapshots into two kinds, cold and live. Cold means the VM is powered off, so we can guarantee data consistency. But most customers prefer live snapshots, as there are applications running in the VM that they want to keep running while doing a snapshot. About live snapshots, we can further divide them: one option is the disk-only (volume-only) snapshot, where there is no memory data, which means there is potential data inconsistency here. The other option is the whole-VM snapshot, and there are two choices for that. Like in OpenShift Virtualization or OpenStack, you can make use of the guest agent component, which is used to freeze the file system. But the catch is that it just quiesces the file system as much as possible; it also depends on the workload, which means that with a very heavy workload in the VM, there is still potential data loss. The other choice is the libvirt snapshot, which includes the complete memory info in the snapshot file; but as I also said, there are some limitations with the different storage backends. So, always start from your requirements and your environment, and choose the option that suits you best. Okay, that's all of my presentation. Thanks for listening. Thank you.
UKI addons and extensions: safely extending UKIs kernel command line and initrd
Okay. Hello, everyone. My name is Emanuele Giuseppe Esposito. I'm a software engineer at Red Hat, and today I'm talking about UKI addons and extensions: how to safely extend the UKI's kernel command line and initrd. So why this talk? First of all, because this is extremely new stuff, and hopefully also exciting; there's not a lot of documentation, of course, because it was just merged, and hopefully this talk will help you understand a little bit more about what these addons are and how to use them. They can be very useful, because the UKI, as Vitaly explained in his talk one hour ago, is pretty static as far as the command line and initrd are concerned, and with these addons we can extend those two things without sacrificing security. And also, yeah, this is an attempt to advertise UKIs a little bit, so that they become more widely recognized. So let's look first at Vitaly's slides; these are from last year, I think, so I will just briefly go through them. A confidential VM provides data protection from the host it runs on: we are protecting the VM from the hypervisor, because it could be malicious and it's privileged, so it can access the VM, and we don't want that. The host is still able to disrupt the execution of the VM. There is specific hardware, SEV-SNP and TDX, responsible for encrypting memory and CPU state, and storage encryption is necessary for security and must be done by the guest OS; this was already explained by Vitaly. And the usual situation is that while the kernel is signed by the vendor, the initramfs and the command line are locally produced, not signed, and also difficult to measure, of course. Whereas with the UKI, the unified kernel image, there is basically a single binary produced and signed by the vendor, in this case Red Hat.
And it basically contains the important parts as PE sections, together with the signature: there is the kernel, the initramfs, and also the command line as a separate section that is then fed to the kernel. Before going into the details, I wanted to explain the use case for this talk. We have UEFI, the firmware, which in turn calls shim, the boot loader, which in turn calls systemd-stub, which is the key piece for the addons; the stub unpacks the UKI and boots the kernel with the command line and the initramfs, which in turn gets the real OS running. The issue, which Vitaly also mentioned, is that the kernel command line is immutable, and that's something we don't like, because there are limitations: you cannot have one static command line for every use case you have, there are crash-kernel options, debugging options, and we cannot ship a different UKI for every use case. So what we are aiming for with the UKI kernel command line: it cannot be static, as I said, because there are different use cases; it has to be secure, so whoever modifies the command line has to be authenticated, otherwise the whole point of confidential computing is lost, and by default nobody can, because the command line is embedded inside the UKI and then signed, so you cannot modify it anymore; and it has to be extensible, of course, because we don't want to ship a new UKI every single time. For those who know UKIs, there are already ways to add to the kernel command line of a UKI, but when we talk about confidential virtual machines it's a little bit tricky, because, as I'll show you for each option, you need to trust a lot of parties. So, as I said, there is the command line section: it's embedded in the UKI, it's generated with the UKI, it's secure, it's shipped with the UKI altogether, but it's static, it cannot be modified.
Then there is the EFI shell command line, which is looked at by systemd-stub only if the command line section inside the UKI is missing; many distros always ship something in the command line section inside the UKI, so it's ignored. It's usually useful for type 1 boot entries, but again, it's unsafe, because an attacker can easily inject their own parameters through the EFI shell; that's why it was disabled for CVMs, so you cannot extend the kernel command line with the EFI shell. There is SMBIOS, the System Management BIOS, with embedded metadata; this is good, it's trusted because it's coming from the firmware, but it doesn't apply to CVMs, because again the hypervisor can easily inject a kernel command line there. So, yeah, as I said, it's not good, so this was also disabled. And then there is the QEMU firmware configuration; by the name you can already figure that this is QEMU-only, and it's again coming from the hypervisor, so it was also disabled. Then what do we do? Our initial proposal upstream was an allow list. An allow list is basically another PE section where you use regexes, globbing, something like this, to parse the command line that you want to accept, and the easy case would be: if there is something in the command line that the regex doesn't accept, we just discard the whole command line. The command line would still come from the EFI shell, SMBIOS, all these sources, but we try to filter, and systemd-stub does the parsing. The advantage is of course that we can reject what we don't want, but the problem is just moved to another place, because then you can do attacks on the regexes and globbing, since they need to be very carefully formulated. So this was also rejected, and eventually we have the solution, the systemd solution: UKI addons.
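In concrete terms, such an addon is a small signed PE binary built with ukify and dropped into well-known locations on the ESP, which the talk details next. A sketch, with every file name invented and a scratch directory standing in for the mounted ESP:

```shell
# An addon would be built and signed with something like (placeholder names):
#   ukify build --cmdline "console=ttyS0,115200" \
#       --secureboot-private-key db.key --secureboot-certificate db.crt \
#       --output debug.addon.efi
# systemd-stub then picks addons up from well-known ESP locations:
ESP=/tmp/esp-demo
rm -rf "$ESP"
mkdir -p "$ESP/loader/addons"                        # global: applies to all UKIs
touch "$ESP/loader/addons/debug.addon.efi"
mkdir -p "$ESP/EFI/Linux/linux-6.7.efi.extra.d"      # per-UKI: <name>.extra.d/
touch "$ESP/EFI/Linux/linux-6.7.efi"
touch "$ESP/EFI/Linux/linux-6.7.efi.extra.d/crashkernel.addon.efi"
```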
A UKI addon is basically another separate PE binary which contains very few PE sections; one of these is the command line. It can be signed, and should be signed for CVMs, and we take advantage of the verification function offered by shim to validate the PE signature. Basically, this means that systemd-stub will ask shim to validate whether the binary has been signed by some key that we trust in the SecureBoot database. There is a very useful tool, ukify, in systemd upstream: you can create UKIs very easily, much better than with dracut and objcopy, and you can also create addons. The command line case is very easy, and you can also provide the keys when you want to sign your own addon. So this is the solution. How does it work? The workflow is: first you create the addon, so you ask ukify to create an addon with the command line that you want; then the addon needs to be put in a specific location in the ESP, and I will show you later where exactly; systemd-stub looks in this location and finds the addons automatically; it asks shim, calling shim verify on the addon, to check that the addon is trusted, that is, signed by somebody that we trust; and then, if validation is successful, systemd-stub reads the addon and appends the command line inside the addon to the UKI command line section, to extend it, and the result is provided to vmlinux, so Linux starts with the new command line. There are two kinds of addons: global and local. Global addons are applied to all installed UKIs, and there is one location for those. UKI-specific addons, if you want to apply an addon to one specific UKI you have installed, have to be provided in an extra.d folder named after the UKI, in the same location where your UKI is. And it's all naming convention, because last time I checked, systemd-stub was also checking the extension names and this kind of stuff, so you need to get them right. UKIs are
always located in this path, EFI/Linux; UKI names always end with .efi, addons end with .addon.efi, and UKI-specific addons, as I said, need to be located in the extra.d folder. Okay, so the next step is: what about revocation? Suppose that we as a vendor shipped a UKI command line addon and we signed it, and everybody's using it, and then we figure out the command line has an issue. Then what do we do? Because we signed it as a vendor, it's trusted. First solution: just change the certificate. But this is basically impractical; yeah, good luck with that; it messes up all the measurements and you invalidate all the addons. Second solution: try to create a blacklist on the cloud provider; this is impractical. Third solution: attestation, checking if the hash matches the addon that you don't like anymore. And the last solution is SBAT rules. So what is SBAT? It's basically another PE section inside the UKI, or the addons for example, and it contains component generations and other information. The key part is the component generation table, because the same table should exist inside your shim, and we are at component level: for example, every PE section should have its own component generation, one version for the Linux component, one for the addon, and so on. If the component generation matches what shim has, we accept it, but if the generation for the incoming addon's component is lower, then we have a mismatch, and even if the addon is signed by Red Hat or whoever, it will be rejected. This part is done by shim: when it verifies the addon, it checks the SBAT components and generations. Just an example to clarify this: in this case, shim has sbat version 1 and my-addon version 2, and the addon contains the same versions for sbat and my-addon, so it's good, it will be accepted; of course it has to be signed by somebody we trust. In the second case, the addon's
sbat version is correct, but its my-addon component generation is lower, which means that we don't accept it: even if it's signed by someone we trust in the SecureBoot database, it won't be accepted. One open problem is combining addons: if you have two separate addons whose command lines are each safe on their own, but together create a security issue because they enable something that we don't like, how do we solve this? To be honest, as of now I couldn't come up with a concrete example of this, and one solution would be to use attestation to see if they are both there. Now, talking about systemd sysexts as initrd addons. Systemd sysexts already exist, they are already widely used; what is new is that you can also use them for UKIs. For those who don't know, a system extension image extends the base system with an overlay containing additional files, so you can extend the base system, and systemd-stub also provides the possibility to use this to extend the initrd inside the UKI. It's more or less the same concept as the command line addons; you just use different tools, because they are different things: these are not PE binaries with PE sections, they are system extension images, and mkosi is used instead of ukify. But, for example, the location where you put them is the same. The workflow is again more or less the same: you create a systemd sysext extension, you put it inside the extra.d folder, and it must be a raw file; the only difference is that systemd-stub will take the addon and put it inside the initrd's extra sysext folder, where systemd-sysext will then load it and apply it to the initrd. Yeah, who can use these addons? The use cases are various; there are three groups of users. The vendors, for example Red Hat: we want to ship a debug kernel command line for the UKI, so we ship our addon. Then there are the virt host admins, who can use host-side tools like virt-firmware or
whatever to modify these kinds of variables, more or less the same use case. And the guest admins can use guest-side tools like MOK to enroll their key in SecureBoot, even though this is a little bit tricky in the cloud: on Azure it's basically impossible to add a key via MOK, because when the machine reboots and you connect with the shell, you skip the MOK reboot section where it asks you to confirm your key. Available tools: systemd has a lot of tools, and ukify is the main one; across versions it gradually gained support first for building addons and then for inspecting them. I also sent a PR to extend bootctl to find addons and display, as a preview, what the full kernel command line will be; so if any systemd maintainers are around, reviews are welcome. There is mkosi to create a systemd sysext image, and then we have uki-direct for Fedora: there is kernel-bootcfg, with which you can add, update and remove UKIs, and we also added kernel-addon, which does the same thing for UKI addons. And the future work, what are we planning to do next? Maybe an RPM: the vendor ships an RPM with a collection of generic addons that we want to ship, signed by the vendor. But of course we don't want to pollute the ESP with addons that the user doesn't need, so there was agreement upstream on two locations, /usr/lib/linux/extra.d for global addons and another one for UKI-specific addons, where the RPM should install these addons. Then, when the user needs them, they can simply use kernel-addon, or just copy the addon, for example the one we as developers ask them to use for debugging the UKI, into the ESP, reboot, and it will be there. On the cloud side, if clouds want to allow users to upload their own UKI addons, there needs to be a way to inject the owner certificate, otherwise you cannot do it. There is also a bit of an issue with measurement here, because when you add the
user certificate, it has to be measured, into PCR 7 especially, and the solution we found is to simply add a dummy addon before performing attestation, so that the certificate is part of the keyring and it will be measured. On-prem it's more or less the same; for us that means libvirt, where we want to offer the same possibility to upload the certificate for SecureBoot, and there is already a way to add the dummy addon. So that's it for my talk; if you have any questions, here or outside, thank you. Yes, please. [Audience] So my second comment is on all of the add-ons. Right? Because there you can trust the UEFI SecureBoot mechanism, whereas in a confidential computing environment you cannot use that today; I'm not aware of any stack right now that gives you a trustworthy UEFI SecureBoot environment. That means you need another mechanism to do that measurement for a confidential computing environment, and the most natural path for that is to use the launch digest. Because with launch measurements, you need to know ahead of time, when you boot the VM, all of the data that you need to launch, which means you need to have the UKI ready and available, including all the add-ons. At which point we've gone full circle: I think we are much better off just building a separate UKI for that one set of configuration you're doing, so you can attest that you're actually running that set of configuration. You don't want your debug add-on in your production fleet; that is something you want to prevent aggressively. So I think the most natural mechanism here is to go and build a separate one-off UKI, even if it's made out of add-ons, if you want to. Okay. Okay, thank you. Okay. Thank you. [Audience] We cannot do revocation only with the firmware; the firmware cannot support a revocation mechanism outside of the DBX, and the DBX has both space problems and problems around that. If you have a lot of space, if you ditch the Microsoft solution, don't use the Microsoft solution. Thank you.
Bye. We know how it ends. Guys, you are more than welcome to present next year if you want.
From Virtualization Platform to Hybrid Cloud Solution: A Hands-On Account
So, good afternoon everyone, and thank you for joining me today. My name is Bello and I'm a software engineer at Red Hat, and over the past year I had the opportunity to be part of the Forklift team and take it for a spin. So today I'm about to share with you our recent journey, and without further ado, let's jump in. In today's rapidly evolving world of IT, we can observe an increasing move away from traditional virtualization environments towards more hybrid cloud solutions, and at Red Hat we are not just observing this trend, we're actively participating in it. Recently we had the opportunity to go on a journey of migrating from a well-established virtualization environment to a newer solution, and today I'm going to share with you some of the insights, challenges and benefits of such a transition. So, let's start by discussing these two very different solutions. Picture yourself on a journey through the IT computing landscape. Our first stop is oVirt. It's like an older, reliable train that's been running for years. oVirt is an open source product based on KVM technology, offering enterprises a cost-efficient way to manage their virtual workloads; it's an alternative to vSphere. But our journey does not end there. We then continue to the world of OKD. Picture it as a high-speed train whisking us to the future of cloud computing. OKD is also an open source project, based on Kubernetes, providing cloud computing capabilities alongside enhanced Kubernetes features such as added security, automation and a user-friendly interface, and it supports containers alongside virtual machines. When considering such a transition, it's important to take into account how it can be done. There are several paths we could take, each with its own set of advantages and challenges, but today I would like to focus on the main three. First, we can reprovision all the virtual workloads and start from scratch.
Even though this solution may sound pretty straightforward, it's both costly and time-intensive, and for complex workloads it's not always possible without risking data integrity and operational disruptions. Next, we can migrate all our virtual workloads into containers. With the use of the Konveyor project we can really reduce the cost here, but it's still not an easy task, and again we have the same issue as before: not all workloads can be containerized. So while this may be a good solution for certain types of applications, it's not suitable for everyone. And finally, what seems to be the best option is keeping our virtual workloads as they are, and migrating them to the new environment with the Forklift tool. That way we don't have to worry about any data loss, and with this tool we can have a simple and smooth transition. So, what is Forklift? Forklift is a tool designed to assist in migrating from traditional virtualization environments to Kubernetes-based environments, and it takes care of the entire migration process for us. It works alongside another project named KubeVirt, which provides the virtualization capabilities on top of Kubernetes-based environments; once Forklift migrates the virtual workloads, they will be placed on top of KubeVirt. Forklift, as a versatile tool, supports a variety of source providers and source environments, as you can see in this list. Now I would like to take a deeper look at Forklift's high-level functionality. Forklift supports two types of source environments, KVM-based and VMware-based, and for both of them it takes care of the entire migration process. That means creating the disks, copying the data, and for VMware-based sources, converting the virtualization stack to match KubeVirt's requirements.
And, of course, finally creating the VM itself with its original setup to run on top of KubeVirt. The use of this tool makes for an easier and smoother transition to the new environment. Now that we've finished discussing these different solutions and approaches, let's dive into the specifics of our own migration from oVirt to OKD, where Forklift served as a crucial tool in facilitating this migration. I would like to start with a little background on why we decided to go ahead with this transition in the first place. Our oVirt environment had been in use for more than a decade, supporting hundreds of virtual machines with diverse usage, some for production while others for development and testing. While the fact that oVirt was reaching its end of life wasn't the main reason we decided to go on this transition, it certainly nudged us in this direction. Moreover, we wanted to take this opportunity to reallocate some of our resources and remove underutilized workloads, while causing as little interference to the users as possible. Taking all this into account, the shift to OKD seemed to be the most reasonable, fitting choice. As in any success story, planning is essential, and our migration was no exception. We started our journey with an in-depth analysis of our current environment, to understand the migration requirements and what exactly we needed from this transition. We then continued with a resource evaluation: we had to make sure that our target environment would have enough resources to accommodate the incoming workloads in terms of compute, storage and network. And finally, we had to create a clear timeline to make sure that each step of the way was well known and everyone involved, from users to IT teams, was in the loop on this transition. So, now I would like to zoom in even more on the preparation step and focus on the resource allocation.
So, we had to start by finalizing our VM list for migration. When we thought about what the criteria for a VM to be eligible for this transition should be, we decided to proceed with actively used VMs only, and had close conversations with their owners to understand their specific needs. After that, we had to calculate the storage and IP addresses of all the VMs on this list to make sure that our target environment would have enough resources. This step was more than just technical preparation; it was essential to ensure that once the migration started, we wouldn't have additional downtime due to lack of resources. And last, we had to come up with a way to reflect our original ownership and access model from the oVirt environment in OKD. So, with a well-laid plan and a tool like Forklift at our disposal, you might think this migration was going to be a walk in the park, right? Well, not quite. As we started our journey, we discovered that the path ahead of us was going to be quite challenging. Now I would like to share with you some of the obstacles we encountered and how we tackled each of them to keep our migration on track. The first challenge was regarding the VM selection. As I mentioned earlier, we wanted to continue with only actively used VMs. That required us to analyze the VM usage patterns and understand which VMs were actively used during a specific time period, a task that proved to be quite challenging. Then we had to gather information about these VMs, such as disk size, network and ownership, and that task turned out to be quite demanding as well, both in complexity and in time. And lastly, our two environments had different provisioning models: oVirt was more admin-driven while OKD is more user-driven, and we had to come up with a way to bridge this gap somehow. So, to overcome these challenges, we went ahead and developed Python scripts specifically for the migration process.
They can be broken into two categories. The first, based on the oVirt SDK, was mainly used for finalizing the VM list for migration and for data gathering, such as disk size, IP allocation and ownership. The second sort of scripts were based on the Kubernetes API, and they were used for creating the namespaces on the target environment and for assigning the appropriate roles to the users. We also uploaded the scripts to our GitHub repo, so they can be used as a blueprint; if anyone wants to take a look, you're more than welcome. Now I would like to focus on a specific issue we had and walk you through the different stages we took to solve it. As I mentioned earlier, our two environments had different provisioning models. Our oVirt environment was a more centralized model, where an admin had full control of the environment, managed all the resources and created new VMs. Our OKD environment, on the other hand, is more user-driven: users have the freedom to manage and create their own resources within their namespace, and the namespace resources are capped by predefined quotas. To bridge this gap, we decided to create new namespaces on the target environment and place in each one of these namespaces all the VMs shared by the same users. By giving those users admin access, we made sure that each user would retain their original permissions. Let's clarify it with an example. Say that after we finished finalizing our VM list, we ended up having four VMs for migration. As you can see, on the new environment we created three new namespaces, and in each one of them placed all the VMs shared by the same users. So Bob and Alice, who shared two VMs, will now have a shared namespace, with both having admin access to it. And Bob ended up having three projects assigned to him, which really reflects the diverse usage in the original setup. So, now I would like to guide you through the script we used for this mapping process.
So, the first part is based on the oVirt SDK, and we did the following. We started by creating a list that mapped all the VMs to the users in the system. Then, based on information from another script, we removed all the admin and system users from that list. Then we created a dictionary that mapped sets of VMs to all their corresponding users. And based on this dictionary and the Kubernetes API, we created a YAML file. Here we can see the set of actions for one set of VMs: we started by creating the new namespace on the target environment; then we created an admin role that grants full permissions on all the resources under this namespace; and finally we created a role binding that binds a specific user to that admin role. By that, we made sure that each user would retain their original access to their resources. So, now that we've finished with the planning and preparation phase, let's dive into the migration execution. Our first step was to deploy Forklift. Forklift can be installed from OperatorHub and is managed by the Operator Lifecycle Manager. In our case we decided to install it on the same cluster as the target one, but it can also be deployed on a different, remote cluster. Next, we had to create a new namespace to hold all the migration resources, including providers, the different mappings and the plans themselves. It's important to note that the user used to create the namespace should have sufficient permissions on the migration resources. Next, we had to create the target and source providers; each provider represents the environment we're migrating from or to. Once we deploy Forklift, a new tab named Migration appears in the console, and from there we can manage all of our migration resources, including the addition of new providers. So, we started by creating the source provider, and here we chose Red Hat Virtualization, which is the downstream name for oVirt.
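The namespace/role/rolebinding generation step described above can be sketched in Python. This is only an illustration of the idea, not the team's actual script: the namespace name, users and VM names are hypothetical, and the real scripts built the VM-to-user dictionary from the live oVirt environment via the oVirt SDK.

```python
# Sketch of generating the Kubernetes manifests (namespace, admin role,
# role bindings) for one set of shared VMs. All concrete names are made up.
import json

def manifests_for_vm_set(namespace, users):
    """Yield manifests granting each user admin access to one namespace."""
    yield {"apiVersion": "v1", "kind": "Namespace",
           "metadata": {"name": namespace}}
    # Admin role: full permissions on all resources under this namespace.
    yield {"apiVersion": "rbac.authorization.k8s.io/v1", "kind": "Role",
           "metadata": {"name": "ns-admin", "namespace": namespace},
           "rules": [{"apiGroups": ["*"], "resources": ["*"],
                      "verbs": ["*"]}]}
    # One role binding per user, so everyone keeps their original access.
    for user in users:
        yield {"apiVersion": "rbac.authorization.k8s.io/v1",
               "kind": "RoleBinding",
               "metadata": {"name": f"ns-admin-{user}",
                            "namespace": namespace},
               "subjects": [{"kind": "User", "name": user}],
               "roleRef": {"kind": "Role", "name": "ns-admin",
                           "apiGroup": "rbac.authorization.k8s.io"}}

# Dictionary mapping a set of shared VMs to their users, as in the talk's
# Bob-and-Alice example.
vm_sets = {("vm1", "vm2"): ["bob", "alice"]}
docs = [m for vms, users in vm_sets.items()
        for m in manifests_for_vm_set("bob-alice-shared", users)]
print(json.dumps(docs, indent=2))
```

In a real run these documents would be serialized to YAML and applied to the target cluster.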
We then had to fill in all the information about this environment so Forklift would be able to connect to it. Here it's important to use a user that has sufficient permissions on the VMs we're about to migrate, or else the migration will fail. In our case, since we were dealing with a scale migration, we went ahead and used an administrator account. Next, we created the target provider. Here we chose OpenShift Virtualization, which is the downstream name for OKD. Here we only need to fill in the name, and all other information is filled in automatically. Next, we had to create our network and storage mappings. Once the migration starts, Forklift needs to know how to redirect the incoming workloads in terms of VLANs and storage classes, and these mappings tell it how to handle the incoming workloads. Here we can see our network mapping, with the new VLANs we created for our migration needs. And here we can see the storage mapping and the storage class used for accommodating our incoming workloads. Finally, with the use of a script, we had to create our migration plans. Each plan holds inside it all the VMs that are about to be migrated to the same namespace, meaning the ones used by the same users. Once we were ready we triggered, again with the use of a script, all the migrations, and the migration started. As you can see, it can also be triggered from the console, but since we were handling a scale migration, we automated this process. Now, I would like to have a quick overview of the steps we took and add some additional information. We started by deploying Forklift and setting up all the custom resources for the migration. Then, with the use of scripts, we automated all the plans and the migrations. In our case, we decided to go with cold migrations, meaning that during the transition the VM is shut down, because that best suited our needs.
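A migration plan like the ones scripted above is just a Kubernetes custom resource. The sketch below shows roughly what one might look like, built as a Python dict the way the talk's scripts did. The field layout follows the `forklift.konveyor.io/v1beta1` API as I understand it and should be checked against your Forklift version; all names (provider, mappings, namespace, VMs) are invented for illustration.

```python
# Assumed shape of a Forklift migration Plan CR: one plan per target
# namespace, i.e. per group of VMs shared by the same users.
plan = {
    "apiVersion": "forklift.konveyor.io/v1beta1",
    "kind": "Plan",
    "metadata": {"name": "bob-alice-plan", "namespace": "migration"},
    "spec": {
        "warm": False,  # cold migration: VMs are shut down during transfer
        "targetNamespace": "bob-alice-shared",
        "provider": {
            "source": {"name": "rhv-provider", "namespace": "migration"},
            "destination": {"name": "host", "namespace": "migration"},
        },
        "map": {
            # References to the network (VLAN) and storage-class mappings
            # created beforehand.
            "network": {"name": "vlan-map", "namespace": "migration"},
            "storage": {"name": "storage-map", "namespace": "migration"},
        },
        "vms": [{"name": "vm1"}, {"name": "vm2"}],
    },
}
print(plan["spec"]["targetNamespace"])
```

Triggering the migration then amounts to creating a Migration CR that references this plan, which is what the scripts automated instead of clicking through the console.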
Warm migration, on the other hand, keeps the VM operational during the migration, but it leads to longer migration times, because the data has to be constantly backed up to keep the VM operational. During this transition we also monitored and troubleshot the entire process, just to make sure we were on track. And once the migration was over, we chose some VMs at random and tested that they were up and running, and then waited for user feedback. So, although we eventually had a successful migration, we did encounter some issues during it. The first two issues were related to the fact that we had a lot of migrations running simultaneously. That caused both storage and network strain, and eventually led to longer migration times than we originally anticipated. Another issue we encountered caused some of the migrations to fail, and after some investigation, we realized it was related to a bug in our codebase. We released a fix, with which we were able to migrate all the VMs, and it was included in the next version of Forklift. And finally, since downtime was involved, we had to keep communication clear and make sure everyone was in the loop on what was happening. Once we started receiving user feedback from the field, it was clear that we still had some issues to solve in order to make this transition fully successful. The first was related to boot order: VMs with multiple disks were not booting from the right one. We addressed this issue manually, and later discovered it was caused by another bug in our codebase, which was fixed in the next version of Forklift. The second issue was related to the new VLANs we used. That caused our FQDNs to change, and the workloads inside the VMs were no longer accessible. So we had to update our DNS records, and the users had to adjust the FQDNs inside their workloads to use the new ones.
And after that, all the workloads were accessible again. So, as we're reaching the end of today's journey, I think it's a good point to reflect and draw some conclusions. Overall, we had a successful migration: we were able to migrate more than 100 VMs and copy 12 terabytes of data. We were mainly able to achieve this result through thorough, in-depth preparation and planning, and we realized how crucial that is for a successful migration. Another takeaway is that, while each migration process can be different and run between different environments, we do see some common ground and best practices that can be applied to similar journeys. And finally, and probably most important: even though Forklift is a really powerful tool with great capabilities, it cannot facilitate a migration on its own, and additional steps, such as the use of scripts and thorough preparation, are required. As we wrap up today's session, I would like to extend my biggest gratitude to each and every one of you. I hope that today's session will be valuable for people who want to go on the same journey. I wasn't able to cover all of today's topics in detail, but we posted a blog post about this, so whoever wants more information, you're more than welcome to take a look. And that's it. Questions and insights? Thank you. Yeah. How did you handle notifying the VM owners during the process? Did you automate the notifications? Yeah, we had... Can you repeat the question? Sorry, for the streaming, so people watching can hear it. So, did we automate the process of notifying the VM owners? In our case, we had a VM list that included all the owners in this environment, and we shared a spreadsheet that included all the VMs that were eligible for migration.
And then we asked the owners to let us know if they wanted to migrate their VMs, because there were people who decided to move to different environments or didn't need the VM at all. Based on this information, we also built our final migration list. So, yeah. Yes. Hi, could you please give us some examples of what issues you had during the migration steps? Yeah, I'll give an example of a boot issue we had after the migration. We had a lot of VMs with multiple disks, and when you try to boot from a disk that doesn't have the operating system on it, the boot fails: you see just a black screen, and the OS is not found. We understood that it was probably not booting from the right disk, because we saw there was another one, and once we manually changed that, we saw that it solved the issue. So we adjusted this manually for all the VMs in the migration list, and after that, as I said, we released a fix in the next version. Yeah. Hi. Hi. Is this tool also performing some kind of pre-flight check over the plan? I don't know, checking that you have enough space on the target storage class, or checking that the VM you selected isn't exposing particular devices that could make it fail in the middle or at the end? So, the question was whether we do some verification to make sure that we have enough space on our target environment, or enough devices and compute. We do have a set of validations on our plans, but these particular ones are not included; we check more things like names matching Kubernetes requirements and security concerns, not something like that. Yes. You mentioned 12 terabytes of transferred data. I was in a presentation yesterday talking about PCCOP, and they were talking about validating that all the data was copied correctly over a large database migration. Did you do something like that? Because they were saying it's quite a hard problem; over that much data you might get corruption.
So, the question is whether we do some validation that the data is copied correctly. It depends on the source environment, but we do use some external tools for that, and these external tools are supposed to make sure that all the data is copied correctly. It really depends on the source environment you're using, because there are different flows between the different environments. But with the tools we're using, for VMware, for example, we're using virt-v2v, which takes care of this check. For oVirt and OpenStack we're using ImageIO, so it's taken care of under that tool. Okay, so if anyone wants to ask a more specific question, feel free to approach me outside. Thank you.
Making VirtIO sing - implementing virtio-sound in rust-vmm project
Hi everyone, my name is Dorin de Basse and I work at Red Hat. I currently work on enabling the audio stack and other features in the automotive team. And with me here is Matthias. Hello everyone, I'm Matthias. I also work at Red Hat, in the automotive and virtualization teams, and I'm going to talk about the virtio-sound implementation we did last year and this year too. Okay, so in this presentation we'll be talking about making virtio sing, and we'll focus on the implementation of virtio-sound in the rust-vmm project. Just a brief outline: I'll be talking about the automotive use case, I'll go through the virtio-sound device and the driver, and Matthias will take care of the vhost-user design and implementation, the audio backend architecture and the upstream status. Okay, so let's get right into it. One might ask: why virtio-sound? Our main use case is the automotive industry, and in automotive, Android guests are being used for deploying infotainment systems. In order to support these Android guests, the virtual machine monitor, in our case QEMU, requires a set of virtual hardware like virtio-sound, virtio-net and virtio-gpu. Having a virtio-sound device emulation would allow Android to be deployed in the different virtual machine monitors that currently support virtio device emulation; examples of these VMMs are QEMU, crosvm and the like. The Android reference platform, which I linked on the slide there, defines a set of virtio interfaces that are expected from any VMM that runs Android. So, based on our expectations for QEMU/KVM as the hypervisor, we decided to close the gap, which involved enabling the virtio-sound device emulation as an external process. So now QEMU, or any other VMM that implements the vhost-user protocol, can interact with this user-space application. But before showing you how we built this device, let's present to you what the device is.
So the VETAIO sound device is a parametriolized sound device and is based off on the VETAIO specification standard. It's consisting of the VETAIO driver, the PCI bus transport and the VETAIO sound device. And this is an architectural view of what the sound stack looks like. And I will show you how the different VETAIO components come together. So first we have the user application in the guest that's interacting with the driver using a set of SISC calls and common user space libraries, such as, take for example the ALSA library in the case of a normal application in the guest or tiny ALSA library as in the case of an Android application. And then the VETAIO sound driver on the other side takes the information that it received from the guest user space and shares it over a transport method. And in our case is the PCI bus. Now this PCI bus is a way to expose the VETAIO sound device to the driver that's in the guest. And the VETAIO sound device, just like any user space application that's running in your host, it sends the audio streams to the host sound drivers and the necessary sound libraries and the E-mone would route it to the host, to the sound driver that's running in the host canal space. So I mentioned something about the VHUCHESA protocol in the previous slide. So what is it? The VHUCHESA protocol is a set of messages that has been designed to offload the VETAIO data part processing from QEMU to a user space application on the host. And this user space process application is what's responsible for configuring the VETAIO rings and doing the actual processing. The VHUCHESA protocol actually uses communication over the Unix domain circuit. And it allows the control planes to initialize the shared memory regions and also exchange the file descriptors. The protocol defines two sides for communication. We have the front end and the back end. For the front end, we have it sending the message request while the back end is sending the message replies. 
The protocol also implements the control plane for establishing virtqueue sharing between the guest and the user-space process, and this user-space process utilizes the vhost-user library. I attached an example here of what a vhost-user protocol exchange looks like: the front end sends the virtqueue memory layout and configuration to the back end, and you can see the message outputs in hex format. An example of one of these messages is the VHOST_USER_GET_FEATURES message, which expects an acknowledgement reply; but not all messages from the driver expect a reply from the back end. We mention here the sockdump tool, a tracing tool that can help while you're debugging, in case you want to play around with the protocol messages. It dumps the socket traffic between the front end and the back end; you use it by passing the path of the socket and also specifying a format. Maybe you want the output in hex; sockdump can also provide the output in pcap format if you want. The virtio memory region, which is the guest memory here, is initially allocated by the guest, and in QEMU this is done with the mem-prealloc option and a shared memory backend. This memory region is mapped by both the front end and the back end using the mmap syscall, and it is accessed through the file descriptors passed for mmap. Okay, so what happens during device initialization? We have the feature-bit negotiation that goes on there. During this initialization, the device and the driver both have feature bits that need to be negotiated. At this point the driver reads the feature bits that the virtio-sound device exposes to it, and then the driver tells the device: OK, I only support this subset of features, or I do not accept this set of features.
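To make the hex dumps above easier to read, here is a small sketch of decoding a vhost-user message. Per the vhost-user specification, each message starts with a 12-byte little-endian header of request (u32), flags (u32) and payload size (u32), followed by the payload; `VHOST_USER_GET_FEATURES` is request 1 and its reply carries a u64 feature bitmap. This is only illustrative decoding, not code from the project.

```python
# Decode a vhost-user message header (request u32, flags u32, size u32, LE).
import struct

VHOST_USER_GET_FEATURES = 1
VERSION_MASK = 0x3   # lower two flag bits carry the protocol version (0x1)
REPLY_FLAG = 1 << 2  # set on messages travelling back to the front end

def parse_header(data: bytes):
    request, flags, size = struct.unpack_from("<III", data, 0)
    return {"request": request,
            "version": flags & VERSION_MASK,
            "is_reply": bool(flags & REPLY_FLAG),
            "payload_size": size}

# Example: a GET_FEATURES reply whose payload is a u64 feature bitmap.
raw = struct.pack("<III", VHOST_USER_GET_FEATURES, 0x1 | REPLY_FLAG, 8)
raw += struct.pack("<Q", 1 << 29)  # e.g. the VIRTIO_RING_F_EVENT_IDX bit
hdr = parse_header(raw)
print(hdr)
```

This is essentially what a tool like sockdump does before pretty-printing the traffic.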
So take a example, when we have the VETL ring event IDX feature, when it's been negotiated, it would allow the device to control how the notification from the driver should be handled. And we have other features like the indirect descriptor feature. And this one thing to note about the VETL sound driver is that it doesn't have any specific features that are currently defined. So it uses a generic feature bit set of the VETL device. And there are a couple of other driver requirements for this feature bit negation, which you can find it in the VETL specification link. So in a nutshell, a VETQ is a queue of guest allocated buffers. And this VETL sound driver is consistent on four VETQs. We have the control queue, the event queue, the TX queue and the RX queue. And each of these VETQs are consistent of three parts. So first we have the descriptor table. And the descriptor table is occupied the descriptor area. We have the available ring, which is occupying the driver area. And we have the used ring that's occupying the device area. So to further explain how the VETQs are being mapped in the driver and the device, take for example, we have the user application that's running in the guest. It would notify the driver of the audio streams that needs to be processed through the corresponding libraries and interfaces. And when the driver wants to send a buffer to the device, it fills the descriptor table with the M-Mapped buffer and writes that descriptor index into the available ring. Now after writing it, it has to notify the device of those available buffers. So it would notify the device saying, hey, I have some buffers that need to be processed. Now, depending on the buffer size, it could create a descriptor chain, which it would always because of the sound buffers are usually a lot of them. So for the device side, when it's done consuming these buffers, it would write the descriptor index into the used ring and send a used buffer notification to the driver itself. 
Now, in the past, this was not how the driver worked when the user application sent messages to it, because it was unable to determine when a buffer had been updated by the user application running in the guest. Some of our upstream contributions were to ensure that the acknowledgement callback is used to notify about updated buffers and to prevent the reading of stale buffers; thanks to Matthias for some of those contributions. Now let's see how the requests are processed for each of the virtio-sound virtqueues. The control queue is used for sending the control messages from the driver to the device. These control requests are translated into vhost-user requests and forwarded to the back end for processing; the device then responds to these messages indicating the status of the operation. The event queue is used for sending notifications to the driver, but in our current implementation we don't use it, because it's not necessary. Then we have the TX queue, which is used for sending the PCM frames for output streams; the TX queue is used for playback. It carries the PCM frames initiated by the driver and also the replies to the frames previously received from the device. The RX queue is used to receive the PCM frames for input streams, and this is used during capture. The RX queue carries the PCM frames initiated by the device and also the replies to the previously transmitted frames. So I'll let Matthias take over. So now I'm going to talk about the vhost-user implementation. The vhost-user implementation is split into the front end and the back end, which communicate using the vhost-user protocol, as Doreen explained before.
So for the front end, we based on the word from Alex Benet from Linario that simplified the boilerplate code in Kimu, which is common for all the VHOS user devices. So if you want to see this work, I leave the patch set there. Then for the backend, we decided to implement it under the RASP-MM project in the VHOS device repository. And the benefits of doing that are the following. So for example, we show the device implementation between multiple virtual machine monitors like Kimu or cross-PM. And we use RASP as our main language. So we leverage the features that this language have. Also the process that emulates the device runs separately from the Kimu. So that's reducing the attack surface of Kimu. And also the current implementation has less context which that, for example, the Kimu built in device. And I leave you the link to the script that you can use if you want to try it, you compare. And also you have the link to the RASP-MM project. You can look for the implementation. So now let's see how the backend is designed. So basically the current implementation is made of a device and the audio backends. The audio backends implement the driver for different libraries like PyWear or ALSA. And the whole backend is implemented by a single thread. And current implementation has called the number of strings. So we have only one for input and one for output. So when a new request comes from the guest, depending on the queue in which the request arrives, we're going to have different handler. And depending on the queue, the semantic of how we handle that request change. So I'm going to talk about that a bit. So for example, for the control queue, when the driver's in a request, what we're going to do is just to process that request immediately. So for example, we're going to pass the request and depending on the control message, we're going to call a different method. 
What we use here is a generic interface, so anyone can write a driver for the audio back ends, because they share the interface. After processing the request, we immediately notify the guest that the request has been processed; in this case, the methods in the interface are not blocking. In the case of the transmission queue, when a request arrives from the guest, as Doreen said before, it's because we're doing playback, so we're going to reproduce some sound on the host. The way we process that request is by just picking it up, that is, storing a pointer to the request and putting it in a FIFO queue, which is per stream. At some point the worker wakes up, pops the queued request and processes it. Here we have to make sure that we consume all the payload the request has, or at least fill the buffer the audio engine proposes, because otherwise the worker thread wakes up more often and we don't use the whole buffer the engine has for playback. So we have to be sure that we consume at least the whole period. In this case, for transmission, we notify the guest only after consumption. We have to wait, because otherwise we could make the user application run out of data; the spec says that we have to notify just after consumption. The reception queue is almost exactly the same as the transmission queue. The only difference is that for the transmission queue the payload has data to reproduce on the host, while for the reception queue we have data on the host that we want to send to the guest for capture. So when we pop requests, we use that space to fill in data from the host and then send it back.
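The per-stream FIFO plus worker model described above can be sketched as follows. This mirrors the design, not the real Rust code in vhost-device: the period size is a toy value, and "playing" the audio is simulated.

```python
# Sketch of the TX path: requests are queued per stream, a worker pops
# them, consumes the whole payload (at least a full period), and only
# then notifies the guest, so the guest app never runs out of data.
import queue
import threading

PERIOD_BYTES = 4  # toy period size; real backends use the stream's period

tx_fifo = queue.Queue()
notified = []  # stands in for used-ring notifications back to the guest

def worker():
    while True:
        req = tx_fifo.get()
        if req is None:  # shutdown sentinel
            break
        # Consume the whole payload, period by period (here: pretend to
        # play it), before signalling completion.
        consumed = 0
        while consumed < len(req):
            consumed += PERIOD_BYTES
        notified.append(len(req))  # notify only after full consumption

t = threading.Thread(target=worker)
t.start()
tx_fifo.put(b"\x00" * 8)  # a queued TX request with 8 bytes of PCM data
tx_fifo.put(None)
t.join()
print(notified)
```

The RX path would be symmetric: the worker fills the request's buffers with captured host data before completing them.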
So if you want to try it, as I said before, you have to launch two processes. One is for the emulation of the device, and this is the command line you use, up there; for example, the back end you want to use, in this case, is PipeWire. The other command line is for QEMU, and the only parameter you have to take into account is the Unix socket you're going to use to communicate with the daemon. So, I would like to mention some of the additional work that this required. For example, we fixed the virtio-sound driver because it was not respecting the virtio specification; that is what Doreen mentioned before. We have also been working on the spec to make it clearer, so we upstreamed some patches to the virtio spec. Other work we did was to add the descriptor-utils module to the virtio-queue crate; it was in virtiofsd before, and we moved it to the virtio-queue crate so anyone can use it. The point of doing that is that you cannot make assumptions about the way a request is distributed over descriptors: the guest can use any distribution of descriptors it wants, because the spec doesn't say how to do it, and we have to be independent of that. There were also the patches to add the generic vhost-user device, which reduce the boilerplate code you have to put in QEMU for vhost-user devices. And there was a lot of development in the PipeWire Rust bindings crate; for example, we added new modules and ring-buffer support, and there was a lot of bug fixing that we did during this work. So yeah, we're getting to the end of the presentation. If you want to get in touch, feel free to participate in the vhost-device project; we also have a Slack channel for virtio-sound if you have any questions.
And we also submitted a proposal for Google Summer of Code: we are trying to add a new audio back end for GStreamer, so if you're really interested in participating, feel free to submit your candidacy for that. And if you have any questions, feel free to contact us directly; we have the email here. So yeah, that's all, I think. So I think now we go to questions. The question is what happens if you want to use it. When you launch the first program, it launches the device emulation, and then you launch QEMU. Then, for example, if you're in the guest and you want to use it, you use, for example, speaker-test or aplay or something like that, and then you'll hear something on the host. So, yes. But what if nothing is happening? What happens when you use the null back end? So she's asking what happens when we use the null back end. It's clean: no audio, it doesn't use any library. Whereas the PipeWire back end would use the corresponding PipeWire libraries, and the ALSA one the ALSA libraries; with null, nothing. Okay. Sorry, I missed the question. Can you disclose some car brands that are using this? Can we mention some brand that is using this implementation? No. Can I ask why you chose to implement this in Rust? Okay, he's asking why we chose to implement this in Rust. So, as you all know, because of Rust's design, its safety features and its memory usage, we chose to implement it in Rust. I can complement that a bit: the rust-vmm project already existed before, so a lot of things made it quite easy to implement the device, because we could reuse many things. For example, working through the virtqueues and notifying the guest was already all in that project, so for us it was just implementing the parsing of the requests.
But for example, the virtqueue handling was already there, and that also made it easier to implement. Yeah, that's it. Maybe it's a bit out of scope, but have you made any benchmarks compared to fully virtualized audio devices? What's the overhead of using this compared to one of the audio devices already existing in QEMU? Okay, so he's asking what is the benefit of using this audio device in comparison to the other audio devices in QEMU. So regarding the PipeWire backend, PipeWire provides low latency and also low CPU and memory usage. And using it in the audio backend, we did some latency benchmarks; you can look up in the PipeWire wiki how to do these latency benchmarks. You could also check CPU cycles and context switches, and also latency. So that's, yeah. I think we compared it with the QEMU built-in device, for example, and it looked like there were fewer context switches for the user application in the guest. Yeah. One of my colleagues is also developing a sound device, but a completely different one; I don't think I'm going to go into details. He said that the way the virtio-sound specification is written doesn't allow a proper implementation of the device reset functionality. So I just want to ask if you've had any trouble with device resets, or I'm just curious how you've handled that. So the question is that the virtio spec, or rather the virtio-sound part of it, doesn't describe the reset method very well. That's it. There are some conflicts in the spec. We didn't have that issue yet, at least. And now I'm trying to remember if we have any function called reset or something like this, but we don't. So maybe we can talk offline if you want. Any more questions? Thank you. Thank you. Thank you.
Exercising QEMU generated ACPI/SMBIOS tables using Biosbits from within a guest VM.
Thank you. Thank you. Good afternoon, everyone. Thanks for coming to my talk on using Biosbits to test QEMU's ACPI and SMBIOS implementation. My talk is going to be structured around these four points. First, we're going to discuss what Biosbits is and why we're using Biosbits to test QEMU. Then I'll be talking about some of the implementation choices of my test framework. Then I'll describe the test framework itself. And then I'll give a brief overview, depending on how much time I have, of the changes that I made in Biosbits to get everything working together. So what's Biosbits? It's actually software written by Josh Triplett. He wrote this software after he left Google. And the software had real-life usefulness, in the sense that the BIOS developers at Intel used it to test their BIOS implementations on real physical hardware boxes. What this software comprises is that you can exercise ACPI and SMBIOS objects in the BIOS directly from a GRUB environment. And even though it's a GRUB environment, it also has Python built into it. So you don't have to write tests using Bash-ish, which is GRUB's native scripting language; you can write all your tests using Python. And all of this is executed from ring zero, so there is no need to actually go from ring three to ring zero to execute your tests, et cetera. All of the components, that is GRUB, Python, and ACPICA, which is what Biosbits uses to execute ACPI, come together in the form of a bootable ISO, which is then used to boot an actual physical box, or a virtual machine in our case. So this is what it looks like in its simplest form. You just run QEMU/KVM here, using the bits ISO, and it spawns a virtual machine. It executes a bunch of tests, generates the logs and pushes the logs out of the virtual machine (I'll describe that a little bit later), and then it shuts down the VM. So why use Biosbits for testing?
Well, first of all, like I said, all the tests you can write are written in Python, in a pre-operating-system environment. That means we don't have to go through an OS to exercise BIOS components; we can directly execute ACPI from the GRUB environment itself. And it already has ACPICA built in, so we can directly execute ACPI methods. And the current test framework that we have in QEMU, what it basically does is spawn a VM, extract the ACPI tables from the virtual machine's memory, and then compare those tables with some golden master blobs that are already checked into the QEMU repository. So it compares the golden master blobs with the actual tables that QEMU is using, and if there is a difference between the two, it throws an error. So the main idea is that any time we're making changes to QEMU that affect ACPI or SMBIOS tables, we can go through and inspect the changes, and we can make sure that the changes are not breaking anything. But what we don't have is the ability to actually execute the tables from a running VM. And using Biosbits gives us the ability to execute the tables. So that's the main advantage of using Biosbits. So let's discuss some of the implementation choices of the test framework. Biosbits is a piece of software in itself, so it has its own repository. And then we have the QEMU repository. In the QEMU repository, we have all the changes that basically define the ACPI implementation, and the Biosbits repository has all the Biosbits-specific stuff, like all the build scripts and all its internal logic, and the two things are kind of separate. Adding to the complication is the fact that Josh gave up developing Biosbits around circa 2017. And any effort that I made to reach out to him failed; he didn't respond to my queries. So we couldn't directly use the Biosbits upstream.
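The golden-blob comparison described here boils down to a byte-wise diff. A toy version (not QEMU's actual bios-tables-test code) might look like:

```python
def compare_tables(actual: bytes, golden: bytes) -> list:
    """Diff a table extracted from a running VM against a checked-in blob.
    Returns a list of human-readable differences (empty means identical)."""
    diffs = []
    if len(actual) != len(golden):
        diffs.append(f"length {len(actual)} != {len(golden)}")
    for off, (a, g) in enumerate(zip(actual, golden)):
        if a != g:
            diffs.append(f"offset {off:#x}: {a:#04x} != {g:#04x}")
    return diffs

# Identical tables produce no diff; a changed byte is reported with its offset.
assert compare_tables(b"\x01\x02", b"\x01\x02") == []
assert compare_tables(b"\x01\x07", b"\x01\x02") == ["offset 0x1: 0x07 != 0x02"]
```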
So what we had to do is fork the upstream Biosbits software, put it in GitLab under the QEMU project, and then make changes to it. And those changes involved a lot of build fixes. Biosbits turns out to be something that is not buildable with a newer compiler and toolchain, because nobody has been maintaining it. So we had to make a lot of changes just to make Biosbits build, and then a lot of fixes to get all the parts of the test framework working together, which I'll describe a little bit later. And then we have the QEMU repository that has, potentially, the changes that affect the tables. And the people who are actually making changes to the ACPI implementation in QEMU care about the QEMU repository; they don't know or understand the Biosbits repository. So now we have to decide how these two repositories are going to work together. One of the questions is: do we make the Biosbits repository a submodule of the QEMU repository? There has been a lot of discussion upstream on that, and it turns out that people really hate submodules, for a multitude of reasons. You can actually look into this thread upstream; it has a lot of interesting discussion as to why we don't want to have another submodule. So how do we keep the two repositories in sync with each other is an interesting question. And then, from the developer's point of view, whoever is making changes to, say, the ACPI implementation in QEMU, do we make them go back and forth between the two repositories? Say, for example, they make a change in QEMU that affects the tables, and they want to write a test for it. Do they go to the Biosbits repository, make the change, build Biosbits into an ISO, come back to the QEMU repository, point the test to the new ISO, run the test... oh, something doesn't work, it fails; okay, go back to the Biosbits repository, make changes, come back to the QEMU repository, and go back and forth? That's kind of complicated.
And developers don't like to do that, because they don't really care about Biosbits; they just want to add a test to exercise their changes. Right? Another question is what kind of test framework we use to write the Biosbits tests. Do we use the qtest framework, or do we use something else, like the Avocado integration test framework? Now, the existing test that I just described, the one that compares the blobs, is called bios-tables-test, and it's actually a qtest. And people are familiar with that framework, because any time people make changes to the ACPI implementation, that's the test that fails: it compares table blobs and right away fails, saying that you have these new changes in the tables, you'd better have a look at them. So people actually understand how bios-tables-test works. But do we use the qtest framework, then? The problem with that is that the qtest framework is really not written for something like spawning a VM, managing all the issues of VM management, collecting the logs, dealing with errors, shutting down the VM, et cetera. So I started writing a qtest for Biosbits, and then I realized that it's not really suitable. I then started looking into writing a new Python-based test framework just for doing the VM management and using Biosbits with it. And finally, when I proposed that upstream, somebody pointed me to the Avocado framework, and I looked at it, and right away, the Avocado framework already had all the libraries that deal with VM management. All I had to do was focus on the Biosbits part and develop that part. So the Avocado test framework really fit nicely into what we wanted to do and what was available already, without doing any new development. So finally, we went with the Avocado test framework. But then the question is, how do we make people familiar with how to run Avocado tests?
Because not all people are familiar with this test framework; not all people run integration tests. So then we decided, okay, how about we write documentation for the Biosbits test? And that's what we did. So the QEMU repository has documentation on how to run a few simple commands to execute the test framework. So I just described all this stuff; let's describe what the test framework is all about. I'll just skip a couple of slides and show you the diagram here. Like I said, there are two repositories: there is the QEMU repository, and there is the Biosbits repository. In the Biosbits repository, we want to maintain everything that's related to Biosbits and nothing related to QEMU or testing ACPI. The way we did it is that in the fork, which is residing right here, we have all these branches. Now, the qemu-bits branch is the one where we have made all the changes specific to using Biosbits for QEMU. And there we have a GitLab CI job, which is basically a Bash script that builds Biosbits. As a part of this CI job, every time you commit any change to the Biosbits repository, this CI job gets triggered, and it generates a bunch of build artifacts, which are nothing but pre-built binaries for things like GRUB, Python, ACPICA, et cetera. All these build artifacts are pushed to a well-defined location; there is a URL for it, and you can just go and download those artifacts. And in the QEMU repository, what we do is maintain the actual tests that exercise ACPI and SMBIOS tables. The actual tests are here, in this location, and they are run from within the Biosbits environment. And then there is a main driver to put all these things together, and this is the main Avocado test, acpi-bits.py. So when you are running the Biosbits ACPI and SMBIOS tests, you need to run this guy.
And what this guy does is pull in these changes, these test scripts, where you have potentially added new tests for your stuff that has gone into QEMU for ACPI. And then it pulls in these build artifacts, and together it generates an ISO here. Then, with this ISO, it spawns a QEMU VM and runs the tests. Once the tests have run, it collects the logs; the logs are pushed out of the virtual machine to a well-defined location. This test script then analyzes the logs, and then it says whether it failed or passed, depending on how many tests it ran; it looks for certain patterns and says, okay, this test failed, or what have you. So basically, this mechanism does two things. First of all, you don't need to go back and forth between the two repositories. Everything that is Biosbits-specific resides here, and if you're not concerned with Biosbits, if you don't care about how it is built or what changes are in there, you don't need to touch this repository. All you need to do is remain here. So every time you make changes to the ACPI implementation, you add the corresponding test code here, and then you run this guy. This guy will pull in your changes, use the existing artifacts, and run your test. Now, after it runs your test, it has a verbose mode where it puts out more information in case there is a failure, so you can analyze the failure, make changes to these test scripts, and rerun this guy. So the advantage is that you're already within the QEMU repository, in your workspace; you're not going back and forth between the two. And because pre-built artifacts are being used, generation of this ISO is a lot easier, because these things need not be built: they're already built for you by the CI job. All you need to do is put these test scripts together with this guy and generate the ISO. So this is what I just described, all these points here.
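The pass/fail decision from the collected logs is essentially pattern matching. A minimal sketch (the `PASS:`/`FAIL:` markers here are invented, not Biosbits' real log format):

```python
import re

def analyze_bits_log(text):
    """Scan the log the VM pushed out and decide pass/fail by pattern:
    count passing tests, collect the names of failing ones."""
    passed = len(re.findall(r"^PASS:", text, re.M))
    failed = re.findall(r"^FAIL:(.*)$", text, re.M)
    return passed, failed

log = "PASS: acpi FADT checks\nFAIL: smbios type4 field\nPASS: mcfg\n"
p, f = analyze_bits_log(log)
assert (p, f) == (2, [" smbios type4 field"])
```

The real driver can then exit non-zero whenever the failed list is non-empty.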
And then, let's look at the advantages, which I briefly described. So, no need to use submodules. There are pre-built artifacts, which makes it a lot easier. And if you need to make changes to Biosbits, you make the changes, build new artifacts, and you point the main test to the new artifacts. The other advantage is that when you release QEMU in tarballs, those tarballs do not have any Biosbits-specific binaries; they're completely maintained outside of the QEMU repository. So they're completely separate, and you don't need to release QEMU with any Biosbits artifacts. The disadvantage is that because we're using pre-built binaries, we are very architecture-specific. So right now we only support 64-bit x86, and it does not support any other platforms. And supporting other platforms is kind of non-trivial, because you need to make sure Biosbits can actually build for those platforms, right? And Biosbits was never tested on platforms other than x86, so it's non-trivial work anyway. And then there are tool dependencies to build the ISO, and the environment where you're running the test should have those tools available. So let's look at the overview of the changes that are in the Biosbits fork. Like I said, Biosbits was never maintained after 2017, so I had to make numerous changes to make Biosbits build with the latest toolchain and compiler, and the changes were across all these components. I also had to upgrade ACPICA, because ACPICA is the main driver that knows about the various tables, and if you don't upgrade ACPICA, you cannot write tests that use the newer tables. So I had to upgrade ACPICA. I had to find a mechanism to push the logs out so that the test framework can analyze the logs, and I had to make sure that the console logs are available.
And one other thing is that the Python that runs from within the Biosbits VM is still Python 2.7 and not 3, because upgrading Python is non-trivial work, and since it is a very closed, very controlled environment, I didn't see the value in upgrading Python there. So it is still running Python 2, whereas everything else in QEMU is Python 3. These are some of the useful resources, and you can have a look at them. They include things like Josh's presentation slides and his talk on Biosbits itself, which has a lot more detail than what I described about Biosbits in this talk, and then the details about the test framework itself, the fork that we maintain here, et cetera. Last but not least, before I talk about the demo, I would really like to thank these guys. Igor originally proposed the idea of using Biosbits for exercising QEMU's ACPI tables, and I'm grateful for that. And all these other guys gave various useful feedback throughout the process while I was submitting patches upstream, and I'm grateful to all the reviewers of my patch sets and the entire upstream QEMU community for the help. Lastly, if you really want to see a demo, there is no time for it in this presentation, but you can click on this link, and there is a video that describes in a lot more detail how to actually run the tests and all the scripts within the repository. So thank you so much, and now I can take questions if you have any. Yes. I have a question. What do you mean by Python? I mean, what is that Python? Is it just a copy of the built-in Python? No, no, it's Python; the interpreter is built from source. Within Biosbits, the Python is actually built from source. So Python 2.7 is the one that Biosbits uses, and it builds everything, because it has to build extensions so that it can integrate with GRUB. So from GRUB, you can actually say "py" and then you can run a Python script.
So all that happened because it was built from source with integration with GRUB. The only problem is that it's Python 2.7, and I didn't see the value in upgrading it to 3, but you can actually run whole Python scripts, and that's how all the tests work: they're all running from GRUB, but they're full-fledged Python 2.7 scripts. So it's a full-fledged one, not only a certain API that you can use? No, no, it's full Python. Any other questions? Thanks. Thank you.
One SDN to connect them all
Okay, so good afternoon. My name is Miguel Duarte. I'm a software engineer working for Red Hat in the OpenShift Virtualization networking team. In this talk we're going to be discussing an SDN solution for both types of workloads, so you can have pods and virtual machines in the same network, the use cases that this SDN provides, and a little bit of how it works. There are going to be some demos as well. So let's jump to the agenda. First thing we're going to do is explain the motivation, what drives us to do this and the actual problem we're trying to solve. From there, there's going to be a short introduction; how deep it goes depends on a few things. Then I'm going to walk you through the use cases for this SDN solution, show the demos, and finalize with the roadmap for the future and the lessons we've learned during this development. So first thing: how many of you have used or worked on stuff that has anything to do with Kubernetes? Yeah, pretty much everyone. How many of you use KubeVirt or know what it is? Well, more than I thought. Okay, cool. So the introduction is not going to be that deep. But yeah, let's start. First, we're going to discuss the Kubernetes networking model. As most of you will know, it's very simple, and one of its few premises is that any pod that is deployed on the Kubernetes cluster can contact and reach any other pod in the Kubernetes cluster. Basically you have cluster-wide communication between whatever types of workloads are deployed in your cluster. Without NAT, by the way. Another thing you get as a byproduct of that is that it pretty much configures a way for you to reach the outside world; you get free batteries to reach the internet. The thing it does not allow you to do is connect to a pre-existing network.
If, for instance, you want to connect to a database that's deployed on an existing network, well, you're out of luck; Kubernetes does not solve this. More things: if, for instance, you want to deploy a VNF and you require more than one interface, Kubernetes will also not do that for you. There are solutions out there, but we're not going to go there right now. So the motivation for our talk pretty much comes from the fact that you don't have an entryway to access stuff on physical networks or to get access to additional networks. The default cluster network that comes bundled with Kubernetes, or that whatever Kubernetes distribution you have gives you, is not suited for all types of use cases. For instance, for virtualization, those of you that use KubeVirt should know that IPAM management for virtualization is extremely tricky and will not mix well, pretty much because you get different IPs when you migrate from the source to the destination pod, and that will not play along correctly. And finally, in virtualization, you typically use secondary networks to do all sorts of east-west communication, and you pretty much rely on the default cluster network just to get the batteries of the Kubernetes services, stuff like cluster DNS and things like that. So on your secondary networks, you need to figure out other ways. You could use the bridge CNI and other types of plugins, but that means your operations team will need to know how to debug yet another, totally different solution, your admins will need to know and configure yet another bunch of solutions, and depending on the use case, you'll realize that this plugin will work but that other one will not.
So the matrix of things that your operations team has to know and your administrators need to learn how to configure skyrockets, and it becomes literally too expensive to handle. So now the objectives. The first is that we want these cluster admins to go do something else: we want to take all the complexity of these different sorts of use cases and this mix-and-match of technologies out of their heads and out of the pile of tools that they need to learn and know. We want to push all this complexity into the network. And finally, we want a single plugin to be able to handle a multitude of use cases. So pretty much what we want is for whatever CNI comes bundled with our Kubernetes distribution to work properly both for the cluster default network and for the secondaries. Okay, so a very short introduction now. KubeVirt. KubeVirt is a Kubernetes add-on that allows you to run virtual machines inside pods. So you basically get two different types of workloads, pods and VMs, and you manage them from the same solution. As an implementation detail, the virtual machine actually runs inside of a pod, and each pod has a libvirt instance running in it, and the QEMU process and all that. Just to finalize, the networking requirements a virtual machine has are a lot more than a pod's: a pod is something entirely stateless, it's like cattle, you just kill it and a new one will spawn and do the new thing, while a VM is stateful and you need to treat it very carefully. Now, kind of a little disclaimer: the SDN solution that we developed uses OVN, and OVN stands for Open Virtual Network. It is essentially an SDN control plane for Open vSwitch. So you have Open vSwitch installed on each of the nodes of the cluster.
You have OVN on top, which is kind of rendering OpenFlow and installing it in each of the Open vSwitch instances on the nodes, from higher-level entities: you have things like a logical switch that grants you L2 connectivity between the workloads on these two nodes, and this thing gets rendered into OpenFlow and installed on the nodes. Then we have OVN-Kubernetes on top of it. It's a CNI plugin, and what it does is translate from Kubernetes entities into OVN logical entities. So for instance, when we have a secondary network, what we end up having is a logical switch; when you have pod attachments, what we have is logical switch ports that are connected to the logical switches; and a network policy, for instance, is nothing more than a port group that associates a list of logical switch ports with a bunch of ACLs. Alright, so, supported use cases. As I said initially in the motivation section, for virtualization use cases you mostly do not rely on the default cluster network to do east-west communication; you use secondary networks. So the first use case we are focused on is east-west communication. As you can see here, these things here are pods or virtual machines, it doesn't matter which: what we actually end up doing is attaching a new network interface to each and configuring it, and what we get is the logical view of having them connected via a cluster-wide switch. That's literally what we get: a cluster-wide switch, a connection to this cluster-wide switch, and everybody that is connected to it can communicate across that network. There's a short demo right here and we'll see that. Oh god, no internet. I knew that, that's why I have this terminal here. I'm really sorry for the font size, but if I make it a little bit bigger it'll basically mess up the window configuration. I hope you can still see it.
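The Kubernetes-to-OVN translation described above (secondary network becomes a logical switch, pod attachment becomes a logical switch port, network policy becomes a port group plus ACLs) can be sketched with plain dictionaries; all names here are invented for illustration:

```python
# Illustrative mapping of Kubernetes entities to OVN logical entities.
def nad_to_logical_switch(namespace, nad_name):
    # a secondary network (NetworkAttachmentDefinition) -> a logical switch
    return {"type": "logical_switch", "name": f"{namespace}_{nad_name}"}

def pod_attachment_to_lsp(switch, pod):
    # a pod attachment -> a logical switch port on that switch
    return {"type": "logical_switch_port", "switch": switch["name"], "pod": pod}

def policy_to_port_group(ports, acls):
    # a network policy -> a port group tying switch ports to a set of ACLs
    return {"type": "port_group", "ports": [p["pod"] for p in ports], "acls": acls}

ls = nad_to_logical_switch("blue", "tenantblue")
lsp = pod_attachment_to_lsp(ls, "blue-server")
pg = policy_to_port_group([lsp], ["allow-tcp-9000"])
```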
So the first thing we're going to see is what the network configuration is. I'm not sure if you're used to using Multus; the ones of you that use KubeVirt, I guess you are. So this is the first thing that we need to look at: the NetworkAttachmentDefinition. This pretty much holds the CNI configuration from which the CNI plugin will configure networking for your pod. The interesting thing here is the name of the network. The idea here is that the networks are not namespaced, but the network attachments are. This means that if your network admin wants to grant a namespace access to a network, he or she needs to provision one of these in the proper namespace, and this will connect your namespace to this network. (Oops, sorry, it does not go back.) Another interesting thing that was there was the topology, which is layer 2, so pretty much what we have is an overlay network; this is totally disconnected from the physical network, and it allows you to have east-west communication. And we do not have IPAM, because IPAM for these workloads is very tricky, and we'll see more on that later on. So we're connecting two different namespaces, as I said. Now we're going to provision these into the cluster. Fun fact: this is all lazy, so I just put a bunch of stuff into the cluster but nothing happened yet; just the definitions are provisioned and nothing else. And now we're going to show the workload definitions. Here they are. We have one virtual machine; remember, we do not have IPAM in the network, so we need to configure the IP statically. There at the bottom we configure the IP statically using cloud-init in the VM, and that's its IP, 192.168.200.10.
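A NetworkAttachmentDefinition along the lines of the demo might look like the following, built here as a Python dict for clarity. The network and namespace names are made up, and the exact CNI config keys should be checked against the ovn-kubernetes documentation:

```python
import json

# Sketch of a layer-2, IPAM-less secondary network attachment (names assumed).
nad = {
    "apiVersion": "k8s.cni.cncf.io/v1",
    "kind": "NetworkAttachmentDefinition",
    "metadata": {"name": "tenantblue", "namespace": "blue"},
    "spec": {"config": json.dumps({
        "cniVersion": "0.4.0",
        "name": "tenantblue",            # the network itself is not namespaced
        "type": "ovn-k8s-cni-overlay",
        "topology": "layer2",            # overlay, detached from the physical net
        "netAttachDefName": "blue/tenantblue",
        # no "subnets" key -> no IPAM; workloads set their IPs statically
    })},
}
```

Provisioning the same network name in another namespace's attachment is what connects that namespace to the shared network.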
And then we have our pods. We have two pods specified here: the blue server pod and the yellow server pod. The blue pod has the .20 IP, which we configure using the network selection elements (this is Multus lingo), and what it is doing is exposing an HTTP server on port 9000. So we have two servers, the blue and the yellow, the VM is the client, and we're going to be curling each of the servers, and they're going to echo back their hostnames. So that's what we're going to see. Well, let's first provision this. The windows at the bottom are so we can kind of see when they're ready. So the servers are ready; the tenant, the VM, is booting up. Let's speed up this part. We're now going to log in via console to the virtual machine, and we're going to curl both servers, and they're going to echo back their hostnames. God, it's going to take forever; if only I had internet I could play a video, oh god. Does anybody know how you can tell asciinema to speed up? Wow, amazing, I don't know. It stopped again. Okay, it's playing, cool. So yeah, login via console; I hope it did not stop. What's happening? Yeah, I should have known better, but again, we log in via console. The UI of this thing is absolutely preposterous; I don't know if it's playing or not. Okay, so yeah: you curl the .20 IP address, and it replies with the blue server name; we do the same thing with the .30 IP address, and it replies with its hostname, which is the yellow server. This concludes the first demo, which shows us east-west communication between different workloads in different namespaces.
Now on to the second use case. Remember the motivation slides where I mentioned accessing things on a pre-existing physical network? That's exactly what we're going to be seeing now. As before, we have the logical view of a cluster-wide switch; the difference is that this switch is actually connected to a physical network, and you can access stuff that's there. In our example it's going to be a database that has the data the VM needs. The first thing we need to elaborate on a little bit is that you need to configure the physical network. First of all, it's not something that a typical user will get access to; it needs to be done by a cluster admin. And for that we're going to be using two tools: NMState and kubernetes-nmstate. NMState is basically a declarative tool that configures networking. You just give it the desired state, "I want my network to be like this", and it's going to go pushing buttons, attempting to make the current state be what you desire it to be. If it fails, it rolls everything back, so there are no changes to your network and it cannot destroy it; and if it succeeds, it'll tell you that you succeeded. So basically what we want to do is use kubernetes-nmstate, which is a cluster-wide thing: send a YAML specification to the cluster, and my network specification will be applied on all the nodes in the cluster. It would look like this. On the left we have an example of a policy, and on the right we have a diagram of the topology we're trying to build here. If you look here, this is going to be applied to all the Kubernetes worker nodes because of this node selector that we have here. And what we're going to do is create an OVS bridge on each of these worker nodes, attach this ens4 interface to the OVS bridge, and then, using these OVN bridge mappings at the bottom, we're saying that we want traffic from the network called default to be sent
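A NodeNetworkConfigurationPolicy in the shape of the slide could be sketched as follows, as a Python dict for readability. The interface names (ovs1, ens4) and the network name are assumptions, and the exact schema should be checked against the NMState documentation:

```python
# Sketch of an NMState policy: OVS bridge on all workers, NIC attached,
# plus OVN bridge-mappings routing a physical network to that bridge.
policy = {
    "apiVersion": "nmstate.io/v1",
    "kind": "NodeNetworkConfigurationPolicy",
    "metadata": {"name": "ovs-bridges"},
    "spec": {
        "nodeSelector": {"node-role.kubernetes.io/worker": ""},
        "desiredState": {
            "interfaces": [{
                "name": "ovs1",
                "type": "ovs-bridge",
                "state": "up",
                "bridge": {"port": [{"name": "ens4"}]},  # attach the NIC
            }],
            "ovn": {"bridge-mappings": [
                # traffic for this physical network goes to this OVS bridge
                {"localnet": "tenantblue", "bridge": "ovs1", "state": "present"},
            ]},
        },
    },
}
```

kubernetes-nmstate then applies (or rolls back) this desired state on every node matching the selector.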
to the OVS bridge called br-ex, and we want the traffic from the tenant-blue physical network to be sent to the other OVS bridge. It's literally the diagram you see there on the right. Now we are granting access from workloads to the physical network; you should tread carefully when you do this, and for that we need to have micro-segmentation. This is pretty much what you have for the primary network, the cluster default network, with network policies; this is the exact same thing, but applied to secondary networks. So in our example, what we want to have is a virtual machine that wants data from the database, but we do not allow it to consume the data directly from the database; we expose that information from a pod. So the pod can connect to the database, and it exposes this information via a RESTful API over port 9000. So this is what we want to do: ensure that the VM cannot reach the database directly over the PostgreSQL port, but ensure it can via this tiny pod as its data proxy. So again, another demo; this is going to be a disaster. I'm tempted to tell you to just check this at home, but how much time... we have more than five minutes, right? Again, this does not work; sorry, it's the other cast. So in this demo we have two namespaces, data-consumer and data-adapter, and we just provisioned them. First, some information: I'm running a kind cluster here (and I botched this again), so my Kubernetes nodes are running basically as containers on my laptop, and the physical network that we see in the diagram is basically my laptop; it's going to be connected via a Linux bridge on the laptop. And for that, since I'm using a VLAN, I need to pre-provision the VLAN, and that's the interface you see here at the bottom, this podman1.123: it's a VLAN on top of the Linux bridge management interface. I'm
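The micro-segmentation described here is commonly expressed as a MultiNetworkPolicy targeting the secondary network. A rough IP-block-based sketch (names, namespace, and CIDR invented) that permits only the proxy port and thereby blocks direct PostgreSQL access:

```python
# Sketch of a MultiNetworkPolicy for the secondary network: only port 9000
# (the pod's REST proxy) is reachable; the PostgreSQL port stays blocked.
policy = {
    "apiVersion": "k8s.cni.cncf.io/v1beta1",
    "kind": "MultiNetworkPolicy",
    "metadata": {
        "name": "allow-proxy-only",
        "namespace": "data-adapter",
        # which secondary network this policy applies to
        "annotations": {"k8s.v1.cni.cncf.io/policy-for": "blue/tenantblue"},
    },
    "spec": {
        "podSelector": {},               # all pods on that network
        "policyTypes": ["Ingress"],
        "ingress": [{
            "from": [{"ipBlock": {"cidr": "192.168.200.0/24"}}],
            "ports": [{"protocol": "TCP", "port": 9000}],  # not 5432
        }],
    },
}
```

As the roadmap notes, only IP blocks work today; label-based selectors for secondary networks are future work.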
going to show this again: so again, podman1.123, that's the subnet, and we have a containerized database running here, and you see we have access to the database. Now let's check our manifests... really sorry. So I think you should check the demo at home, and we have five minutes; yeah, I don't think this is going to work. Please do check the demo at home, but pretty much what you'll see is what I showed in this diagram: you'll have direct access from the VM to the database on one port, like you can psql to the database directly, and you can get the data over HTTP from the pod. Then I provision some policies, and then you stop having access to the database. That's pretty much it. Now, the roadmap: what are we going for next? The first thing we need is IPAM for the workloads: we need to find a way to tie the IP allocation to the virtual machine and not to the pod where the VM runs. Remember our big issue with virtualization, which is migration: when you migrate the VM to a new node, the pod gets a new interface, the VM still has the old interface, and basically networking is not properly configured. We want to address that first, and that will unlock the next thing in our queue, which is selective policies. Our kind of policies for the secondary networks right now can only use IP blocks; you cannot use semantic things like "I want to allow all workloads from this namespace having these labels to access this sort of workload". You cannot do that; you need to specify IP ranges directly. Once we have these two things, we're going for services next: we want to have egress from VMs, and load balancer services so that you can access them directly over the secondary networks. Finally, self-service networks: instead of having the cluster admin provision these for you, since a simple network overlay does not
touch the physical network, you could directly use a self-service functionality to just create the network yourself and provide east-west connectivity to your workloads. Okay, lessons learned. This was, let's say, a joint venture, a collaboration between Red Hat and NVIDIA, and the fun thing is we had the exact same use cases in mind, both of us focusing on KubeVirt but with different scope. We are a lot more into the generic kind of world: we give you a platform and you do whatever you want with your platform. NVIDIA has their use case in mind, which is, I guess, gaming in a data center, and their tooling is a lot more, let's say, pointed in that direction. But it was a really nice collaboration, and we're hoping to see more in the future. On a less positive note, the user experience of this is not amazing. It's better than originally intended because, for instance, the thing I showed you about the NMState policy was something we came up with because we felt this feature was otherwise entirely unusable: people were going to break their default cluster network every time they used this, or risk doing that. So we provided that, but we still have some nightmarish stories every now and then, because of the way network attachment definitions look, how easy it is to get things wrong, and how these silent errors creep up. It's absolutely insane: sometimes things work, but not in the way you expect, because it just doesn't recognize one of the parameters because you typed it wrong, but everything else works. It's insanely hard. And yeah, I'm really sorry about the demo, but thank you for listening, and if you have any questions... in one minute, one question. Is it the same thing? Yeah, so the question is basically: there's another CNI. We're doing this in OVN-Kubernetes, and there's another CNI called Kube-OVN, so it's kind of
it really screams that it does the same thing, and yeah, it really does the same thing; the use cases are mostly the same. The thing is that they do a lot more than we do; quite honestly, their feature set is a lot more complete than ours, and we're trying to get there. If your question is "why didn't you just use that?", well, we do not have any, let's say, current stake in that CNI; we do not have maintainers there; we would have to try to gain entry there, and in some cases we do not like their API, so we're trying to do things in another way. It might seem like we're reinventing the wheel sometimes, and yeah, we kind of are: we both do the same thing and their feature set is more complete, but again, we're trying to do more and reach their feature set. Thank you for the question, and I think that's it.
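For readers following along at home, the NodeNetworkConfigurationPolicy described in this talk might look roughly like the sketch below. The bridge name, NIC name (ens4) and physical network name ("tenantblue") are taken loosely from the talk's diagram, and the exact field names are assumptions to be checked against the kubernetes-nmstate documentation:

```yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: ovs-bridge-tenantblue
spec:
  # Apply on every worker node, as in the talk.
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
      - name: ovs1
        type: ovs-bridge
        state: up
        bridge:
          port:
            - name: ens4   # physical NIC attached to the bridge
    # Map the logical physical-network name used by the secondary
    # network's attachment definition to the OVS bridge.
    ovn:
      bridge-mappings:
        - localnet: tenantblue
          bridge: ovs1
          state: present
```

If NMState cannot reach the desired state it rolls the change back, as the speaker explains, so a bad policy should not leave the node's networking broken.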
Deploy Kubernetes... From Kubernetes: an overview of Cluster API
You're famous. Yeah. That's it. Close the door behind you. Okay, let's go. I hope you hear me correctly, people on the side and in the room and people online. So, yeah, let's begin. Thanks to FOSDEM for having me today to talk about Cluster API and Kubernetes, and thanks to you for coming here. It's quite impressive to see the room fully packed. I hope you will learn things, I hope you will discover things, that's the most important, and you will get some stuff to continue to investigate at home. So the goal of this talk is to give you a brief introduction to Cluster API. Let me quickly introduce myself. I work as a Flatcar engineer at Microsoft. Flatcar is an operating system designed to run container workloads. If you want to learn more about Flatcar, you can go at 5:15 to see my teammate Thilo talk about Flatcar; it's a deep-dive introduction, and it will give you the key elements about this operating system, but that's not the purpose of today. Outside of work, I'm part of SRE France, which is a DevOps association where we organize meetups and conferences in Paris and in France, so if you want to talk at a meetup or if you're interested in organizing something, let me know. Context: the context is Kubernetes. Kubernetes is quite the answer to everything today. If you want to deploy something small or something big, there is a big chance that you're going to use Kubernetes.
So to me, it has become a great standard, I think we can use that term. That's the cool thing with Kubernetes: you can deploy small things and big things and it works, and it works the same way whether it's a big thing or a small thing. Something to know about Kubernetes is that there are two aspects to this technology. You can consume Kubernetes, meaning you deploy your application on it, and that's cool. And you also have to deploy and maintain the Kubernetes cluster. You can do both if you want to, or only one aspect or the other. But today, let's focus on deploying and maintaining Kubernetes clusters, and not on how to use a Kubernetes cluster. Two or three weeks ago, I was on Twitter checking some news, what's going on in the tech industry, and I saw a tweet from a person I've met at different conferences. A tweet about: hey guys, what if I write a book describing all the ways to deploy Kubernetes? It was an idea like that; it got some traction in the end, and he started to draft a list of all the ways to deploy Kubernetes. He is at FOSDEM currently, so if you want to talk with him about his book, or if you want to invest in his book, it's a great opportunity to meet him. He has a talk in the Go devroom this afternoon. But we're not talking about Go, we're talking about Kubernetes and the fifty shades of deploying Kubernetes. You can use binaries, you can use managed services, you can use platforms, a bunch of ways to deploy Kubernetes. But today, let's have a look at line 27 or 26, something like that: it's Cluster API. Cluster API, if I quote the documentation, is a Kubernetes sub-project focused on providing declarative APIs and tooling for clusters; you can read it. The most important is the last line: the Cluster API project uses Kubernetes-style APIs and patterns to automate cluster lifecycle management. In other words: use Kubernetes to deploy Kubernetes.
So that's the cool thing with Kubernetes: you can extend this technology using CRDs, custom resource definitions. You can extend the technology and benefit from the reconciliation loop of Kubernetes for whatever you want to do. It's already available for the basic usage of Kubernetes, but you have other projects, like Crossplane, that leverage this way of managing lifecycle on the provider side. Cluster API is this kind of stuff. So if we take a really abstract look at how it works, you have two clusters. On the left is the management cluster; this is the pilot of everything, this is where things happen. And you have the workload cluster; this is where you run your workloads. Your website, your SaaS, whatever, it will run in the workload cluster. This is what you currently do if you have a Kubernetes cluster. But before that, we have the management cluster. You're going to tell the management cluster: hey, I need a cluster on this provider, please deploy everything I need to run this Kubernetes cluster. Because to deploy a Kubernetes cluster, you need networks, you need security groups, a bunch of things to create on the provider. Well, the management cluster will take care of that and deploy things for you, and you don't have to do anything. That's the way to see things. In this example, my management cluster is running with Kubernetes in Docker, kind. This is pretty convenient because I can run my management cluster on my local laptop, on tiny resources, because I just need to deploy one single control plane. I don't need high availability and stuff like that; I just want to use the Kubernetes APIs. And the workload cluster in this case is running on OpenStack, just for illustration. As long as you have network connectivity between those two clusters, and you have credentials, of course, it will work.
And you can even decide to migrate the management cluster from one cluster to another, but that's something else. Those are the key elements to understand if you want to grasp the concept of Cluster API. So this is my kind cluster; I just have one single control plane running. That's it, nothing fancy, nothing to do: just "kind create cluster", and I have my management cluster. Nothing to install on top of it; just a regular Kubernetes cluster, really simple. Now, how can I create things on my cloud provider using Cluster API? Well, for people who already know Terraform or Crossplane, all these kinds of projects, you know there is no secret: you need to know the APIs of the cloud provider to implement them, to consume them, and to create what we call a contract. This is the border between the Cluster API logic and the cloud providers. So you need to teach Cluster API: okay, in Cluster API we say that a network is this thing; a network on GCP will be this thing, on OpenStack it will be this thing, and so forth. The idea is to teach Cluster API how to manipulate and deploy resources on the cloud providers, and for this we use what we call Cluster API providers. On the kubernetes-sigs GitHub organization you can see all the various providers supported: OpenStack, GCP, public cloud, on-premise (that's Tinkerbell in the upper right). So you have a bunch of providers, and if you have some knowledge of Go programming, of API consumption and stuff like that, feel free to start contributing to these providers, because this is a cool way to start contributing to Kubernetes and the Kubernetes ecosystem. So that's what's going on under the hood. Now I have my management cluster; I need to create my workload cluster configuration.
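The management-cluster setup just described can be sketched as two commands. This is an illustration rather than the speaker's exact invocation; the provider name follows clusterctl's built-in provider list:

```shell
# Create a single-node management cluster locally.
kind create cluster --name capi-mgmt

# Install core Cluster API plus the OpenStack infrastructure
# provider into it.
clusterctl init --infrastructure openstack
```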
So I just use the clusterctl (cluster CTL, cluster cuddle, whatever you call it) command to generate this YAML configuration file, and I just provide a few key elements: the flavor, the Kubernetes version, how many control planes I want, how many workers I want. One interesting thing is the flavor. Cluster API relies on templates, and these templates are provided by the maintainers of the providers. So, for example, the Flatcar template will deploy a workload cluster based on Flatcar images. You have some flavors, for example, on OpenStack, with load balancing, if you need load balancing services and stuff like that. So a flavor is a way to customize the deployment of your workload cluster. You will still have a workload cluster in the end, you get a Kubernetes cluster, that's fine, but you can decide to customize it. These flavors, these variants, are tested using end-to-end testing, so each time there is a new release of the providers, you can be sure that it passed the CI, and you can safely update your configuration. Of course, for clarity, I didn't mention that you need to provide a few more environment variables, for example to provide the credentials. Cluster API is going to create some things on GCP, on AWS, on OpenStack, whatever; it needs access to this infrastructure, so it needs the credentials. This is an example of things you can pass, but you can also define which instance size you want to use for your control plane and which instance size you want to use for your workers. These are the kinds of elements you need to configure before calling this command. But yeah, just for demo purposes, I wanted to show you this command line, which is the bare minimum to generate the cluster configuration. And now we have the capi-quickstart.yaml file. We can apply it like any other Kubernetes manifest file: kubectl apply -f capi-quickstart.yaml. And it will create, as usual, some resources on my management cluster.
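A sketch of that bare-minimum command line, with a few of the environment variables the speaker alludes to. The variable names are assumptions modelled on the OpenStack provider's quickstart, not taken from the talk:

```shell
# Credentials / sizing the provider needs (illustrative names).
export OPENSTACK_CLOUD=mycloud
export OPENSTACK_SSH_KEY_NAME=capi-key
export OPENSTACK_CONTROL_PLANE_MACHINE_FLAVOR=m1.medium
export OPENSTACK_NODE_MACHINE_FLAVOR=m1.medium

# Render the cluster manifest from the flatcar flavor template.
clusterctl generate cluster capi-quickstart \
  --flavor flatcar \
  --kubernetes-version v1.29.0 \
  --control-plane-machine-count 1 \
  --worker-machine-count 3 > capi-quickstart.yaml

# Apply it like any other manifest.
kubectl apply -f capi-quickstart.yaml
```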
So we can see that there is the cluster definition, there's the machine deployment. This is common to Cluster API. Then we have the OpenStack-specific part, and from one provider to the other, of course, the output will be different. But that's the idea: you just apply this. That's pretty convenient because you have a file, so you can put this file in a Git repository, use it in a CI, use it with whatever you want. You have infrastructure as code in terms of Cluster API. Now, if I check on the provider side, I have some resources that are being created by themselves... not by themselves, by Cluster API. You can see that I have some instances: I asked for one control plane and three worker nodes, so we can see four instances being created. I have a network, I have security groups, I have SSH keys. This is for OpenStack, but once again, it's the same thing for every provider. This is the cool thing about Cluster API: it does not just deploy a cluster, it deploys everything needed to create a cluster: instances, security groups, firewall rules, ingress, egress, whatever you need. When everything is up and running, you just get your kubeconfig that you can inject into kubectl, and then you have a new cluster ready to be used. So that's it for OpenStack. Now, we can ask ourselves what's under the hood on the operating system side. I'm a Flatcar engineer, I work in the operating system field, so I'm curious to know what powers my nodes. With kubectl, we can inspect the nodes and see that, for example, this one is running Flatcar, because I asked for the Flatcar variant. But with Flatcar, we do not ship kubeadm, we do not ship the kubelet service, we do not ship the miscellaneous files. So how can my nodes start acting like Kubernetes nodes? How can things work? On top of that, Flatcar is immutable; there is no package manager.
So there is no way Cluster API is going to SSH into that node and say, okay, apt install kubeadm. No, no, no. So what's the magic behind it? It's another project called the Image Builder; it's on the kubernetes-sigs GitHub organization, the image-builder project. The idea is to take an OS, for example Ubuntu, build it with Packer (nothing new under the sun), and inject kubeadm, the container runtime, the miscellaneous files, whatever you need to power Kubernetes nodes. It's a three-step thing: you take your OS, you inject the Kubernetes components, and then you export this new image, this golden image as we sometimes call it, to the provider of your choice: OpenStack, GCP, AWS. So you understand this is somewhat complicated, because in order to use Cluster API, you need to use this kind of image. It means I can wait for someone from the community to build it. The build of the image is not an issue; everyone can build an image. It's more about the storage. Storing one image is something, but you have to store an image for each Kubernetes version (that's three supported versions at any time), then keep an image for each cloud provider I want to use, and keep an image for each different version of Ubuntu. It can be complicated to store everything and to have the time and energy to build these images. But this is what we currently do with this provider. I will not say this is the way to do it, but this is commonly done currently in the open source world. So we can think about an alternative, and the true spirit of the open source world, to me, is to have alternatives. There is not just one way to do things; there are alternatives, and you choose which one is the best for you. So an alternative would be: okay, I take a Linux-based OS, for example Ubuntu, Flatcar, whatever. It's already available on GCP. It's already available on AWS.
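The three-step image-builder flow described above looks roughly like this. The make target name is an assumption (the project has per-provider, per-OS targets), so check the image-builder repo for the exact one:

```shell
git clone https://github.com/kubernetes-sigs/image-builder
cd image-builder/images/capi

# Install the build dependencies (Packer, Ansible, ...).
make deps

# Build an Ubuntu 22.04 image with kubeadm, containerd, etc. baked in;
# the qemu output can then be uploaded to OpenStack as a golden image.
make build-qemu-ubuntu-2204
```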
It's already available on DigitalOcean, on Azure, because the cloud providers provide these images for you. So the vanilla image, a fresh image, is already available. What if now we download the Kubernetes components during the boot of the image? In the end, we have the same result: a Linux-based OS with the miscellaneous files, with kubeadm, everything I need to power my nodes. This is something we implemented on the OpenStack side. You just need to use another flavor; it's flatcar-sysext, sorry. And it leverages this new feature of systemd called systemd-sysext. Basically a sysext is an image, a raw SquashFS image, that you're going to mount as an overlay on the Linux-based system, and it will bring new binary files and new configuration files into your system. So if you want to have a look at systemd-sysext, I really encourage you to check out this new feature of systemd, and this is basically what we do with this flavor: during boot, we download a kubeadm systemd-sysext image, and everything will be up and running to power my node. Just for example, if I SSH into the node, I can run "systemd-sysext list", and it will show me the Kubernetes image being available. What's cool with this approach is that you remove the strong binding between the Kubernetes version and the image version. If you want to update Kubernetes but you don't want to update your base OS, you can. If you want to update your base OS but you don't want to update Kubernetes, you can. Before, you were supposed to build new images and stuff like that. And the cool thing is that systemd-sysext works the same way on AWS, on Azure, on-premise, wherever, so you have just one configuration for all the cloud providers. That's something to keep in mind. And we discussed with the Cluster API folks to see what could be the new approach for this.
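To make the sysext idea concrete, here is a hedged sketch of building and merging a minimal extension image by hand. Paths and names are illustrative; the speaker's actual images come from the pre-built bakery repository mentioned later in the Q&A:

```shell
# Lay out a tree that will be overlaid onto /usr on the host.
mkdir -p kubernetes/usr/local/bin kubernetes/usr/lib/extension-release.d
cp kubeadm kubelet kubectl kubernetes/usr/local/bin/

# A sysext must declare which OS it targets (ID=_any skips the check).
echo 'ID=flatcar' > kubernetes/usr/lib/extension-release.d/extension-release.kubernetes

# Pack the tree into a raw SquashFS image.
mksquashfs kubernetes kubernetes.raw -all-root

# On the node: drop it into /var/lib/extensions and merge the overlay.
#   cp kubernetes.raw /var/lib/extensions/
#   systemd-sysext merge
#   systemd-sysext list
```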
We also attended some office hours of the Cluster API ecosystem to demo this. It's already available on OpenStack, and we hope it will be available in the other providers. A few resources if you want to continue checking this at home: the Cluster API website, Cluster API OpenStack (this is for the example I've shown), Flatcar, and systemd-sysext, which is the main takeaway at the end of this talk. So, to conclude, I would say this talk was to give you the key elements of what's going on in Cluster API and how it works, but also to give you an overview of what we're currently working on in this area of Cluster API. So yeah, thanks. And of course I forgot to add the QR codes, but you can find the links on the FOSDEM website. Thanks for your attention; if you have any questions... or maybe on the chat, if there are some. I didn't see anything, but we can start with you. Do we have a mic? No? Then we will repeat the question. "I have a philosophical question about Cluster API. What's the lifecycle of a Kubernetes cluster to you? Is it long-running and you upgrade it, or is it temporary and you just replace it with a new one?" So the question is about the lifecycle of the cluster: do we need to replace each node when there is a new version, is that correct? "Yeah, but what is the intended purpose: long-lived clusters, or just short-term ones?" So the question is: do we use Cluster API for long-term usage or short-term usage? I'd say that Cluster API is in the philosophy of leveraging Kubernetes, which means using the reconciliation loop of Kubernetes. Things can fix themselves: if there is, I don't know, an instance going down, it can be restarted by the management cluster, which takes care that there is a desired state and you still want to be in this state.
Because with Terraform, for example, there is a state, of course, but it's not live; it's a static state. So if something goes missing, you need to rerun plan and apply to be sure that your things come back. That's one of the differences. Yes, sir? "Why clusterctl to generate a template instead of using Helm templating?" That's a great question. Can you repeat? Yeah, so: why use the clusterctl CLI tool instead of using Helm or Kustomize or tools like that? The idea of clusterctl is that it provides some sugar on top of the manifest generation, so you can manage your clusters; that's one of its features. But you also get variable injection: when you generate a template, it will check if there are some missing variables required by the providers. I think you can do the same thing with other tools, but clusterctl is just handy, and this way you can be sure that you don't miss an environment variable needed to configure your deployment. Yes? "In terms of the overlays for Flatcar, how hard or easy is it to build custom overlays? Say you've got OEM integration; what's the tooling to support that?" So the question is about the overlays and how to build these images, the systemd-sysext images, if I understand correctly. You don't have to build them, because we provide them... "Say you wrote custom security tooling..." Yeah, you can always decide to fork the repository I mentioned, which is called the Flatcar sysext-bakery, where we provide these images. So you can fork it, do your stuff, and why not send a PR, if it's something relevant for the community. And it's just a matter of SquashFS: if you have the SquashFS utility tools on your system, you can just build your images. Basically, everything will be in a directory, then you convert this directory to a SquashFS image. Yes? Does the machine deployment controller do any sort of
forcing reconciliation, so if you were to delete the instance in OpenStack, would it be recreated? Yeah, not immediately, but a few seconds or minutes after, it says: okay, I have to have four machines, I only have three, the missing one is a worker node, so I need to go recreate it. So the question was about: if an instance disappears from the OpenStack or the provider's dashboard, does the management cluster restart the instance? Yes. "As a Kubernetes admin, I really love managing my Kubernetes clusters with Kubernetes resources, but I always hit the bootstrapping problem: how do I manage my management cluster? I tried so many projects, and I'd love to use Cluster API, but it never makes sense, because in the end I end up using Kubespray for the management cluster, and I can just extend it using Kubespray." Yeah, the bootstrap issue; everyone has the same question. So the question is about how you create the management cluster. I think this logo is representative of the issue: it's the turtles, the turtles that you stack, and in the end there is no single answer, because your management cluster can itself be handled by another management cluster. So you can chain Cluster API if you want to, why not; that's something you can consider. And on the new workload cluster, you can just say: okay, now it's a management cluster, so I'm going to deploy Cluster API on the workload cluster. That, of course, is mostly theoretical, and it's not the point. But for your management cluster, as I said, you can use something really simple. I think there is this new Kubernetes tool you can use, like a deployment without kubelet or something like that, because you just need the Kubernetes APIs in the end. So why not come up with something like that, to just deploy a set of APIs, and that's it.
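The migration the speaker mentioned earlier (moving a management cluster from one cluster to another) is what clusterctl calls a pivot: after a temporary bootstrap cluster has created a workload cluster, the CAPI objects can be moved into it so it manages itself. A sketch, with the kubeconfig path as an assumption:

```shell
# Move Cluster API resources from the current (bootstrap) management
# cluster into the freshly created workload cluster.
clusterctl move --to-kubeconfig=workload.kubeconfig
```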
But you can, yeah, we can do things like that. You can decide to use kind, for example. That's the best way to deploy things for the management cluster. Time's up. Okay, thanks everyone. Have a great day. Thank you.
Operating Kubernetes Across Hypervisors with Cluster API & GitOps
Hi everyone. Welcome to this talk on Cluster API across hypervisors and with GitOps; so we've got a lot of the hype words in there. My name's Richard. I'm a maintainer of various Cluster API providers, most notably the AWS provider, the GCP provider and the RKE2 provider. Hey, I'm Alex. I work together with Richard at SUSE. I'm a CAPI operator maintainer, and I also maintain the RKE2 provider. So today we are going to talk about Cluster API. Is this mic only for the stream? Okay, I'll just speak louder. So today we are going to speak about Cluster API, GitOps and a couple of virtualization providers. We'll briefly talk about what CAPI is; there was a previous talk about this, but just in case you weren't there, we will repeat some of it. We will tell you and show how Proxmox integrates with CAPI and how GitOps can be added to that, and then we'll replicate the same process with KubeVirt to show that CAPI can work with different infrastructure providers. Cool. So all the demos, well, the two demos in this session, are available via this repo. Feel free to take a picture of it; it's got the full script, so you can actually run this yourselves when you get home. I'll leave that up for a second. So, who was in the last talk about the intro to Cluster API? Okay, cool. So you get the idea that you have a management cluster... oh, yep, sorry. You have the idea that you have a management cluster, and into that management cluster you install Cluster API. Now, Cluster API is made up of the core controllers and a bunch of providers, and you can mix and match those providers to meet your needs. So if you are provisioning in AWS, you just install the AWS provider. Once you have that, you then declare the shape of your cluster; it's fully declarative, using Kubernetes custom resources, and you apply that to the management cluster. Then Cluster API does its magic: it will provision the infrastructure and then bootstrap Kubernetes on top of that infrastructure.
So we're going to demonstrate how it works on Proxmox. Just a couple of words about this in case you don't know what it is: it's a virtualization platform, it's open source, and it includes anything you need for your virtualization purposes. One thing to note is that there are two providers for Proxmox out there. One requires you to have a template pre-created within your environment; the other will essentially take a bare OS and install everything on top. We are using the one that requires a template. Yeah, so we made a diagram of what our cluster will look like in terms of Cluster API. Everything you see there is a Kubernetes resource, and all these resources represent the cluster, the Proxmox cluster, we are going to use. We'll have the main entities: of course our Cluster, which references the infrastructure and also references the control plane and how it should look in the Proxmox environment. And then another resource is the MachineDeployment, which is used for worker machines; it also references a template of how they're going to look on Proxmox, plus some configuration for bootstrapping Kubernetes over there. Cool, so over to the demo. We were going to do the demo live, but the network is not being nice to us; luckily we did record it, so let me just set this up. Can I do full screen? Yeah, that's what I tried; obviously didn't try hard enough. So hopefully you can see this all right. This just shows, initially, the repo that that link showed before. In that repo there are two branches, one for the KubeVirt side and one for the Proxmox side; we're obviously going to use the Proxmox branch here. And in it you can see all of the artifacts that we would have used in a live demo, and that you can use if you try this yourself. So, moving on to the prereqs.
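The diagram described above corresponds to YAML along these lines. This is a heavily trimmed sketch: the API versions and field names (including templateID) are assumptions based on one of the Proxmox providers and should be checked against its documentation:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: proxmox-demo
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: proxmox-demo-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: ProxmoxCluster
    name: proxmox-demo
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: ProxmoxMachineTemplate
metadata:
  name: proxmox-demo-worker          # referenced by a MachineDeployment
spec:
  template:
    spec:
      templateID: 100                # must match the pre-created VM template
```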
So as I mentioned, if you're going to do this yourself, you are required to have a template in your Proxmox environment. The way that you do this, if you want to do it in an automated way, is to use the Kubernetes image-builder project, which has specific make targets that will provision and build that base image for you. And actually, what we should see in a minute, if I change to that window... you can see it here: virtual machine 100 has been built using the Kubernetes image-builder project. So that's got everything on there required to bootstrap Kubernetes: versions of kubeadm, et cetera, already baked into that VM, and it's been marked as a template within Proxmox. Yeah. Cool. So the basic flow is: we're going to create the management cluster, then install a GitOps agent on it, and then create a cluster. I'm just going to fast forward here. We're using kind for our management cluster, so if I just fast forward... we're just preloading a bunch of images onto it, the idea being it would have made the demo a lot quicker. I'm going to start k9s in another window so you can see what is actually getting installed. So this is a plain vanilla Kubernetes cluster at this moment in time. One thing to note if you're going through the instructions at a later time: we've made a slight config change to the clusterctl utility configuration so that we can install an IPAM provider. Probably in the last session you went through the different provider types. The main ones are the control plane provider, infrastructure provider and bootstrap provider, but the newer provider types are the IPAM provider, which is especially useful for virtualized and bare-metal scenarios, and also the add-on provider type. So the way that you create a management cluster is with clusterctl.
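That step might look like the following. The provider names are assumptions based on clusterctl's provider registry, and the talk additionally pins exact versions, which is omitted here:

```shell
# Turn the kind cluster into a CAPI management cluster with the
# Proxmox infrastructure provider and the in-cluster IPAM provider.
clusterctl init \
  --infrastructure proxmox \
  --ipam in-cluster
```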
One thing to note here is that we're specifying version numbers. That was purely to pin the versions so that we could preload the images; you don't have to do that in your environment. And this will go away and install all of the providers and core CAPI into this cluster, turning it into a CAPI management cluster. If we fast forward a bit, you can see them installed now, and you can see the IPAM provider at the top there and the Proxmox provider. So the next step: we've got a management cluster, and we want to use GitOps in this scenario. You can use whatever GitOps agent you want. We're going to be using Fleet, but you could equally apply these steps, with slight modifications, if you wanted to use Flux, Argo CD, whatever your choice is. But we're using Fleet, so we just need to quickly install it with a couple of Helm commands, and we'll have it there. So we can fast forward a bit. Now we have the GitOps agent in our cluster, and we can start using GitOps to provision clusters. And this is where, I guess, the mixture of Cluster API and GitOps gets really interesting, because you can then create clusters via a pull request, which opens up all sorts of different scenarios. It also means you can perform all of the operations against that cluster via pull requests, so you have the full history of the changes, you can roll back, and all of those things you're used to with your applications if you're using GitOps you can now apply to your actual clusters and the cluster lifecycle. So in the repo, you'll see two folders. Funnily enough, the one with the cluster definitions in is the clusters folder; it's just got the one cluster definition in there. We're going to bring it up now to have a look at what it is. It's just pure YAML. It describes the shape of your cluster, and there are different resources to represent different parts of the cluster, whether that's the control plane or the worker machines.
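The clusterctl invocation behind this step might look like the sketch below. The pinned versions are illustrative, and the `--ipam in-cluster` flag stands in for the configuration tweak mentioned above, assuming the in-cluster IPAM provider is the one being installed; check the clusterctl documentation for your version.

```shell
# Initialize the kind cluster as a CAPI management cluster.
# Versions are only pinned so images could be preloaded -- they are optional.
clusterctl init \
  --core cluster-api:v1.6.0 \
  --bootstrap kubeadm:v1.6.0 \
  --control-plane kubeadm:v1.6.0 \
  --infrastructure proxmox \
  --ipam in-cluster

# Verify that the provider pods (including IPAM and Proxmox) came up
kubectl get pods -A
```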
And it matches the diagram that we showed before in the presentation; basically, this YAML is what you saw in the diagram, just not visualized. So two things to note here. Just highlighting the fact that we are using Proxmox, so you will have resource kinds dependent on your infrastructure provider here, and likewise for the other types of providers. And just highlighting some labels here: if you just remember these labels say CNI Calico, we'll come back to that in a bit. And then we just see some various other aspects. One thing to note, we're also using kube-vip here. In this type of environment, you need some sort of load balancer so that you can get to the API server, so we're just using kube-vip as an easy way to do that, and it uses gratuitous ARP. So if the control plane machine it is currently hosted on crashes, it will move across and start advertising the address from another control plane machine. It's quite a nice setup. So we can fast forward there. Here you can just see the shape of the VMs that we want, the specifications, so this could be whatever you want. One thing to note is the template ID at the bottom, which says 100; that has to match the template that was created via the image-builder process. If they don't match, then things don't work. So we require a small amount of configuration for Fleet, and this will be similar for other GitOps agents. In this file, we create the GitRepo resource, and this just tells Fleet: hey, go to this source repo, download everything in there, and apply it to the cluster. You'll just see the repo URL, the branch that we require (we're on the Proxmox branch), and then potentially any paths or secrets that are required to access that repo. Cool, now we've done that.
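A minimal Fleet GitRepo of the kind described above might look like this; the repository URL and the resource name are placeholders for the sketch, and the branch and path follow the layout described in the talk.

```shell
# Tell Fleet to watch the demo repo and apply everything under clusters/
kubectl apply -f - <<'EOF'
apiVersion: fleet.cattio.io/v1alpha1  # sic: correct group is fleet.cattle.io
kind: GitRepo
metadata:
  name: capi-clusters
  namespace: fleet-local        # the namespace Fleet watches for the local cluster
spec:
  repo: https://github.com/example/capi-gitops-demo   # placeholder URL
  branch: proxmox                                     # branch with the Proxmox definitions
  paths:
    - clusters                                        # folder holding the cluster YAML
EOF
```

(The `apiVersion` should read `fleet.cattle.io/v1alpha1`; everything else follows Fleet's GitRepo schema.)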
We've applied that to the cluster, so it's going to bring all those cluster definitions into our management cluster, and then hopefully we start to get virtual machines being started and that cluster will be formed. Maybe. Cool, so you can see now that Cluster API has automatically created machines here for you. You'll see that there's one machine for the control plane and one machine for the machine deployment, or the worker nodes, and you can see that one has started to move to Provisioning. What that basically means is that it's going to provision the infrastructure and then start to bootstrap Kubernetes. So what does this mean from a Proxmox point of view? People with really good eyesight will probably see that there's a new VM starting up; you can see it in the events at the bottom there, a VM 104, and you'll see it on the side in the viewer. This is being orchestrated by the Cluster API provider. It's talking directly to the Proxmox API and saying: hey, create me a VM, I'm going to use it for this control plane machine. Now this part does take a while, so we're going to have to skip quite some way through. We'll just get to the point where we bring up the console, so you can see it's using Ubuntu, and if we fast forward a little bit, eventually cloud-init will kick in. Depending on how you configure the bootstrap providers, it will currently use either cloud-init or Ignition. This is using cloud-init, so you'll start to see cloud-init run, and that will essentially be running the commands to bootstrap Kubernetes on top of this VM, using kubeadm in this instance. Oh, we missed it. You'll see it, it will come up. So it does come up, and you can see that. So at that point, we have one control plane machine ready, essentially.
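From the management cluster, the provisioning flow described above can be followed with standard CAPI tooling; the cluster name below is hypothetical.

```shell
# Watch machines move through Pending -> Provisioning -> Running
kubectl get clusters,machines -A -w

# Tree view of the whole cluster: control plane, machine deployments,
# and their readiness conditions (cluster name is a placeholder)
clusterctl describe cluster demo-cluster
```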
Once one control plane machine is ready, you can then start to provision the worker machines; it always waits until one control plane machine is ready and then starts provisioning all of the worker machines in parallel. So we can fast forward that, and you'll see another VM come up, and I think you get the point: it just repeats the same things, but this time for a worker machine. So, well, I just skipped ahead; in the top part of the terminal window I have just got the kubeconfig for that newly created cluster. The kubeconfigs for the newly created child clusters, or tenant clusters, are available in the management cluster, so you can get them out and obviously do what you want with them. In this instance, I'm just showing that stuff is running in there. You can see that Calico is running in there. We didn't put Calico in the cluster definition as such, but if we go back to those labels on our cluster definition that said CNI is Calico: that is using a feature of Cluster API called cluster resource sets, and essentially this enables you to install any type of resources into a newly provisioned cluster automatically. It's really ideal for things like the CNI or cloud providers, to be able to do the things that you want as soon as that cluster is provisioned. And again, this is all done in a declarative way, so you don't have to run any special commands; you just put all of your definitions into Git, and then Cluster API will do the orchestration. This is what is in the second folder in the repo, the CRS folder. You'll see that there's a cluster resource set, and you'll see that there's a label selector, so if your cluster matches that label selector, it will apply all the resources listed below. Those resources are essentially just config maps or secrets, and they contain embedded Kubernetes YAML, so it will just squirt those into your cluster. So where are we now?
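A ClusterResourceSet along the lines described above could be sketched like this; the resource names are illustrative, and the referenced ConfigMap would embed the Calico manifests.

```shell
kubectl apply -f - <<'EOF'
apiVersion: addons.cluster.x-k8s.io/v1beta1
kind: ClusterResourceSet
metadata:
  name: calico-crs
  namespace: default
spec:
  clusterSelector:
    matchLabels:
      cni: calico              # applies to any Cluster carrying this label
  resources:
    - name: calico-manifests   # ConfigMap containing the embedded Calico YAML
      kind: ConfigMap
EOF
```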
So we've got one control plane and one worker machine, and I said that you could go into Git, scale the cluster, and do all your operations. So what we're going to show here is: if you go to the cluster definition in Git, we're going to scroll down until we get to the machine deployments, where it says replicas one, and we're going to change that to two, commit those changes to the Git repo, and you can probably guess what will happen now. A new VM is spun up, Kubernetes is bootstrapped on there, that node joins the existing cluster, and you'll see eventually that it does come up. So that is the Proxmox demo. Now we're going to show the same process with KubeVirt, and the idea is that you can use Cluster API to provision your clusters on multiple providers in the same operational way. The process for different infrastructure providers is very similar, with the difference being the infrastructure-specific parts, I mean, defining what your machine looks like, but the whole idea is the same, no matter where you run your clusters. The one major difference with the KubeVirt provider is that it requires KubeVirt to be installed in your cluster already. So before you install the CAPI provider for KubeVirt, you must have KubeVirt already installed. What you're seeing here is us installing MetalLB to take the place of providing the load balancers within this environment. Then we install KubeVirt. KubeVirt works on the basis that you describe your virtual machine as a custom resource, and then it will make that happen behind the scenes via QEMU. So this is what we're doing first, and this is before you get to any of the CAPI stuff; this is just setting up the Kubernetes cluster. You can see that KubeVirt is starting up. So now that we've done the KubeVirt setup and installed the provider, we're going to install the GitOps agent here now.
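The KubeVirt-side prerequisites described here can be sketched as follows. The versions are illustrative, and MetalLB's address pool configuration (an IPAddressPool plus L2Advertisement for your network) is omitted for brevity.

```shell
# Install MetalLB to provide LoadBalancer services for the API server VIPs
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.12/config/manifests/metallb-native.yaml

# Install KubeVirt: operator first, then the KubeVirt custom resource
KV_VERSION=v1.1.0   # illustrative version
kubectl apply -f "https://github.com/kubevirt/kubevirt/releases/download/${KV_VERSION}/kubevirt-operator.yaml"
kubectl apply -f "https://github.com/kubevirt/kubevirt/releases/download/${KV_VERSION}/kubevirt-cr.yaml"

# Wait until KubeVirt reports Available before installing the CAPI provider
kubectl -n kubevirt wait kv/kubevirt --for=condition=Available --timeout=10m
```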
So it's basically the same process, just slightly modified with different providers and different prerequisites, the prerequisite being KubeVirt. So forward again. In this second branch, you'll see a different cluster definition that uses KubeVirt, but essentially the way it's applied to the cluster is exactly the same, via GitOps. So what you take away from this is the same operational procedure irrespective of your target infrastructure or the flavor of Kubernetes that you want; you just create some YAML that describes the cluster that you want. So you'll see this is the interesting part: it will spin up a pod per VM, and that pod will then do the interaction to actually provision the virtual machine on the host. You'll see one of these spin up for each of the virtual machines that are required for the cluster. You can look at the boot logs via VNC as well. If you use virtctl in this scenario, just get the name of the node, use virtctl, and you'll see it's using exactly the same kubeadm commands that we saw previously with Proxmox. And then we do the same operation: we scale it, and we get the third machine. So again, probably the key takeaway from this is that the operational process is the same regardless of the environment. So the key takeaways are: CAPI can be used in many, many different infrastructure environments, not just the cloud environments a lot of people would naturally think of. So virtualized environments, bare metal environments, and some really interesting environments where you want a control-plane-as-a-pod type scenario. It supports different Kubernetes flavors, so you might want just pure upstream with kubeadm, or you might want something a bit more lightweight, so you can use K3s. It allows you to mix and match all of these things. And lastly, this is fully declarative, fully GitOps friendly: perform all of your cluster operations via Git. So yeah, thank you for coming. Thank you for your question.
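On the KubeVirt side, the inspection and scaling steps described above might look like this; the VMI name and file path are placeholders for the sketch.

```shell
# Each CAPI machine is backed by a VirtualMachineInstance running in a
# virt-launcher pod on the host cluster
kubectl get vmi -A
kubectl get pods -A -l kubevirt.io=virt-launcher

# Attach to the guest's serial console to watch cloud-init run kubeadm
# (the VMI name below is a placeholder)
virtctl console kv-cluster-md-0-abcde

# Scaling works exactly as in the Proxmox demo: bump replicas in Git and push
sed -i 's/replicas: 2/replicas: 3/' clusters/kubevirt-cluster.yaml   # placeholder path
git commit -am "scale workers to 3" && git push
```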
Thanks. Thank you. Yeah, so the question was: can you realistically provision a cluster's associated infrastructure, like load balancers, et cetera, with Cluster API currently in a hyperscaler like AWS, as an example? The answer is yes, definitely for AWS. I'd say the caveat is that it will provision the infrastructure in an opinionated way. It will only provision the infrastructure that's required for the cluster and nothing more, and it will provision it in the way that it thinks best. You can slightly tweak it if you want: say you want to use ALBs instead of something else, or you want to add security groups, it does allow you to do that as well. But there are, I guess, boundaries, so if you want full flexibility, then you might need to do something else. But you can also use things like Terraform and Cluster API together; it doesn't have to be either-or. So you might provision the VPC and the network with Terraform and then get Cluster API to do the Kubernetes side and the day-two operations on Kubernetes.
#snapsafety: de-duplicating state across Virtual Machine clones
Hello. Thanks for coming to this talk. My name is Babis Chalios. I am a software engineer with Amazon Web Services, currently working with the team that maintains the Firecracker Virtual Machine Monitor. Today I will be speaking to you about virtual machine snapshots. Essentially, I'm going to be speaking about some challenges we face when we clone virtual machines and then start multiple virtual machines from that same clone, a problem that we call snapshot safety. I'm going to speak a bit about the mechanisms we have today for tackling those issues, and about what we believe we need to do as a community in order to grow awareness of the issue and build systems that are safe in the presence of snapshots. Quick sneak peek at the agenda: we're going to define what a virtual machine snapshot is for us, what is problematic about virtual machine snapshots, and in which scenarios we have problems with them. Then we'll go through the mechanisms we have today for addressing those issues and how we are thinking about building solutions that are system-wide and address the problem. Finally, I'm going to speak a bit about what we're planning to do next in this area. Earlier this morning there was a very nice talk about virtual machine snapshots. It went into much more detail than I'm going to, but let's think for the moment of the virtual machine as a collection of state. That state might be memory, the guest memory, architectural state of the VM. Then you might have some devices for doing networking and storage, et cetera. Then some host resources, like whatever state KVM in Linux is holding for us for the VM, maybe a tap device for the networking, and files that back our storage. For this talk, a snapshot is simply the serialization of this state at a given point in time into a file that we store somewhere on some storage medium.
Then we use that snapshot file in order to start one or more VMs; note that these are exact, identical copies of the initial virtual machine. The morning talk spoke about various scenarios why you might want to do that. For example, you want a backup of your machine so you can go back in time to a previous state, et cetera. Another scenario is if you are building some sort of service that uses VMs to isolate workloads and you want to spawn these VMs very, very fast, in a state where they are ready to handle user requests. You might spawn a VM, bring it into a state where everything is initialized, every service and component that you want in order to get it ready to handle requests, and take a snapshot at that point. Whenever you have a new request in the future, instead of booting a machine from scratch, booting the whole operating system, the user space, blah, blah, blah, you just resume from the snapshot, and then you are much faster in a state where you can handle that request. So what's wrong with that? Now, let's look again at the previous picture of our VM and let's imagine for a second that somewhere in VM memory, and it doesn't have to be memory, it can be any other component of the VM, there is some piece of state, an object, that for the purposes of the application making use of it needs to be unique and/or secret. It needs to have this property in order for the application to operate correctly or securely, et cetera. Now, you see where I'm going with this: once we take that snapshot, that property of this state is lost. Here we're speaking about what sorts of applications have this problem and how we can address it. We are aware even today of many classes of applications that rely on this assumption of some part of the state being unique, secret, et cetera. For example, we can think of cryptographically secure pseudo-random number generators.
Those are random number generators with the property that it is very, very hard, if not impossible, to guess what the next byte they're going to give you is. The security of many applications relies on this property. They have other properties as well, for example that given knowledge of the current state of the PRNG, you cannot guess the previous bytes, et cetera. But for those sorts of random number generators, imagine that once you take the VM snapshot and you start more VMs from it, the state of the PRNG is duplicated. So unless we do something else, unless we add more entropy, for example, to this PRNG in all of the VMs that start from the same snapshot, the next byte given out by that PRNG is going to be exactly the same in all of the VMs. Another example of a use case that has this problem is network configuration. Imagine you have a VM that has some network configuration: IP addresses, MAC addresses, et cetera. Suddenly you snapshot that VM and you create new VMs from that snapshot that live in the same network as your seed VM. Suddenly VMs appear on the network with the exact same network configuration, and depending on your use case that might be a problem. So you might want to be able to detect that this is happening and do something about it. Another class of applications affected by this is anything that uses a UUID, a GUID. Many applications rely on the uniqueness of this number in order to perform correctly. Imagine, for example, that you take a snapshot of an application that has a UUID, you start more VMs out of it, and the application running in the VM is using that number as an index into a database to modify stuff, read stuff. Suddenly you have a race condition on the database: more than one entity is going to be using that same key for accessing data.
Any use case where you rely on this thing being unique is a problem here. And we really do not know exactly all of the applications and use cases that have this problem, so it really depends on the application itself. We really need to go and see whether our applications keep state that has these semantics, the semantics of uniqueness and secrecy. And if you know that you are running some workload that has this problem, and you run in such an environment, let's speak about what sort of mechanisms you could use in order to make this use case safe. Okay, now that we know a bit more about the problem we are facing, let's see what kind of mechanisms we have today to address it. Essentially, the most fundamental mechanism we have today is called virtual machine generation ID, or VMGenID. It operates as a notification mechanism that tells the VM, after it is resumed from a snapshot, about that particular fact. It tells the VM: okay, now you are in a new world; you are not in the world that you thought you were in, without having rebooted. On the technical side, it's an ACPI virtual device, emulated by the monitor, and the way it provides the notification inside the guest is via a generation ID, which is a 16-byte cryptographically random number that changes every time we resume from a snapshot. So when we resume from a snapshot, the monitor makes sure it stores a new value in the generation ID, and before resuming the vCPUs of the VM, it injects an ACPI notification into the system. Once the vCPUs resume, the guest kernel handles that ACPI notification. What happens in Linux today is that the kernel uses the new generation ID as extra entropy for its entropy pool. So it's reseeding its entropy pool, essentially, so that it avoids the problem we were speaking about before with PRNGs. And it works, apparently. It works fine.
There is still a bit of a concern regarding its asynchronicity, in the sense that there is a small race window between the moment we resume the vCPUs and the moment the ACPI notification is handled by whatever thread in the kernel handles it. Okay. Yay. Sorry about that. But at least we have something. Nice. So moving forward, recently we contributed a small change to the Linux kernel so that every time the generation ID changes, we emit a uevent to user space, because before that the VMGenID implementation did not do anything for user space. Since the generation ID was being used as entropy for the kernel PRNG, people were nervous about exporting it to user space. But in reality, user space does not need the 16 bytes themselves; it just needs a small notification. So there you have it. It got merged recently in 6.8, and it is still an asynchronous notification mechanism. So everything in user space that runs an event loop, for example, can monitor for it and get notified about the fact that it is now in a new VM started from a snapshot. It is still racy, that has to be said. So if we think that we have use cases that need a more synchronous mechanism, we should continue doing work to build it. Okay, so going back to PRNGs, mainly because they are used by security-sensitive applications, let's see how these mechanisms can help us. Runtime systems that maintain their own PRNGs, like the JVM, can now use the VMGenID uevent to be notified about snapshots. So upon resume, the runtime would eventually get that event and would reseed the PRNG as soon as possible. Now, for PRNGs that are implemented within libraries, the situation is a bit more awkward at the moment, because an asynchronous mechanism like a uevent is not a perfect fit for the programming model.
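As a rough illustration, on a kernel carrying the change described above (6.8+), the notification should be observable with standard udev tooling; the exact device path and uevent attributes are assumptions here, so check the vmgenid driver in your kernel tree.

```shell
# Watch kernel uevents and filter for the VMGenID device; after a resume
# from a snapshot, a change event for the vmgenid ACPI device should appear.
udevadm monitor --kernel --property | grep -i -B2 -A5 vmgenid
```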
We will need to do something else for them. One idea here would be to use what cryptographers call prediction resistance, with hardware instructions. The idea is simple: with every byte that the PRNG returns to you, you mix in some random bits that you got from a hardware instruction, which is obviously not affected by virtual machine snapshots, so the problem just goes away. If you are able to do that, it doesn't matter whether you have resumed from a snapshot: the state of the PRNG is always going to include these snapshot-independent random bytes, and everything is going to be fine. Other potential solutions, for example in cases where you do not have these instructions or for whatever reason you don't want to use them, would be to build some sort of synchronous API on top of the asynchronous VMGenID event. But we really think that we should do something (don't go out on me again) about the use case of these libraries. Okay, so now that we know what mechanisms we have available, let's see if we can really solve the problem. Let's follow this example. It's a very simple example of a VM that has started from a snapshot. The hypervisor and the guest kernel support VMGenID. The kernel is going to use the generation ID to reseed its random number generator. And we have a user-space application that does some network communication and wants to use TLS. It reads some random bits from /dev/urandom, which is safe because of VMGenID, in order to do its communication. And everything works fine: the application creates the session key to start communicating with the outside world, and everything looks fine. And at that point, we take a snapshot. Now, the moment we resume the second VM from that snapshot, the session key is duplicated in both VMs.
So even though we have these mechanisms built into the system that give safe interfaces over /dev/urandom, for example, the final system is not necessarily safe. The same would go, for example, for applications that have GUIDs, et cetera; they would need to adapt themselves. And it is true that the application could use the VMGenID event, but that event is only present in the resumed VM; in the initial VM, there is not, today, a mechanism to do something about that. And again, there is some race window between the event, the VM resuming, and the application reacting to that event, which makes us think that there are probably things that should never be serialized at all. It would be much easier if that session key had never been serialized. And that makes us think that VMGenID is a post-mortem mechanism: it is a notification in the new VMs, not the initial VM, and by the moment it arrives, sensitive operations might already be in flight, and even if we handle the notification, there is nothing we can do about the things that are in flight. And that makes us think, next, that what we should probably do is control the timing of snapshot events. Instead of snapshot events arriving at arbitrary points in the lifetime of the VM, we should control them. We should do something before we even take the snapshot, and make sure that we only take a snapshot when the machine is in a safe state to be snapshotted. And once we resume, make sure that every application that needs to has adapted to the new situation before marking the system as ready to be operational again. Thinking about these things, some time ago we were speaking with the systemd folks, and we thought about modeling this problem using four states, describing our systems as being in one of four states. Running is the normal state of your VM. Once you want to take a snapshot, you start quiescing.
People earlier today spoke about this as freezing, for example. During that period you do things to prepare yourself to be snapshotted, so that you don't find yourself in the situation described before. And once you are quiesced, once everybody is ready to be snapshotted, then you can take the snapshot. And on the resume-from-snapshot path, you essentially do the opposite work, right? You start from a quiesced state, then you start unquiescing, getting ready for the new world, recreating your GUIDs and whatnot. And once everything is done, you can be up and running again. systemd has this nice concept of inhibitors, which applications can use in order to say: okay, don't do that transition until I tell you I'm ready for you to do so. For example, there are inhibitors for systemctl suspend. At first we were thinking that maybe we could use some paravirtual agent to orchestrate everything; in reality, maybe systemctl suspend is all we need, and we can drive this from the hypervisor by sending an ACPI event. Going back to the previous example, how that would look is: we are in a running state in the VM, we have our previous application, and the control plane informs the PV agent that it needs to start quiescing. Here I say "systemctl quiesce", but again, unless we find a reason why suspend should be different from some new sort of operation, we could even use suspend and get away without having a paravirtual agent in there. In any case, once that happens, the application would say: okay, do not quiesce yet, because I need to do some cleanup before you can snapshot me. And once the application has done that, it says: okay, now I'm good to go. And at that point the control plane knows that it can take the snapshot. On the opposite path, the control plane would resume the VM from a snapshot and then start the unquiescing operation.
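The inhibitor mechanism mentioned above already exists for suspend, so the model can be approximated today with `systemd-inhibit`; a hypothetical `systemctl quiesce` verb does not exist yet, and the `--who`/`--why` strings and the wrapped command below are placeholders.

```shell
# Take a delay lock on sleep: when the hypervisor drives a suspend-style
# quiesce, systemd delays the transition so the wrapped process gets a
# chance to scrub session keys, GUIDs, etc. before the snapshot is taken.
systemd-inhibit --what=sleep --mode=delay \
  --who="my-tls-app" --why="scrub session keys before snapshot" \
  ./my-app   # placeholder for the real workload

# See which inhibitor locks are currently held and by whom
systemd-inhibit --list
```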
The application might want to say: okay, wait until I know that I'm safe again, because I want to create new random numbers. And I do that. And at that point, we are safe, modulo that tiny race condition, to start getting random numbers again, recreate our state, and be where we want to be in order to be up and running. That's it. So yeah, we started working on adding support in Firecracker for VMGenID. Up until now, we were telling people who were using the snapshotting feature that they would need to manually reseed their kernel's PRNG and their user-space PRNGs after the fact. The other thing we want to pay attention to is working with PRNG owners in order to find proper ways to make their libraries snapshot safe. Here we're speaking about the PRNGs that are implemented as libraries, such as OpenSSL, AWS-LC, et cetera. And we want to start building the system we spoke about in systemd, start modeling this in systemd; some groundwork was already done some time ago. We hope that systemd is going to be just the first one that we get this into, and hopefully other management systems will follow. And that's it. With that, I'd be happy to take questions. I just wanted to ask: you mentioned the network issues, where a machine comes up with the same MAC address. You didn't appear to address that. Is there a plan to take care of that situation as well? Yeah. The question was that we mentioned during the presentation that there are problems with networking when you take snapshots and resume, and whether we plan to address those in the future. Yeah. I think this is also part of the systemd work that we're going to do. This problem essentially appears mainly when systems are on networks. If your VMs are not networked, there is no problem if two VMs that are not on the network somehow have the same random numbers.
So yes, for example, something that we would like to do is to shut down networking before taking a snapshot, so you're sure that there are no in-flight connections and things like that. So I think this is going to be part of that work. Thank you. When we have to come up with a MAC address, we generally try to hash things, so we have been discussing this: hashing the generation ID in as well seems like the most obvious thing in the world. So that basically, yeah, once the generation ID changes, everything that gets derived from that hash changes too, and it's not going to go to wherever else. Thank you very much. Thank you very much.
AI-Driven Observability and Operations in Cloud-Edge Systems
Thank you. Hello everyone. Thank you so much for being here. In today's session we are going to do an introduction to AI-driven observability and operations in cloud-edge systems. First, let me introduce myself. My name is Victor Palma. I'm a cloud engineer at OpenNebula Systems. I come from Madrid, Spain, and I've been working for OpenNebula for more than two years. So let's move on to the presentation. First, I would like to start with some initial context in order to introduce what we are going to see here. So first, what is observability? Observability is the ability to understand and analyze the internal behavior of a system by collecting and analyzing relevant data. That is the dictionary definition, but in other words, it's just the ability to transform data into information, into something that could be useful for us. We can have a lot of system logs, data, or numbers, but if we don't give them meaning, they're useless. So observability has multiple... sorry, I just went blank. It has some advantages, like anomaly detection, which allows us to identify anomalies or bugs in our system. It also provides the ability to do performance analysis, so we can identify areas for improvement in our system, and finally it's very useful for decision making, so we can see the impact of the changes we made to our system in a very easy way. As the saying goes, information is power, so observability is very, very important nowadays. But now I would like to talk about AI, because AI nowadays is everywhere, you know, the marketing guys' fault. But is it really useful for observability? The quick answer is yes, definitely. AI provides the capacity for more enhanced data analysis, automated anomaly detection, and dynamic scaling for our cloud. For example, if we have more workloads than usual, we can automatically create new nodes or deploy new VMs in our cloud in order to provide more services to our customers.
And finally, we can create predictive analytics in order to predict how our system is going to behave in the future. After this part, I would also like to talk about data sovereignty and open source, because I think the most important concept around AI currently is data sovereignty. Many organizations entrust sensitive data to third-party providers. Currently most of these providers are based outside of Europe, and we want to bring that data back to servers in Europe and be more transparent about it. Open source is a very good solution to that problem: it provides more transparency for the cloud and helps reduce vendor lock-in in our infrastructure, so we are not tied to a specific vendor and we can migrate between vendors whenever we need to. So what's next? How can we address all these challenges? The answer is the OneAIOps framework, an open source solution for AI-driven observability. The OneAIOps framework combines OpenNebula as the virtualization and cloud management tool, Prometheus and Grafana as the metrics and visualization solution, and some AI and ML algorithms to predict and analyze the behavior of the infrastructure. These three technologies together create the OneAIOps framework that we are going to see here today. So let's go step by step, and first we are going to see what OpenNebula is. OpenNebula is an open source cloud platform for building your own cloud. It lets you deploy virtual machines in your own private data center, in a public cloud, or even on the edge. And not only virtual machines: you can also deploy application containers, micro-VMs, or even Kubernetes clusters. 
As I've said before, since OpenNebula is open source and oriented toward providing true flexibility for the cloud, one of its features is that it avoids vendor lock-in, so you can migrate your workloads between different providers in a very easy way. OpenNebula has a lot of integrations with third-party tools like Terraform, Kubernetes, Ansible, or Docker. It also has built-in tools like Sunstone, the web user interface, from which you can manage all of your infrastructure, or you can use the CLI, and you can deploy virtual machines based on VMware, KVM, LXD, or micro-VMs with Firecracker. Finally, one of the most important features of OpenNebula is the possibility to expand your cloud to a multi-cloud or hybrid cloud. You can create on-demand resources on the edge, in Amazon, Google Cloud, or Equinix, just by clicking a button, or automatically if you configure it that way. So you can move workloads from your on-premises data center to an edge data center or to the public cloud in a very straightforward way. You deploy any infrastructure with uniform management, and you can run any application in your cloud. For OpenNebula it doesn't matter if the host is located in Equinix, on the edge, or in your private data center. The only things OpenNebula cares about are which VM is running the workload and how you can access that VM. Very handy. The next piece of OneAIOps is the integration of OpenNebula with Prometheus. That integration is based on Prometheus exporters, like the Prometheus Node Exporter, which is installed on every OpenNebula server as well as on the hypervisor nodes, combined with the OpenNebula libvirt exporter, a custom exporter created by OpenNebula in order to extract and collect information about the KVM machines. We also combine this information with OpenNebula's own metrics. 
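The exporter setup described above can be sketched as a Prometheus scrape configuration. The job names, hostnames, and ports below are illustrative assumptions for the sketch, not the actual OneAIOps defaults:

```yaml
# Hypothetical prometheus.yml fragment; targets and ports are assumptions
# for illustration, not the real OneAIOps configuration.
scrape_configs:
  - job_name: "node"          # Node Exporter on the front-end and hypervisor hosts
    static_configs:
      - targets: ["frontend:9100", "kvm-host1:9100", "kvm-host2:9100"]
  - job_name: "libvirt"       # OpenNebula's custom libvirt exporter (per-VM KVM metrics)
    static_configs:
      - targets: ["kvm-host1:9926", "kvm-host2:9926"]
  - job_name: "opennebula"    # metrics gathered by oned, exposed by the server exporter
    static_configs:
      - targets: ["frontend:9925"]
```

The point of the three jobs is simply that host-level, VM-level, and cloud-level metrics all land in the same Prometheus server, where the AI layer can consume them.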
These metrics are gathered by oned, the main daemon of OpenNebula, and then exported to the Prometheus server through the OpenNebula server exporter. The next thing is the AI that we add to the formula. We created a set of machine learning algorithms and some decision algorithms and implemented them as an exporter for Prometheus. Gathering all the metrics that OpenNebula and the exporters produce, these algorithms predict and suggest how to improve the performance of your cloud. So, in summary, the features and capabilities of OneAIOps are: CPU usage prediction for the VMs of your cloud (OneAIOps predicts the individual VM CPU usage per hour and the overall CPU usage for each host), the accuracy of that prediction (a very important value when judging how trustworthy the system is), and VM placement suggestions. Based on the prediction, OneAIOps may tell you to migrate a VM from one server to another in order to improve performance. There are three main policies you can configure in OneAIOps. The first one is load balancing: as the name says, it balances the load across all your nodes, which is very useful when you have an on-premises or private data center and you want to use all your hosts. The next policy is resource contention, which is very useful for public cloud environments where you want to use a small number of hosts. And the last one is reduced migration, a policy that is very useful when you want to avoid migrating VMs between hosts. That scenario is typical of edge environments, where migrating virtual machines between edge nodes is sometimes quite costly. So here you can see the architecture of OneAIOps. It's based on the already existing OpenNebula architecture. 
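To make the prediction-plus-placement idea concrete, here is a minimal sketch in Python: fit a linear trend to recent hourly CPU samples, extrapolate one hour ahead, and flag overloaded hosts for migration. This illustrates the concept only; the talk mentions the project actually uses linear and Bayesian models, and this is not its real algorithm.

```python
# Toy version of per-VM CPU forecasting and placement suggestion.
# Assumption: CPU usage is sampled hourly as a percentage.

def predict_next_hour(samples):
    """Least-squares linear fit over hourly CPU% samples; forecast next hour."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var if var else 0.0
    intercept = mean_y - slope * mean_x
    return intercept + slope * n  # extrapolate one step ahead

def placement_suggestions(host_load, threshold=80.0):
    """Suggest migrating VMs away from hosts whose predicted load exceeds a threshold."""
    return [host for host, load in host_load.items() if load > threshold]

vm_cpu = [20.0, 25.0, 30.0, 35.0]        # hourly CPU% history for one VM
print(predict_next_hour(vm_cpu))          # the linear trend predicts 40.0
print(placement_suggestions({"host0": 92.0, "host1": 41.0}))  # ['host0']
```

A real system would also report prediction accuracy (the 92% figure shown later in the demo) by comparing past forecasts against observed usage.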
Everything at the bottom is the existing OpenNebula architecture, and the layer at the top of the picture is the new OneAIOps layer. Here you can see the modules we have implemented in order to provide the predictions and orchestrate the virtualized infrastructure. So let's do a demo to show you how this works. First we are going to go to the OpenNebula Sunstone portal. Wait, sorry. Thank you. I'm so sorry, give me a minute. Okay. Well, this is the main dashboard of OpenNebula's graphical user interface. Here you can see the main information about your cloud, like how many machines we have, the images, the virtual networks, and the usage of the hosts. This is a demo environment, so we only have two hosts with some workload, and these are the VMs we have running in this environment. It's useful to see that we already have some workload here, and this workload is fully random, so it consumes a varying amount of CPU depending on the time. When we install the OneAIOps framework (we have documentation for that) and import the OneAIOps dashboard into Grafana, we can see this. These are the results that OneAIOps generates. On the left we can see the average CPU predicted per host; here, the average CPU usage per VM; and here, the real usage. As you can see, one value closely tracks the other, and the accuracy of the prediction in this case is 92%. Here we can see the suggestions that OneAIOps provides to the user. The first policy is the core optimization policy, which tries to reduce the number of hosts to the minimum. You can see that of the five VMs we have in this demo, four are on one host. 
Since that host is full and no more VMs fit inside it, we have one more VM over here, but the policy tries to concentrate the VMs on as few hosts as possible. Here you can see the migrations that OneAIOps suggests to achieve this distribution: it suggests moving the VM with ID 3 to the host with ID 1. It's very easy to follow the instructions. Then we have the other policies: load balancing optimization, where as you can see the VMs are distributed across the two hosts we have in our environment, and the final policy, reduced migration optimization. In this case no migrations are suggested because no optimization was found, but if OneAIOps produced something, it would show up here. Returning to the slides: well, that's the demo we have just seen. Some closing thoughts on the next steps of this project. The next steps and challenges we are facing in OneAIOps: first, implementing the virtualization operations so the suggestions are applied automatically. Currently the operations are only suggested, not performed on the cloud system. Then we would like to ship OneAIOps as part of the OpenNebula software distribution; right now you need to install it separately. And finally, we would like to expand the functionality to provide anomaly detection and allocation based on memory prediction and network traffic, because currently we only provide CPU usage prediction, and, based on the results of the tool, to create alerts and warnings. This project is totally open source, so you can go to the repository on GitHub and collaborate and suggest new features and changes. And finally, I would like to encourage you to join the OpenNebula community. 
You can visit the forum and participate in discussions with other OpenNebula users, and learn and help each other in the cloud community we have created there. As a closing slide, I would also like to say that this project is funded by the European Union as part of the Horizon Europe research and innovation program. The project is called COGNIT. I can spell it if you want: C-O-G-N-I-T. I recommend you take a look at this URL; it's a very interesting project. Well, that's all. Thank you very much for your attention. So, questions? Yeah. Yeah, we use linear models and also Bayesian models. Would we be able to share the slides on the FOSDEM website? Yes, and you can also find them in the repository; all the models and algorithms applied are explained there. Okay, thank you. You're welcome. Any more questions? Yeah. Basically the same question as the last one: could you quickly go back a couple of slides to where you showed the model? Here. So, does it explain where the Bayesian models fit in? It's not on that slide; that was the bit I was wondering about, how the model works. Okay, thank you. That's a question for our side. Okay. Thank you. Perfect. Any more questions? You're welcome. Ah, it's here. Are you optimizing for CPU utilization only, or is it also possible to say, okay, optimize for availability or network throughput? Okay, he asked whether we optimize for CPU or for other attributes like network or memory. Currently, in the current state of the project, we only make suggestions and predict CPU usage. 
But the idea is to implement prediction based on the network, on memory, and on other custom attributes that you want to add to the tool. The idea is that you can change the prediction in the configuration. But really, it's just a prototype at this point. For storage? For storage prediction, yeah, we are also considering that. Sometimes optimality changes based on the cloud service provider; do you also consider this, or is it only based on regular hardware? What is your optimization? Yeah, we are considering that too. I mean, the way to optimize depends on the location of the VM. Having a VM in your on-premises cloud is not the same as having it in the public cloud or on the edge, so it's based on different policies that we are currently defining. Yeah, we are considering that. Any more questions? Okay, thank you so much.
Bare-Metal Networking For Everyone
Okay, hello everyone. My name is Mateusz. I work at Red Hat as a principal software engineer on the Kubernetes bare metal networking team. So yeah, as the title of the talk says, we'll be talking about bare metal networking, and I wanted this talk to be a gentle intro into what you need to think about when you want to start doing Kubernetes on bare metal: the things that Kubernetes doesn't tell you you should care about. We'll see in a moment what that means. I work at Red Hat, I already said this. I'm based in Switzerland. When I'm not doing computers, I'm doing farming. I actually like it much, much better, but it doesn't pay the bills, so I need to do the stuff that I'm going to tell you about here. Well, it is what it is. I don't do AI, as opposed to, you know, all the hype and all that kind of stuff, so yeah, I'm not really on the hype wave. Bare metal was never really hyped, so, well, what can I say? Some intro on why we might even think about doing containers on bare metal. Like, no one ever told us to do so, so what the heck is the deal? So: HPC and AI. This slide predates the AI hype, so sorry for this. I could remove it, but long story short, there are some workloads that really benefit from running on bare metal. You may have some fancy GPU from, let's not name the company, or some network adapter, something where you really want direct access to the hardware. Or, at the other end of the scale, something you run that is critical to the infrastructure you already have, like network equipment. You don't want to run the router of your own data center as an instance in AWS, right? We shouldn't do it that way. Or something which is almost forgotten, and then people call me with this use case: benchmarking. How do you benchmark hardware, CPUs, and that kind of stuff, if not by running the workload directly on the hardware? 
Again, you don't want to create 50 VMs on some CPU only to get a benchmark of that CPU's performance. That would be chicken and egg; let's not do that. So now fast forward. We agree that we want to do Kubernetes, and we agree that we want to do it on bare metal. So we go to kubernetes.io, or whatever it is today, we go to the docs on installing a cluster, and we start reading: what do I need to do to install a cluster? Is there any tooling that would help me install it? And the very first page you see is "Installing Kubernetes with deployment tools", and it lists kubeadm and some other tools. And we're like, oh, how lucky: there are tools that are going to do this stuff for us. Okay, let's check the first one. You go to kubeadm and start reading: "Using kubeadm, you can create a minimum viable Kubernetes cluster." Okay, so is an MVP really the production cluster that I'm going to run? Well, probably not. Let's skip that tool. The second one: we look into kops. We go to the kops website and do the same. Installing Kubernetes, getting started, and we start reading: deploying to AWS, to GCP, DigitalOcean, yada yada yada. None of them is deploying to bare metal. Thank you very much, end of the story. Let's check the last one; maybe that's our chance. So we go to Kubespray. It's a set of Ansible playbooks. Another story, you know, but okay, someone gives us some method to deploy Kubernetes on bare metal. So we go to "run Kubespray playbooks": with the bare metal infrastructure deployed, Kubespray can now install Kubernetes and set up the cluster. And you start reading those playbooks and you feel like, oh, this is so opinionated. So either I build my data center the way they want me to build it, or thank you very much, there is no tool. So let's agree that none of these three methods is for us. We need to do this stuff ourselves. 
So let's build the stuff brick by brick, from the beginning. What do we need to care about in a cluster, not only during installation but in general, to have this cluster bootstrapped and then working? First of all: this is bare metal, and in the end you deploy this cluster because there will be some workload, right? You want to access this workload, and you also want to access the API. Basic operations. You don't deploy the cluster for the sake of deploying it and consuming energy. Then, of course, DNS infrastructure. You are deploying this in your data center, and then what are you going to give your customers? "Type this IP address slash something-something to look at this fancy website or application that we deployed"? No, you want to have some very nice domain, but for that, again, you need DNS infrastructure. It doesn't come for free. The next step: we agreed that we are doing bare metal because we have some reason to, and it's not just that we don't like a simple VM from AWS. Which means there will be some non-standard network configuration. It doesn't really matter whether it's fancy or not; it will be something more than just "plug in the server and turn it on", because in most cases people doing bare metal don't have DHCP in all the networks, or they need some storage network, and it all requires fine tuning that doesn't come by default when you boot your Linux distro, plus some other dirty tricks that I'm going to tell you about later, because they're Kubernetes-specific and I want to build my way up to this. So: the cluster load balancer, because I told you that you need the API and ingress to your workload and all that. The slide is overly complicated for two reasons. The first reason is that it is complicated. The other reason is that no one ever cared to make it less complicated. 
I know it sounds bad, but it is what it is. The thing I want to tell you is that we are at the point in the story of building a cluster, installing it from scratch, where we are bootstrapping from somewhere. You may be running those kubeadm create-cluster commands from this laptop, right? So this laptop will be your initial bootstrapping infrastructure. On the other hand, at the other side of this room, I have those three servers that are going to be masters. So this somehow has to all tie together. I need to have some IP address that will be the API endpoint once I spawn all those nodes in the cluster. So I need a virtual IP pointing toward this API; this is what I'm calling the API VIP. It sounds complex, but in the end it boils down to one sentence: when you run kubectl commands, you need to target some IP address. If you are deploying bare metal infrastructure, you never want to target a specific node, because if that node goes down, all your tooling goes down. So you want a virtual IP, and you may have some load balancer appliance from a well-known company, or you may want to just do it yourself with keepalived. I will show this in a second. And what is the other part of this slide? At some point we have deployed the control plane nodes and the worker nodes, and the API address should now point only to the control plane nodes, not to your bootstrap node; this laptop goes away from the story. But then you need some other IP address, because you are deploying workloads. You are not only an admin now. You really have something that runs, your applications, and you don't want to expose your control plane to anyone, right? Or do you? Well, rather not. So you need another IP, and it's exactly the same story. Where do you take all those IPs from, and who manages them? Yeah, you manage them. 
So what do you do for this? Of course, I'm telling you about a very opinionated way of designing a Kubernetes cluster installation, and it's opinionated because we decided: let's do keepalived in combination with HAProxy. I told you the story of why we need the VIP, so you should already be convinced; and if we need that, then keepalived, because it's very simple and proven in action. Why do we also put HAProxy into this story? Here it will be a fast-forward through some specific use cases and requirements that we got. The thing to remember is that it won't always be the same stack for API and ingress, because as an administrator of the cluster I usually have different requirements than the user, so: different tools, different purposes. It's very easy to simply deploy keepalived and tell it, you know, pick this 1.2.3.4 IP and put it somewhere in the pool of these servers, right? But Kubernetes is about being highly available. So what happens if one node goes down? Well, the IP address should float to some other node that works, right? But what does it mean, from the network perspective, that an IP address floats? What's going to happen to the connections you have open to this IP address? We start asking ourselves these kinds of questions, because we now have three servers in the control plane, kube-apiserver runs on all three of them, we kill one kube-apiserver, and, unlucky us, it was the one holding the IP through which we access the cluster. What happens now? No access to the cluster. So either we wait for keepalived to move the IP address and for ARP tables to propagate and all that, or, and this is what we decided, we put HAProxy between kube-apiserver and keepalived. Because, and this is something the Kubernetes people will want to kill me for, HAProxy is much more stable than the Kubernetes API. That's it. 
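The keepalived half of the setup described above can be sketched as a VRRP instance that floats the API VIP across the control plane nodes. Interface name, VIP, router ID, and priorities below are illustrative assumptions, not the speaker's actual configuration:

```conf
# Hypothetical /etc/keepalived/keepalived.conf fragment for the API VIP.
# All concrete values here are assumptions for illustration.
vrrp_instance API_VIP {
    state BACKUP              # every node starts as BACKUP; VRRP elects a master
    interface eno1
    virtual_router_id 51
    priority 100
    advert_int 1
    nopreempt                 # avoid flapping the VIP back after a node recovers
    virtual_ipaddress {
        192.0.2.10/24         # the VIP that kubectl and clients target
    }
}
```

The same pattern is repeated with a second VRRP instance for the ingress VIP, since the talk stresses that the API and ingress stacks can differ.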
If you look at the statistics, kube-apiserver fails much, much more often than HAProxy, so this is our way to keep things up. As simple as it sounds, the problem I want to solve is that when kube-apiserver dies, I don't want the IP address to float, because propagating ARP tables and expiring the caches takes too long and I simply don't want to wait for that. So I put HAProxy there. The only thing to remember, if you really take this path, is that you need to fine-tune the health checks, because the worst thing you can do is let keepalived notice an outage faster than HAProxy (HAProxy also balances the traffic, right?). The desired order of events is: kube-apiserver dies, which shouldn't be happening, but it happens; HAProxy notices that; and end of story. Keepalived should never notice it. Of course we can go deeper: what happens if HAProxy dies? Well, this is now a game of statistics. Has it ever happened for us that kube-apiserver and HAProxy died at the same time? It never has, unless you go to the server and just pull it out of the rack. So this is a corner case that we don't cover, but it doesn't really happen in the wild. Of course, there are some limitations, because you can have the IP address on only a single node; this is a disadvantage versus an appliance. The biggest problem here is that you need to have all this stuff in one single L2 segment, that is, in one broadcast domain, because keepalived doesn't work across subnets. We have some ways to fix that, by grouping nodes into different L2 segments and then having separate keepalived instances in those L2s. But still, this is a pain point and something you should design really well on paper if you start doing this. But enough about load balancers, because we could talk for ages about this. 
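The HAProxy half, with the health-check ordering the talk insists on, might look like the sketch below. Server names, addresses, ports, and timings are assumptions; the key point is only that HAProxy's checks detect a dead kube-apiserver within seconds, well before keepalived would ever consider moving the VIP:

```conf
# Hypothetical haproxy.cfg fragment. The VIP holder forwards a front-end
# port to all kube-apiservers, so an apiserver crash is absorbed by HAProxy
# without the VIP floating. All names, ports and timings are illustrative;
# the front-end port is deliberately not 6443 here, assuming HAProxy runs
# on the masters alongside apiservers that already bind 6443.
frontend kube_api
    bind *:8443
    mode tcp
    default_backend kube_apiservers

backend kube_apiservers
    mode tcp
    balance roundrobin
    # fast checks: HAProxy must notice an apiserver outage before keepalived
    # would notice anything wrong with the node
    default-server check inter 2s fall 2 rise 3
    server master0 10.0.0.10:6443
    server master1 10.0.0.11:6443
    server master2 10.0.0.12:6443
```

The design choice this encodes is the one from the talk: keepalived's health criteria should track HAProxy itself, not the apiservers behind it.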
DNS, because we said that we want to do this DNS mumbo jumbo and not use only IP addresses. Of course, you are the administrator, you manage the infra, and you could say: but we have this DNS infrastructure over there. Maybe AWS, maybe Cloudflare, maybe something else, so we can just create records there. But then, you know, either you trust the user or you don't. And we don't. So another opinionated thing in our way of installing Kubernetes is that we spawn a very minimal CoreDNS setup, which provides the DNS resolution you want to all the nodes of the cluster and all the pods running in it. So when you start the installation claiming that you will have the API running on api.example.com, I don't care whether you already created this record on the external DNS. I will just spawn a static pod running CoreDNS and create those records myself, so whatever runs in this cluster will have them. This, again, protects me, because what happens if we decouple this? You have your external DNS, like most people. How do you want your cluster to behave when that DNS infrastructure goes down? You have your data center and everything is okay; in some other data center you have DNS, and that DNS is out. Do you want your cluster to start dying because pods want to talk to each other and cannot resolve DNS? It should all be self-contained, right? You don't want those external dependencies. So yeah, this is something that we are doing. And a part I will skip is that NetworkManager requires some tuning, because, for people who know how containers are spawned, when you start a container, a copy of /etc/resolv.conf is taken at the moment of starting the container and plugged into it. Meaning that if you change the DNS configuration of your host, it will not be propagated to the container unless you restart the container. 
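The self-contained CoreDNS idea described above could be sketched as a Corefile like the one below. The domain, record names, and IPs are illustrative assumptions; the point is just that the cluster's own names resolve locally even when external DNS is down:

```conf
# Hypothetical Corefile for the minimal CoreDNS static pod.
# Names and addresses are assumptions for illustration.
example.com:53 {
    hosts {
        192.0.2.10 api.example.com          # the API VIP
        192.0.2.11 ingress.example.com      # the ingress VIP
        fallthrough
    }
    forward . /etc/resolv.conf              # anything else under the zone goes upstream
    cache 30
}
. {
    forward . /etc/resolv.conf              # all other queries use the host resolvers
    cache 30
}
```

Running this as a static pod on every node means name resolution for cluster-critical records never depends on an external DNS provider being reachable.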
So yeah, for this reason we are also hacking this file around so that it really updates on the fly, but I don't want to go into that. Something a bit more interesting, because we are now getting into Kubernetes APIs and how to extend this stuff: network configuration of the host. This is a static configuration file for NetworkManager, and you've probably seen this, and probably made mistakes in this file more than once. The problem I want to state here is that this is a static file. You go, you modify it, nothing happens. You may notice a mistake in this file five years later, because for five years you haven't rebooted your server, and we don't want that scenario in the Kubernetes world. When you define some configuration, it should either apply immediately or fail immediately. Having to do manual modifications of a file breaks this contract, and another problem is that it simply doesn't scale. If you have 300 servers in your bare metal cluster, you are not making those changes manually. Simply not. You have CRDs, and this is what should be happening. Here is a very simple example: I make a modification, I mistype a slash as a backslash; that gets detected, and that's easy. But if I configure the default gateway with an IP address from outside my subnet, which is utterly wrong, nothing in NetworkManager will prevent me from applying this configuration. I simply don't want that. We have a CRD defined that creates the host configuration from the API. It may sound like chicken and egg, but it's all a matter of how you order things. We define a Kubernetes CRD that defines how you configure NetworkManager on the host. You can do it per node, all that kind of stuff. I will show you very quickly how that works. That's the one: I have this node, which has this IP address on the last interface, and what I want now is for this to be different. I want to change that. 
I want to change it from Kubernetes, in a declarative way, so that whenever someone modifies it, the change gets reverted. I just created a YAML that configures an IP address on some interface. As simple as that, and I will apply it, with the hope that it works as expected. At the top we can see that this CRD is now progressing with the configuration. In fact, it was as simple as that, and we can see that the old IP was removed. For a moment I was wondering who was going to ask: but you already had an IP in this subnet configured, so what's going to happen? Well, that configuration wouldn't fly, because you should not have two IPs from the same subnet on the same interface. That was a short demo. At the same time, it's a Kubernetes API, so it should protect us from doing stupid things. I will try to configure a stupid DNS server that has no way of existing, because it's on the link-local IPv6 subnet. If I try to apply that, something should protect me from doing it, because it would actually break the configuration. Let's look at our configuration right now: we have 1.1.1.1 as the DNS server. Let's apply this manifest, which configures the wrong DNS. The change has been applied. It's wrong. At this moment your cluster starts to misbehave, your probes go down, and so on. Let's give it around 10 to 15 seconds, and this configuration should get reverted, because there is a controller that checks whether your modifications to the network configuration on the host, after being applied, break something that should be working. In this scenario we see "degraded: failed to configure". It failed because this DNS server doesn't exist in reality. That was just a short demo of how we handle all that. It's a bunch of self-contained elements that, once you start using them all together, give you a very nice Kubernetes installer that does it all for you. Sometimes in an opinionated way, sometimes less. 
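The declarative host-network CRD shown in the demo matches the shape of kubernetes-nmstate's NodeNetworkConfigurationPolicy, which this Red Hat team maintains. A sketch along those lines follows; the node name, interface, and address are hypothetical, and the exact manifest used in the demo was not shown:

```yaml
# Hypothetical NodeNetworkConfigurationPolicy: assign a static IPv4 address
# to one interface on one node, declaratively. Node, interface and address
# are assumptions for illustration.
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: eth1-static-ip
spec:
  nodeSelector:
    kubernetes.io/hostname: node01    # apply only on this node
  desiredState:
    interfaces:
      - name: eth1
        type: ethernet
        state: up
        ipv4:
          enabled: true
          dhcp: false
          address:
            - ip: 192.0.2.20
              prefix-length: 24
```

Because a controller continuously reconciles desired state against the host and rolls back configurations that break connectivity, this gives exactly the "apply immediately or fail immediately" contract the talk asks for.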
Now, I told you there would be some dirty tricks. In the kubelet there is a concept of node IP, and now we are moving into the Linux world. When you want your application in the Linux world to run and interact with the network, it has to bind somewhere. This "somewhere" is an IP address and a port. Let's forget the port; we are talking about the IP address. If you have multiple network interfaces, where should Kubernetes listen? Everywhere? On one IP address? On two IP addresses? If you have 10 interfaces, what do we do? I'd say Kubernetes upstream doesn't solve this in a very smart way, because it was designed to run on clouds with only one network interface. As we started expanding, that's not something we still want, so we developed some additional logic to handle it; I will skip the details. In general, it's one more problem to think about: when you configure the kubelet manually, you need to think about what those IP addresses should be. This configuration is complicated, because you can say "bind everywhere", or bind to one IP address, or you can say bind to "IPv4", literally as a string, and what happens then? You can get even stranger syntax: "IPv6" as a string, a comma, and then an IPv4 address. You need to understand how all of this behaves and pick your choice. It's complex, and you may get really confused once you start. We have a set of rules; I will skip them, you can go back to this later. In general, there are corner cases. I just showed you an example in which you shouldn't have multiple IP addresses in one subnet. But what if you do? Some people do this for a reason, and how do you want the kubelet to behave then? There's also one example I have that is just mind-blowing; it killed me for about two weeks. Is your IPv6 address really an IPv6 address? Okay, this slide I'll skip. I got to this RFC which describes IPv4-compatible IPv6 addresses, and I was like, what the heck is that? Let's go to all the libraries in all the known programming languages. 
Every one of them has a function: is this IP address an IPv6 address? You go to the implementation, and what does the implementation look like? If the string contains a colon, return true. Thank you very much, game over. It's as simple as that. Really, for the last 30 years of my life I thought it was that simple, but it's not. Let's take this: colon, colon, four times "f", then a colon, and then we put an IP address with dots. It is a correct address. There is an RFC for this address. It may look stupid, but it's a well-defined address, and, you know, it breaks things. Try opening a netcat socket to listen on this address. It will not work, because half of the tools now think this is an IPv6 address and half of the tools think this is an IPv4 address. I ran strace on that, and what I realized is that, based on this address, it was trying to open a socket on a plain IPv4 address. At this moment, how should we treat that? This is a real-world scenario: I got it from a customer who was trying to install Kubernetes and wanted to use this subnet. I was like, what is that? Then we dug deeper and realized that this is a monster. It should never have existed, but apparently it exists. If you find a set of parameters that you pass to netcat and it crashes, then something went wrong. So, in the end: choose wisely what you want to do, and once you design your infrastructure, really double-check it with someone out there in the upstream community. Is it really how you should be doing things? Because in a lot of cases you realize that something misbehaves, and, yeah, one more thing: you think everything is okay, then you start to get told, oh, sorry, but with this cloud provider you cannot use this syntax, and then you realize: I wanted to do all that, but I cannot, because you tell me I cannot. And you realize it only at the end of the story, once you've spent two weeks on designing. So, that's it.
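The ambiguity the speaker hit can be reproduced with Python's standard ipaddress module: an address like ::ffff:192.0.2.1 is syntactically IPv6, yet it embeds an IPv4 address, so the naive colon-counting check disagrees with tools that unwrap the mapped form. (The documentation address 192.0.2.1 here is my own example, not the customer's subnet.)

```python
# Demonstrating the "is it really IPv6?" trap from the talk using Python's
# standard library. ::ffff:a.b.c.d is an IPv4-mapped IPv6 address.
import ipaddress

addr = ipaddress.ip_address("::ffff:192.0.2.1")
print(addr.version)        # 6: the literal parses as IPv6
print(addr.ipv4_mapped)    # 192.0.2.1: but it wraps an IPv4 address

def naive_is_ipv6(s: str) -> bool:
    # the colon heuristic many libraries use, as criticized in the talk
    return ":" in s

print(naive_is_ipv6("::ffff:192.0.2.1"))  # True, yet some tools will open
                                          # a plain IPv4 socket for it
```

This is exactly why half the tools treat such an address as IPv6 and the other half as IPv4: the answer depends on whether the code inspects the string or the unwrapped address.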
Instant Ramen: Quick and easy multi-cluster Kubernetes development on your laptop
Okay. All right. We are ready to go. Thanks, everybody, thanks for sticking around till the end today, and a special shout-out to those of you on the live stream as well. My name is Adam Litke and this is Nir Soffer, and today we're going to get our money's worth out of this laptop. Something is not right here, the screen keeps flipping on and off. I'll do my best here. So, we've come a long way since Linus introduced Linux to the world back in 1991. What started on his personal computer is deployed pretty much everywhere these days, in increasingly complex scenarios. Take Kubernetes, for example, everyone's favorite clustered container orchestrator, which runs open source up and down the entire stack, from the cluster nodes to the kubelet and to the container runtime itself. And developers haven't stopped building on top of it. KubeVirt is a Kubernetes add-on that allows you to seamlessly run your virtual machines in Kubernetes. And since the VMs are running in pods, like any other workload on Kubernetes, they integrate really well with whatever else is deployed there, be it your applications, storage, networking, monitoring, et cetera. And as people continue to deploy Kubernetes and KubeVirt to host their mission-critical workloads, naturally they wonder what will happen when disaster strikes. Disaster recovery software exists to protect your data and return you to normal operations as quickly as possible, and this is typically achieved using redundancy: data is replicated from primary environments to secondary environments, and applications, including virtual machines, can be started on the secondary environment at a moment's notice, should that be required. In this particular scenario, we have a primary data center DR1 in the west, a secondary data center DR2 in the east, and a hub cluster located somewhere in between.
Now, we prefer to run our applications on the primary environment because it's closer to where our customers are, but thanks to continuous data replication we can rest easy knowing we can start the application up on DR2 when required. So Ramen DR is software that enables disaster recovery for multi-cluster Kubernetes environments. It does this by working with the storage to enable data replication according to a DR policy set by the administrator, and it talks with Open Cluster Management to manage application placement, failover, and relocation flows. Today we're going to simulate this disaster for you. We're going to start by disabling the primary environment; we can then fail over our virtual machine to the secondary environment. And I just want to note here that failover is different from live migration, because live migration would require both environments to be up, and in this specific scenario we obviously don't have access to DR1. So failover is going to take a couple of minutes, but we can be confident that the app can start back up on the secondary environment with minimal data loss. So I've been introducing a bunch of different components here, quite the menu of open source ingredients. KubeVirt is an operator-managed Kubernetes add-on which packages libvirt and QEMU into a container image, allowing you to run your virtual machines inside of a pod. It also comes with other utilities to help you manage your virtual machine storage and networking requirements. Rook is software that manages and integrates Ceph storage into the Kubernetes platform. Open Cluster Management stitches together multiple Kubernetes clusters and provides application management, placement, and scheduling. And then Ramen DR adds the DR flows on top of Open Cluster Management. So when we're considering a realistic multi-cluster DR environment, it's a beautiful thing, kind of like this bowl of ramen here to tempt you at dinner time.
However, it's also complicated and expensive to operate, especially when we consider the single-developer use case. So the question we're trying to answer here is: how can we enable development on this open source software stack without huge cloud budgets? And our answer is to scale the environment down so that it can run inside the kind of laptop most of us are carrying around with us each day. And Nir has prepared a live demo, right on the laptop you're looking at, that's going to show all this stuff working together, and we're going to simulate that disaster for you. So take it away. Yep, and I'm going to mute this so we don't annoy the live stream people. Okay. So this is our stack, right? Three clusters: we have two identical clusters, everything is Ramen, and we are going to put it inside this laptop to show that we can do it, because laptops are small and cheap. So what we want to do today is take three data centers, with Ramen and KubeVirt and a lot of other components, spread across a large part of Europe, and stuff everything inside this laptop. Now, a note about this environment: the clusters are all in the same laptop, but they simulate remote clusters in different regions, and each cluster is standalone with its own storage. So how can we prepare this laptop for the demo? I have a pack of instant Ramen DR, which is very easy to use. You run one command, drenv start, with the environment of your flavor, in this case a KubeVirt environment, and then you let it cook for 10 minutes until everything is ready. We are not going to wait 10 minutes now, because that would be a little boring; I prepared the environment before the talk and we'll just use it. So, what do we need? We need a Git repo, because we're going to use GitOps; we will give OCM a Git repo to pull the VM resources from and deploy the application. We use Adam's repo, ocm-kubevirt-samples, and I tweaked it to customize the VM with an SSH public key.
So let's jump into the demo and increase the font size a bit. I'm using a little tool to spare you my typing errors and make everything more colorful. First, look at what we have in this laptop. We have three clusters. DR1 is the primary cluster, where we run the VM. DR2 is the secondary cluster, in case something bad happens to DR1, and something bad will happen, don't tell anyone. And the hub is orchestrating everything and controlling the other clusters. Each of these is a libvirt VM inside the laptop. So let's look inside one of the clusters. We can use kubectl, just normal kubectl, and we see a lot of stuff that drenv installed for us. The most important parts for the demo are the KubeVirt components, which will run the VM, and CDI, which will provision the VM disk from a container image. Of course, there is storage: we have a complete Rook Ceph system inside, using the local disk of the cluster, and this provides storage for the VM disk and volume replication between the clusters. To protect the VM, we have the Ramen DR cluster operator, which orchestrates the DR flows. And finally, we need the Open Cluster Management components that let Ramen control the clusters, because Ramen extends Open Cluster Management and depends on it. So let's look inside the Git repo. I'm running this inside a clone of the Git repo. The important parts in this repo for the demo are the vm-standalone-pvc directory, which is the VM optimized for this environment; subscription, which is the OCM resources for the VM; and dr, which is the Ramen DR resources. So let's look first at the VM. We'll have a quick look at what's there; we will not look inside the YAMLs, you can check the Git repo later. We have a VM configuration. This VM is using a PVC, because we are using a PVC-based VM, so we have this PVC here, and we need to provision the PVC somehow, so we have the source YAML, which tells CDI how to provision the PVC disk.
So we could apply this kustomization to cluster DR1, and this would start the VM on cluster DR1, but we are not going to do it, because nobody would protect this VM. It would be just like a pod that you start, and when it goes down nobody restarts it. So what we want to do is create an OCM application. We will use a subscription-based application. These resources tell OCM how to create the application and how to manage it: which cluster set to use, where the Git repo is, the namespace the VM is running in, where to place the VM, and the subscription ties everything together. So, to start the VM, we apply this kustomization to the hub. Everything is done on the hub, and then OCM, and Ramen later, will do the right thing. At this point, OCM is starting the VM on cluster DR1 and we can watch it, using kubectl to get the VM, VMI, pod, and PVC. We can see that the PVC is already bound, the virt-launcher pod is running, and we have an IP address, so our VM is running. But let's inspect the VM a little bit more to see where our disk is. We can use the rook-ceph kubectl plugin to look at the Ceph layer, and we can run the rbd du command, in this case on cluster DR1. We see that we have an RBD image created for our VM; if we look inside the PVC, we will find this volume there. So if something bad happens to cluster DR1, we lose the running VM and the disk: this image will be gone, and we will have lost all the data. How can we prevent that? We want to protect this VM with Ramen. To do this, we must first tell OCM that Ramen is going to take over this VM and that OCM should not change it. We do this by annotating the placement with a special annotation, and at this point Ramen can take over. So how do we protect this with Ramen? We need to apply the Ramen resources. Basically it's one resource, a DRPC. The DRPC tells Ramen how to find the application, how to protect it, which PVCs should be protected, and what the policy is. We are not going to look inside now, you can check the Git repo later.
So, to protect the VM, we apply this kustomization, again on the hub, and then Ramen will do the right thing on the right cluster. Once we have done it, our VM is protected by Ramen, and we can watch it again. This time I'm watching also the VRG and VR resources. The VRG is the VolumeReplicationGroup; we have one such resource per protected application. And VolumeReplication is the resource that controls the volume replication for each PVC, so we have one of them for every PVC. Now both of them are ready and primary. Primary means this is the primary cluster, replicating data to the secondary cluster. So what does it mean that we replicate data to the other cluster? Look again at the RBD images. If you remember, we have seen that we have an RBD image on the primary cluster. Now run the same command on the secondary cluster, and this time I'm running it on context DR2. We find that we have an image on the primary cluster, and we have the same image on the secondary cluster. What's going on under the cover is that when Ramen enables volume replication, a secondary replica of the image is created on the secondary cluster, and the rbd-mirror daemon starts to replicate writes from the image on the primary cluster to the image on the secondary cluster. So if something bad happens to cluster DR1, we can use the secondary image to start the VM with the data from the time of the last replication. The last thing to show about the VM is that we have a logger inside, updating a log file every 10 seconds. We can access the log file using virtctl ssh. We just run this command to see the start of the log, and we see the line where the service was started, and then one line every 10 seconds. This will help us verify later, when we recover from a disaster, that we got the right data from the disk. So now we are ready for everything. Let's try to create a disaster. One thing that is easy to do on the laptop is to suspend cluster DR1.
If you remember, this is a libvirt VM, so we can just suspend it. Now everything running there has stopped. So let's try to access the VM again with virtctl ssh: let's try to tail the log and see if it works. Well, it does not seem to work, because of course we suspended the cluster, so nothing there is accessible. If we had an important service on this VM, our users would not be happy now. So how can we fix this? Adam, do we have any idea? I was hoping you would tell us. Yes, so because our VM is replicated, we can just fail over to the other cluster quickly. How do we fail over? If you remember, we installed the DRPC, so we can patch the DRPC: we set the action to failover and we set the failover cluster. Once we have done that, Ramen starts the failover, and we can start watching the failover on the other cluster. I'm running this on the DR2 cluster, because DR1 is not accessible. We see that we have a PVC, and it's pending. We have a VolumeReplicationGroup and a VolumeReplication, but the VolumeReplicationGroup is not primary yet; it will take a while until it is promoted. So while we wait for it, let's understand what's going on under the cover. The RBD image on the secondary cluster was a replica, pulling data from the image on the primary cluster. Ramen has to stop this replication and promote it to a primary image that can replicate to another cluster. Once this is done, the VRG will be marked as primary, and it should happen any second. At this point, Ramen will change the application placement. It just became primary. So now Ramen has changed the placement of the application, and OCM will see the change and redeploy the VM on the second cluster using the subscription. This should happen any second now. When OCM deploys the application, it will reuse the PVC that Ramen has restored and connected to the right RBD image. And it just happened: we see that the virt-launcher pod is running, the VM is up, and we have an IP address.
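What happens to the storage during this failover can be sketched as a tiny state machine (purely illustrative; this is not Ramen's or rbd-mirror's actual code): the secondary RBD image is a replica, and failover demotes the old primary, if it is even reachable, and promotes the replica so it can serve writes and replicate in the opposite direction.

```python
# Illustrative model of the promotion step described above.
class MirroredImage:
    def __init__(self, site, role):
        self.site = site
        self.role = role  # "primary" or "secondary"

def failover(images, to_site):
    """Promote the image on `to_site`; demote any other primary."""
    for img in images:
        if img.role == "primary":
            img.role = "secondary"   # old primary (if reachable) is demoted
    for img in images:
        if img.site == to_site:
            img.role = "primary"     # replica is promoted, serves writes now
    return images

imgs = [MirroredImage("dr1", "primary"), MirroredImage("dr2", "secondary")]
failover(imgs, "dr2")
print([(i.site, i.role) for i in imgs])
# → [('dr1', 'secondary'), ('dr2', 'primary')]
```

Once the DR2 image is primary, the VRG reports primary, Ramen flips the placement, and OCM redeploys the VM there, which is exactly the sequence shown in the demo.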
So if we had that important service on the VM, the service should be up again, and our users would be served. But how do we know that we got the right data? Maybe we just got a new application with an empty disk. Let's check the disk. Again, we can use the logger. We just dump the entire log using SSH; this time I'm connecting to cluster DR2. And we see all the logs from when the VM ran on cluster DR1, until we created the disaster, and we see the new logs from when the VM started to run on cluster DR2. Note that we have a gap here between the last line logged when the VM was running on DR1 and the first log line on DR2, which is about three minutes in this case. This gap depends on how fast we started the failover and detected that there was an issue with the cluster. So we had a short downtime, the VM is running, and we got all the data. Looks like a successful failover. So what's next? In a real cluster, you would try to fix the cluster and recover it; maybe you need to reinstall it. At that point you can relocate the VM, or maybe later, during some maintenance window, you would relocate the VM back to the primary cluster. In this demo, we are done, and it's time for questions. The first three questions will get an instant ramen. Go ahead. The question is, what about the IP address of the virtual machine? You were paying attention and noticed that it changed. So what would we suggest? Ramen does not do anything about the IP address. I think in a real application you would have some load balancer to make sure that you can switch from one cluster to the other cluster, probably using the DNS system, because you have a nice name in the DNS. But basically, you would do what virtctl does when you connect to the VM: you use the VM name and the namespace, and you have to find the actual address. Yes. Very nice demo. What do I need to run this at home? I don't have any cloud budget. Is 16G of RAM enough?
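The verification step above, checking that the restored disk carries the old logs with only a small gap, can be sketched in a few lines of Python (a hypothetical log format for illustration; the demo's logger simply writes one line every 10 seconds, so any gap much larger than that marks the failover window):

```python
from datetime import datetime

def failover_gap(log_lines):
    """Return the largest gap in seconds between consecutive log lines."""
    times = [datetime.fromisoformat(line.split(" ", 1)[0]) for line in log_lines]
    return max((b - a).total_seconds() for a, b in zip(times, times[1:]))

# Hypothetical log excerpt: last line on DR1, then first line after failover.
log = [
    "2024-02-04T16:00:00+00:00 logger started on dr1",
    "2024-02-04T16:00:10+00:00 tick",
    "2024-02-04T16:00:20+00:00 tick",  # last line before the disaster
    "2024-02-04T16:03:20+00:00 tick",  # first line after failover to dr2
    "2024-02-04T16:03:30+00:00 tick",
]
print(failover_gap(log))  # → 180.0, the ~3 minute gap seen in the demo
```

Seeing the pre-disaster lines at all proves the data was replicated; the size of the gap measures detection time plus failover time.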
So first, yes, you can try it at home, but you need a big enough laptop. I think for KubeVirt you need something like 32G, because we have two clusters with 8G each; maybe you can trim them down a little bit, but 16 will be too low. And do you need a Linux laptop? Yes, you need a Linux laptop. Fedora would be easier, because that is what we use, but it should work on anything. Next question. Can I use two laptops, one for the primary and one for disaster recovery, let's say an old laptop? So the question is, can we use different machines for this? You can, but it will be harder, because you need to set up the network between them. On the same laptop it's much easier, because minikube handles most of the stuff for you. If you use different machines, you need to make sure the clusters are accessible to each other, so it will be harder. I've got one over here. Yes. Is it required to use Ceph, or can you use another storage system? To repeat the question: do we have to use Ceph? Currently we work with Ceph, so it's optimized for Ceph and it works, and we have a complete tool that sets it up and configures it. If you want to use something else, we have support for basically any storage, but it doesn't work on plain Kubernetes yet, it's mainly used on OpenShift; it needs more work. Yes. When the primary site is down, is there any mechanism to prevent the virtual machine from starting there again by mistake, by itself? The question was, once the cluster is down, do we have any protection that the virtual machine will not start again on the same cluster? We don't need any protection at the Ramen level, because Ceph is protecting us.
If the same VM starts again, let's say we resume the cluster and the application is still running there, it will continue to run and try to write to the disk. That is fine, because that disk is not being replicated anywhere at this point; the image on the destination cluster is now the primary. Usually it will just fail or go into some error state, and in a real deployment, when Ramen detects that it should not be running, it will shut it down. So it's safe. There is one more question. Yes, just because it's the end of the day, just one more question. What happens when the hub that was controlling the two data centers goes down? The question was, what happens when the hub goes down? Very good question. In a real setup, in OpenShift, you have a hub recovery setup, so you actually need two hubs, one active hub and one passive hub, and there is a lot of setup that backs up the hub and restores it. But for testing it doesn't matter. And also, hopefully you're not running customer-visible or end-user-visible workloads on the hub, so if it goes down you can repair it and it won't be quite as urgent a disaster. Hopefully the other sites don't fail at the same time. All right, thanks everybody for coming. That was a good question.